After seeing Jacquie's blog post yesterday about missing data, I did some troubleshooting (see comment on Jacquie's post) but only succeeded in taking the DSM off the net entirely. I went out to Marshall to reboot it twice, and each time it came up briefly but then went down again. I went back for a longer visit in the afternoon with Gary (thanks Gary!) to do more troubleshooting.
The DSM always came up after reboot, but wouldn't stay up very long. When we were logged in over the console, we could see kernel error messages before it went down. In one case we saw that nidas was running correctly just after rebooting, but it eventually stopped (we couldn't find anything relevant in the logs) and had to be restarted with dup. This seems to be the same problem Jacquie noted originally. We also sometimes saw I/O error messages when trying to access the USB stick while starting nidas or using the lsu command.
We did some testing of the pio command, and it seems that using pio -v to see the current state sometimes turns power off to some or all of the ports. pio can turn off power to bank1 (powering the switch) and 28v (powering the ubiquiti), either of which would take the DSM off the net when you're logged in remotely. Until we can fix the pio -v behavior, we have unplugged the cable between the autoconfig board and the power panel, so using pio -v shouldn't be able to turn off power to bank1 or 28v.
We also noticed that it was taking a very long time for the DSM to get the correct time after reboot (sometimes around half an hour). It turns out that pps0 wasn't giving any data; the PPS data was showing up on pps1 instead. We tried modifying chrony.conf and disconnecting the PPS, but neither seemed to do much (Gary, you can add stuff if you remember more of the details).
Eventually we decided that the DSM wasn't reliable enough to leave up because of all the kernel errors, so we replaced the Pi with one from a spare DSM we had brought. That worked, and Gary made some updates remotely:
I had to install xinetd on the DSM to get check_mk to work from snoopy, and then nagios noticed that the USB filesystem was missing. So I decided to upgrade the whole DSM before rebooting it, and it turns out there were quite a few updates. The USB came back and has not failed all night, but it's good that it's getting checked now.
I also modified the chrony.conf to add the iburst option to the tardis server line. That makes chrony sync much faster to the network NTP server, and it makes sense to rely on that since the PPS is not working.
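As a rough illustration of that change, the sketch below appends iburst to a server line in chrony.conf-style text. The helper name add_iburst and the sample config are hypothetical, and the real file on the DSM may look different; only the server directive and its iburst option are standard chrony syntax.

```python
def add_iburst(conf_text: str, host: str) -> str:
    """Return conf_text with 'iburst' appended to the server line for host."""
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        # Match only uncommented "server <host> ..." lines that lack iburst.
        is_target = len(fields) >= 2 and fields[0] == "server" and fields[1] == host
        if is_target and "iburst" not in fields[2:]:
            line = line.rstrip() + " iburst"
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")


# Hypothetical sample; the actual chrony.conf on the DSM is not shown in this post.
sample = "# network NTP server\nserver tardis\n"
print(add_iburst(sample, "tardis"), end="")
```

The helper leaves commented-out lines and other servers alone, and running it twice does not add the option a second time.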
Finally, I discovered a bug with data_stats and the dashboard, where data_stats will just keep counting up from 2010 if the system time was wrong when data_stats started. The quick workaround was to restart the json_data_stats service after the system time was correct.
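One way to avoid that kind of startup race, sketched here purely as an illustration and not as how json_data_stats actually works, is to hold off starting until the system clock looks plausible. The cutoff date, function names, and polling parameters below are all assumptions.

```python
import time
from datetime import datetime, timezone

# Assumed cutoff: any reading before this is treated as "clock not yet synced".
# The value is illustrative and not taken from the DSM software.
EARLIEST_PLAUSIBLE = datetime(2020, 1, 1, tzinfo=timezone.utc)


def clock_is_plausible(now: datetime) -> bool:
    """True if the given time is at or after the assumed cutoff."""
    return now >= EARLIEST_PLAUSIBLE


def wait_for_plausible_clock(get_now, sleep=time.sleep,
                             poll_seconds=5.0, max_polls=120) -> bool:
    """Poll get_now() until the clock looks plausible or max_polls is reached."""
    for _ in range(max_polls):
        if clock_is_plausible(get_now()):
            return True
        sleep(poll_seconds)
    return False


# Single check against the real clock; a service would call
# wait_for_plausible_clock(...) before starting to accumulate statistics.
print("clock plausible:", clock_is_plausible(datetime.now(timezone.utc)))
```

Passing the clock and sleep functions in as arguments keeps the wait loop testable without touching the real system time.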
The DSM went down again at 8:20 this morning. Dan is at Marshall this morning and rebooted the DSM around 11:30, which brought it back on the net. I looked through the logs and it seems like this time the DSM was collecting data while it was off the net, so it may have just been an ethernet issue. The DSM is back off the net now, though, about 10 minutes after Dan rebooted it.
Since the DSM clearly still isn't reliable, I'm thinking the next step would be setting up a whole new DSM + SD card as tt when I'm in the office tomorrow...
3 Comments
Gary Granger
Now that nagios is monitoring both tt and ttstation, I am perplexed that both have gone down at the same times. The power relay for ttstation is not connected, so even if the Pi hangs or the ethernet interface drops away, ttstation should stay powered up and keep responding to pings. I think the only way ttstation stops responding to pings is if the DSM power is turned off, but from what I've heard no one has been turning the DSM off and leaving it off, only rebooting it. Below are screenshots of ttstation and tt from nagios. Is it possible the power is unstable?
Isabel Suhr AUTHOR
Currently (1:57 Friday afternoon) nagios is showing ttstation as down but I can ping it and browse to the web interface just fine. So I'm not sure the nagios correlation between ttstation being down and tt being down is actually happening...
(tt is down right now, though)
Gary Granger
So disregard the comments above about ttstation. The IP address for ttstation was wrong in nagios. That's been fixed, and now ttstation shows up. I should have thought to check that when the tt and ttstation downtimes correlated so closely!