There were many failing nagios checks on iss3, but I think the underlying problem was that the catalog database was not updating. My theory is that the ceilometer PC changed the mtime property on many of the camera images, rsync copied those mtime changes to the data manager, and that caused the catalog to think all those images had changed and needed to be rescanned. There are about 32,000 camera images, and it looks like about 20,000 had to be rescanned.
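To illustrate the theory, here is a minimal sketch of how a catalog that keys on file size and mtime would end up flagging an image for rescan even though its contents never changed. This is not the actual catalog code; the function names `snapshot` and `files_needing_rescan` and the dictionary layout are hypothetical, made up only for this example.

```python
from pathlib import Path


def snapshot(root):
    """Record (size, mtime) for every file under root, keyed by relative path.

    Hypothetical stand-in for whatever the real catalog stores per image.
    """
    catalog = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            catalog[str(path.relative_to(root))] = (st.st_size, int(st.st_mtime))
    return catalog


def files_needing_rescan(root, catalog):
    """Return relative paths whose current (size, mtime) differs from the catalog."""
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        st = path.stat()
        key = str(path.relative_to(root))
        # A changed mtime alone is enough to flag the file,
        # even when the size and the bytes are identical.
        if catalog.get(key) != (st.st_size, int(st.st_mtime)):
            changed.append(key)
    return changed
```

If the catalog works anything like this, then resetting the mtime on 20,000 files looks exactly the same as 20,000 new images arriving at once, which would explain the long rescan.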
After finally completing the rescan, subsequent rescans are now much faster again, and the only checks failing now are for the ceilometer and allsky camera. So there is still an issue with the ceilometer PC. Right now rsync connections to it are failing.
6 Comments
John Sobtzak
When did the errors show up in nagios for ISS 3? I was checking it all day and didn't see any errors as of last evening. Am I not checking the right nagios thing?
Gary Granger AUTHOR
Yeah, good question, I should have noted that. It looks like the camera image downloads started getting flaky around 8 pm PT, but nagios was flipping back and forth between green and red until about 9:40 pm PT, when it went solid red. The other camera image categories and the ceilometer data coming from the ceilometer PC look similar. I'm sure you're checking the right nagios (http://iss3-field.dyndns.org/nagios/); it's probably just that things were in and out. I noticed because I was looking at the water vapor problem on iss2 and decided to check the other sites as well.
Gary Granger AUTHOR
I wish I had some idea what was going on with this ceilometer PC and rsync, but all I can do is guess. The mtime on the images is now in 2006, so that seems very peculiar. Maybe the system time just got out of whack and that has messed up the software or the rsync server. Or maybe there are issues with the disk or filesystem. If the Internet PDU is out there somewhere, maybe we should plug the ceilometer PC into it so we can at least power cycle it remotely. But then we'd still need to get VNC working on it to be able to connect and manually restart the ceilometer software after the reboot....so maybe it's still easiest to just keep manually rebooting it occasionally. Have I painted a subtle enough picture about how obnoxious this PC is, or should I be more blunt?
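For what it's worth, a rough way to count how many images picked up a bogus mtime might be something like the sketch below. It only assumes that any real image must be newer than some cutoff date; the function name `files_older_than` and the cutoff you pass in are hypothetical, not part of any existing ISS script.

```python
from datetime import datetime, timezone
from pathlib import Path


def files_older_than(root, cutoff):
    """Return relative paths under root whose mtime is earlier than cutoff.

    cutoff is a timezone-aware datetime, chosen (as an assumption) to be
    earlier than any image the camera could really have produced.
    """
    cutoff_ts = cutoff.timestamp()
    return sorted(
        str(p.relative_to(root))
        for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff_ts
    )
```

Running this against the image directory on the data manager with a cutoff a little before the deployment started would at least show whether the 2006 timestamps cover the same set of files that got rescanned.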
John Sobtzak
Do you need me to go out there and manually reboot the ceilometer computer? I can head out there at 1 PM PDT, after a telecon I have.
John Sobtzak
Laura was out at ISS3 today and upon arrival saw this message on the profiler computer screen:
It appeared that the hard drive had filled up, or something similar. Bill had Laura empty out the C:/temp directory, reboot the machine, and restart the ceilometer software. The errors in Nagios have cleared and remained so for the rest of the day.
William Brown
To my surprise, I found that I can connect to the ceilometer computer via vncviewer, so if it reboots, I can hopefully connect and restart the processes remotely.