NetCDF files at EOL hadn't been updated since this morning.
data_stats sock:barolo showed no data.
nidas_udp_relay was running on eol-rt-data, though it had been restarted this morning at 09:50 MDT. The restart wasn't due to a reboot; eol-rt-data has been up for 50 days.
On eol-rt-data, "data_stats sock::30010" showed data coming in. Or, from any system at EOL, you can run:
data_stats sock:eol-rt-data.fl-ext.ucar.edu:30010
These errors started showing up in /var/log/isfs/isfs.log, every 10 seconds:
Sep 25 09:41:10 barolo dsm_server[44405]: ERROR|SocketConnectionThread: IOException: inet:128.117.188.122:30010: connect: Connection refused
Eventually the socket open succeeded, but then this warning appeared:
Sep 25 09:50:11 barolo dsm_server[44405]: WARNING|SampleInputStream: inet:128.117.188.122:30010: raw sample not of type char(0): #bad=1,filepos=0,id=(609,25461),type=28,len=779247971
As with reading disk data, the reader skips forward one byte and looks for a good sample.
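The skip-forward recovery can be sketched roughly as follows. This is a minimal Python illustration, not the NIDAS code: the 16-byte header layout (8-byte time tag, 4-byte length, 4-byte id with the sample type in its top 6 bits), the little-endian byte order, the MAX_LEN limit, and the rejection of zero-length samples are all assumptions made for the example.

```python
import struct

HEADER = struct.Struct("<qII")   # assumed layout: time tag, data length, id
CHAR_TYPE = 0                    # raw samples are expected to be type char(0)
MAX_LEN = 32768                  # hypothetical sanity limit on sample length


def header_ok(buf, pos):
    """Return the data length if a plausible raw-sample header starts at pos, else None."""
    if pos + HEADER.size > len(buf):
        return None
    _tt, length, raw_id = HEADER.unpack_from(buf, pos)
    sample_type = raw_id >> 26   # assumed: type lives in the top 6 bits of the id
    if sample_type != CHAR_TYPE or length == 0 or length > MAX_LEN:
        return None
    return length


def read_samples(buf):
    """Yield (offset, payload) for each good sample, skipping one byte at a time past bad data."""
    pos = 0
    while pos + HEADER.size <= len(buf):
        length = header_ok(buf, pos)
        if length is None or pos + HEADER.size + length > len(buf):
            pos += 1             # bad header: slide forward one byte and try again
            continue
        start = pos + HEADER.size
        yield pos, buf[start:start + length]
        pos = start + length
```

A header check this loose can lock onto garbage that happens to look valid, so a reader that never finds a run of good samples (as seemed to happen here) may be a sign that the stream itself is not what the reader expects.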
Not sure why the data was corrupt, or why the reader didn't recover. It would be good to look at the logs on eol-rt-data.
Did a kill -TERM of dsm_server on barolo, and ran check_vertex_procs.sh by hand, rather than waiting for crontab.
Updated crontab to check the procs every 15 minutes, rather than 30.
2 Comments
Steve Oncley
Part of this mess was me. I tried to kill dsm_server on eddy while I was flailing away at getting /scr/isfs mounted. (I thought it wouldn't be nice to unmount this disk while raw_data files were being written to it.) I wanted to stop dsm_server, not restart it, so I didn't use the script and just used kill. However, dsm_server restarted itself immediately. I ended up doing a bunch of other stuff rather than unmount the disk.
I guess I should mention that, before I worked on it this morning, /scr/isfs (where both VERTEX/netcdf and /raw_data are) had been offline with I/O errors (that I <may> have fixed) for at least a day.
Gordon Maclean AUTHOR
Yeah, systemd restarts dsm_server if it exits. Perhaps there's a way to configure it so that it won't restart it after a specific signal, such as TERM. Looks like the setting is "Restart=on-abnormal" instead of "on-failure".
To stop it so that systemd won't restart it:
systemctl --user stop dsm_server
To start it again, substitute "start" for "stop".
Since it's a user process, you don't need sudo.
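For reference, the difference between the Restart= values can be written down as a small lookup. The sketch below transcribes the restart table from the systemd.service man page into Python; the category names ("clean", "unclean-signal", and so on) and the helper functions are made up for the illustration, and it ignores overrides such as SuccessExitStatus= and RestartPreventExitStatus=. How dsm_server actually exits after a TERM is not stated here, so the exit code in the usage lines is an assumption.

```python
# Exit causes that trigger a restart for each Restart= value,
# transcribed from the table in the systemd.service man page.
RESTART_ON = {
    "no": set(),
    "always": {"clean", "unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-success": {"clean"},
    "on-failure": {"unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
}

# Signals that systemd treats as a clean termination by default.
CLEAN_SIGNALS = {"SIGHUP", "SIGINT", "SIGTERM", "SIGPIPE"}


def classify(exit_code=None, signal=None):
    """Map how a process ended onto one of the table's exit causes."""
    if signal is not None:
        return "clean" if signal in CLEAN_SIGNALS else "unclean-signal"
    return "clean" if exit_code == 0 else "unclean-exit-code"


def will_restart(setting, exit_code=None, signal=None):
    """Return True if a service with this Restart= value would be restarted."""
    return classify(exit_code, signal) in RESTART_ON[setting]


# A process killed outright by TERM counts as a clean exit under both settings.
print(will_restart("on-failure", signal="SIGTERM"))
# A process that catches TERM and exits non-zero (assumed here) is restarted
# under on-failure but not under on-abnormal.
print(will_restart("on-failure", exit_code=1))
print(will_restart("on-abnormal", exit_code=1))
```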