From the logs of the check_trh process on flux I see these entries since it was started on April 9. For some reason the higher TRHs had some issues yesterday.
Times in MDT:
fgrep cycling /var/log/messages*
Apr 18 18:49:09 flux check_trh.sh: 300m temperature is 137.88 . Power cycling port 5
Apr 23 13:03:58 flux check_trh.sh: 300m temperature is 174.1 . Power cycling port 5
Apr 23 13:06:58 flux check_trh.sh: 300m temperature is 174.28 . Power cycling port 5
Apr 23 13:08:18 flux check_trh.sh: 200m temperature is 181.61 . Power cycling port 5
Apr 23 13:08:38 flux check_trh.sh: 300m temperature is 174.28 . Power cycling port 5
Apr 23 13:08:58 flux check_trh.sh: 200m temperature is 181.53 . Power cycling port 5
Apr 23 13:09:48 flux check_trh.sh: 300m temperature is 174.06 . Power cycling port 5
Apr 23 13:16:48 flux check_trh.sh: 200m temperature is 179.15 . Power cycling port 5
Apr 23 13:19:18 flux check_trh.sh: 250m temperature is 173.33 . Power cycling port 5
Apr 23 13:30:38 flux check_trh.sh: 200m temperature is 177.04 . Power cycling port 5
Apr 23 13:48:38 flux check_trh.sh: 250m temperature is 171.63 . Power cycling port 5
Apr 23 13:50:48 flux check_trh.sh: 250m temperature is 173.08 . Power cycling port 5
Yesterday (April 23) I reworked things so that the check script runs on each DSM, including the bao station. The only entries since then are from 300m. The times below are in UTC; subtracting 6 hours puts them at 13:27-13:29 MDT.
Times in UTC:
ssh 300m fgrep cycling /var/log/isfs/dsm.log
Apr 23 19:27:33 300m root: temperature is -62.52 . Power cycling port 5
Apr 23 19:28:25 300m root: temperature is -62.52 . Power cycling port 5
Apr 23 19:29:41 300m root: temperature is -62.52 . Power cycling port 5
For example, here is the hiccup from 200m starting at 19:30:22 UTC. After the first power cycle things look good for 5 seconds, then the sensor reports a bad temperature of 89.92 at 19:30:50.1491, is power cycled again, and works after that.
data_dump -i 4,20 -A 200m_20150423_160000.dat | more
...
2015 04 23 19:30:17.3598 1.001 37 TRH30 15.13 27.28 34 0 1377 56 107\r\n
2015 04 23 19:30:18.3691 1.009 37 TRH30 15.09 27.28 33 0 1376 56 105\r\n
2015 04 23 19:30:19.3691 1 37 TRH30 15.13 27.28 34 0 1377 56 108\r\n
2015 04 23 19:30:20.3692 1 37 TRH30 15.09 27.28 33 0 1376 56 103\r\n
2015 04 23 19:30:21.3790 1.01 37 TRH30 15.09 27.28 34 0 1376 56 108\r\n
2015 04 23 19:30:22.6191 1.24 40 TRH30 177.00 260.02 36 0 5510 886 112\r\n
2015 04 23 19:30:23.6290 1.01 40 TRH30 177.00 260.18 35 0 5510 885 109\r\n
2015 04 23 19:30:24.6290 1 40 TRH30 177.04 260.21 34 0 5511 885 106\r\n
2015 04 23 19:30:25.6398 1.011 40 TRH30 177.08 260.40 33 0 5512 884 105\r\n
...
2015 04 23 19:30:37.6898 1.001 40 TRH30 177.26 260.19 32 0 5517 886 102\r\n
2015 04 23 19:30:38.6900 1 40 TRH30 177.23 260.33 34 0 5516 885 108\r\n
2015 04 23 19:30:39.6991 1.009 38 TRH30 177.30 260.87 5 0 5518 882 16\r\n
2015 04 23 19:30:43.7398 4.041 2 \n
2015 04 23 19:30:43.7408 0.001042 80 \r Sensor ID30 I2C ADD: 11 data rate: 1 (secs) fan(0) max current: 80 (ma)\n
2015 04 23 19:30:43.8292 0.08842 44 \rresolution: 12 bits 1 sec MOTE: off\r\n
2015 04 23 19:30:43.8806 0.05133 28 calibration coefficients:\r\n
2015 04 23 19:30:43.9098 0.02924 21 Ta0 = -4.129395E+1\r\n
2015 04 23 19:30:43.9398 0.02995 21 Ta1 = 4.143320E-2\r\n
2015 04 23 19:30:43.9691 0.02937 21 Ta2 = -3.293163E-7\r\n
2015 04 23 19:30:43.9899 0.02073 21 Ha0 = -7.786594E+0\r\n
2015 04 23 19:30:44.0191 0.02922 21 Ha1 = 6.188832E-1\r\n
2015 04 23 19:30:44.0449 0.02582 21 Ha2 = -5.069766E-4\r\n
2015 04 23 19:30:44.0691 0.02418 21 Ha3 = 9.665616E-2\r\n
2015 04 23 19:30:44.0991 0.03 21 Ha4 = 6.398342E-4\r\n
2015 04 23 19:30:44.1191 0.02001 21 Fa0 = 3.222650E-1\r\n
2015 04 23 19:30:45.1098 0.9907 37 TRH30 15.17 26.14 32 0 1378 54 102\r\n
2015 04 23 19:30:46.1191 1.009 37 TRH30 15.17 26.14 33 0 1378 54 103\r\n
2015 04 23 19:30:47.1291 1.01 37 TRH30 15.17 26.14 34 0 1378 54 108\r\n
2015 04 23 19:30:48.1290 0.9999 37 TRH30 15.17 26.14 32 0 1378 54 101\r\n
2015 04 23 19:30:49.1390 1.01 37 TRH30 15.17 26.14 33 0 1378 54 105\r\n
2015 04 23 19:30:50.1491 1.01 32 TRH30 89.92 0.90 0 0 3251 0 0\r\n
2015 04 23 19:30:53.5790 3.43 2 \n
2015 04 23 19:30:53.5801 0.001042 80 \r Sensor ID30 I2C ADD: 11 data rate: 1 (secs) fan(0) max current: 80 (ma)\n
2015 04 23 19:30:53.6699 0.08981 44 \rresolution: 12 bits 1 sec MOTE: off\r\n
2015 04 23 19:30:53.7213 0.05139 28 calibration coefficients:\r\n
2015 04 23 19:30:53.7491 0.0278 21 Ta0 = -4.129395E+1\r\n
2015 04 23 19:30:53.7790 0.02995 21 Ta1 = 4.143320E-2\r\n
2015 04 23 19:30:53.8083 0.02925 21 Ta2 = -3.293163E-7\r\n
2015 04 23 19:30:53.8290 0.02075 21 Ha0 = -7.786594E+0\r\n
2015 04 23 19:30:53.8601 0.03103 21 Ha1 = 6.188832E-1\r\n
2015 04 23 19:30:53.8898 0.02971 21 Ha2 = -5.069766E-4\r\n
2015 04 23 19:30:53.9108 0.02107 21 Ha3 = 9.665616E-2\r\n
2015 04 23 19:30:53.9398 0.02892 21 Ha4 = 6.398342E-4\r\n
2015 04 23 19:30:53.9691 0.02932 21 Fa0 = 3.222650E-1\r\n
2015 04 23 19:30:54.9590 0.99 37 TRH30 15.17 26.14 34 0 1378 54 107\r\n
2015 04 23 19:30:55.9598 1.001 37 TRH30 15.17 26.14 33 0 1378 54 103\r\n
2015 04 23 19:30:56.9691 1.009 37 TRH30 15.21 26.15 34 0 1379 54 108\r\n
2015 04 23 19:30:57.9691 1 37 TRH30 15.17 26.14 33 0 1378 54 103\r\n
Notice the delta-T column after the datetime. I've looked at a few of these, and I think there is always a larger delta-T (in this case 1.24 sec instead of 1.0) at the time of the initial bad data, in case that helps in debugging.
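One way to check the delta-T observation over a longer stretch of data is to scan the ASCII data_dump output for TRH records whose delta-T departs from the nominal 1 second. The following is only a sketch: the function name flag_late_samples, the 0.1 sec tolerance, and the assumption that the columns are laid out exactly as in the dump above (datetime, delta-T, length, TRH id, temperature, RH) are illustrative choices, not part of check_trh.sh or data_dump.

```python
import re

# Assumed record layout, taken from the dump above:
# "YYYY MM DD HH:MM:SS.ffff  deltaT  len  TRHnn  T  RH ..."
RECORD = re.compile(
    r"^(\d{4} \d{2} \d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"  # datetime
    r"([\d.]+)\s+\d+\s+"                                # delta-T, record length
    r"TRH\d+\s+(-?[\d.]+)\s+(-?[\d.]+)"                 # temperature, RH
)

def flag_late_samples(lines, nominal=1.0, tolerance=0.1):
    """Return (datetime, delta_t, temp, rh) for TRH records with unusual delta-T.

    Boot messages and calibration lines do not match RECORD and are skipped.
    The nominal and tolerance values are assumptions for illustration.
    """
    flagged = []
    for line in lines:
        m = RECORD.match(line.strip())
        if m is None:
            continue
        delta_t = float(m.group(2))
        if abs(delta_t - nominal) > tolerance:
            flagged.append(
                (m.group(1), delta_t, float(m.group(3)), float(m.group(4)))
            )
    return flagged
```

Fed the dump above, this would flag the 19:30:22.6191 record (delta-T 1.24). It would not flag the second bad sample at 19:30:50.1491, whose delta-T is 1.01, so a delta-T test alone would only catch the initial bad data.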
9am, Apr 25: Some more glitches since yesterday. Notice again that the problems in different sensors seem to occur at approximately the same time:
ck_trh
200m
Apr 24 20:56:05 200m root: temperature is 170.15 . Power cycling port 5
Apr 24 20:57:10 200m root: temperature is 170.22 . Power cycling port 5
300m
Apr 24 20:51:47 300m root: temperature is -62.52 . Power cycling port 5
Apr 24 20:56:52 300m root: temperature is -62.52 . Power cycling port 5
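To look at the apparent coincidence between sensors more systematically, the power-cycle entries from the per-DSM logs could be merged and grouped by time. This is a rough sketch under several assumptions: the function names parse_events and coincident_clusters are hypothetical, the year 2015 is supplied by hand because syslog lines omit it, the 120 second grouping window is arbitrary, and the line format is assumed to match the DSM log entries above (host name in the fourth field).

```python
import re
from datetime import datetime, timedelta

# Assumed line format, from the DSM logs above:
# "Apr 24 20:56:05 200m root: temperature is 170.15 . Power cycling port 5"
EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) .*temperature is (-?[\d.]+)"
)

def parse_events(lines, year=2015):
    """Return sorted (time, host, temperature) tuples; year is an assumption."""
    events = []
    for line in lines:
        m = EVENT.match(line.strip())
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2), float(m.group(3))))
    return sorted(events)

def coincident_clusters(events, window=timedelta(seconds=120)):
    """Group time-sorted events separated by at most `window`,
    keeping only groups that involve more than one host."""
    clusters, current = [], []
    for ev in events:
        if current and ev[0] - current[-1][0] > window:
            clusters.append(current)
            current = []
        current.append(ev)
    if current:
        clusters.append(current)
    return [c for c in clusters if len({host for _, host, _ in c}) > 1]
```

With the four Apr 24 entries above and the assumed 120 second window, the 20:51:47 event on 300m stands alone, while the 20:56:05, 20:56:52 and 20:57:10 events on 200m and 300m fall into one group.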