11:45 Low lost communication with its Emerald boards.
Rebooted low and everything is back up.
Information for Gordon:
root@low root# irqs
Counting interrupts over 5 seconds ...
IRQ Interrupt Type Total Int Int/sec
------------------------------------------------------
24: GPIO-l eth0: 2 0.4
36: SC serial: 15 3
37: SC serial: 102 20.4
42: SC ost0: 508 101.6
114: GPIO isp116x-hcd:usb1: 290 58
115: GPIO serial: 226 45.2
116: GPIO serial: 103 20.6
End of dmesg:
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
i2c i2c-0: i2c_pxa: timeout waiting for bus free
handle_IRQ_event called 4 times for IRQ 3
handle_IRQ_event called 4 times for IRQ 3
root@low root#
root@low root#
Added by Gordon, Jun 26:
/var/log/isfs/kernel has those dmesg messages, with timetags
Jun 23 16:49:50 low kernel: i2c i2c-0: i2c_pxa: timeout waiting for bus free
Jun 23 16:49:53 low last message repeated 5 times
Jun 25 09:27:58 low kernel: handle_IRQ_event called 4 times for IRQ 3
Jun 25 17:46:14 low kernel: handle_IRQ_event called 4 times for IRQ 3
Those were the only messages before the reboot, and they occurred at least 23 hours earlier, which means the problem was not caused by a kernel oops or any other atypical event that the kernel could detect. It is just the good ol' situation where there seems to be a very small but finite possibility that a PC104 interrupt can be missed and not retriggered, even though the PC104 IRQ line is high, so that the interrupt handler is never called again.
I believe restarting the dsm process with a ddn/dup, which closes and re-opens the serial ports, will bring it back too.
I just updated the XML on the low DSM so that every sensor has a timeout. The dsm process should then close and reopen each port after detecting the timeout, which should also help to recover from this situation more quickly.
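For illustration only, the timeout-then-reopen idea can be sketched in a few lines of Python. This is not the dsm code: the class name ReopeningPort, the opener callable and the reopens counter are all made up for this note, and it assumes a POSIX system where select() works on the descriptor.

```python
import os
import select


class ReopeningPort:
    """Hypothetical wrapper: close and reopen a port when no data arrives in time."""

    def __init__(self, opener, timeout_s):
        self.opener = opener        # callable that returns an open file descriptor
        self.timeout_s = timeout_s  # seconds to wait for data before reopening
        self.fd = opener()
        self.reopens = 0            # number of timeouts that forced a reopen

    def read(self, nbytes=256):
        """Return the bytes read, or b"" if the read timed out and the port was reopened."""
        ready, _, _ = select.select([self.fd], [], [], self.timeout_s)
        if ready:
            return os.read(self.fd, nbytes)
        # No data within the timeout: close the descriptor and open a fresh one.
        os.close(self.fd)
        self.fd = self.opener()
        self.reopens += 1
        return b""
```

For a real serial port the opener would be something like lambda: os.open(path, os.O_RDONLY | os.O_NOCTTY), with path set to the sensor's device node. Whether a reopen actually restores a stuck PC104 interrupt is only my belief at this point, as noted above.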
Seems that I need to install a PC104 interrupt watchdog module. There is some indication this has happened on the aircraft, also quite infrequently. A test is being set up out at RAF.
When the PC104 interrupts are being handled, the irqs listing looks like so, showing 275 interrupts/sec from the Emerald cards:
root@low root# irqs
Counting interrupts over 5 seconds ...
IRQ Interrupt Type Total Int Int/sec
------------------------------------------------------
3: ISA serial: 1376 275.2
24: GPIO-l eth0: 62 12.4
25: GPIO-l GPIO1-PC104: 1376 275.2
36: SC serial: 15 3
37: SC serial: 101 20.2
42: SC ost0: 509 101.8
114: GPIO isp116x-hcd:usb1: 90 18
115: GPIO serial: 228 45.6
116: GPIO serial: 102 20.4
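Comparing the two listings, IRQ 3 and IRQ 25 stop counting when the problem occurs. Until a kernel watchdog module exists, a user-space check along these lines could at least flag the condition. This is a rough sketch, not the irqs script and not the planned module. It assumes /proc/interrupts has the usual layout (an "N:" label followed by one count per CPU), and the function names are invented for this note.

```python
import time


def parse_interrupts(text):
    """Return {irq_number: total_count} from /proc/interrupts-style text."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue                      # header line such as "CPU0"
        irq = fields[0][:-1]
        if not irq.isdigit():
            continue                      # named rows such as "Err:"
        total = 0
        for tok in fields[1:]:
            if not tok.isdigit():
                break                     # reached the controller/device names
            total += int(tok)             # sum the per-CPU counts
        counts[int(irq)] = total
    return counts


def stalled_irqs(before, after, watched):
    """Return the watched IRQs whose count did not advance between snapshots."""
    return [irq for irq in watched
            if after.get(irq, 0) - before.get(irq, 0) <= 0]


def check_once(watched=(3, 25), interval_s=5.0, path="/proc/interrupts"):
    """Take two snapshots interval_s apart and return the stalled IRQs."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return stalled_irqs(before, after, watched)
```

Calling check_once() from cron or a small loop and logging any non-empty result would timestamp the failure. It only detects the stall; it does not retrigger the interrupt.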