Blog from September, 2011

Monitoring the accuracy of the system clock on a real-time data acquisition system provides useful information about the performance of the system. Hence this long discussion.

Data System Clock, NTP and GPS

The data system at the Manitou Forest Observatory (aka, the DSM) uses a GPS receiver and the NTP (Network Time Protocol) software to set the system clock, which, in addition to the normal uses of a system clock, is used to time-tag the data samples.

The serial messages from the GPS are received on serial port 3, /dev/ttyS3. The pulse-per-second square-wave signal (PPS) from the GPS is also connected to the DCD line of that serial port. A patch has been added to the Linux kernel on the data system so that an interrupt function can be registered to run in response to the DCD interrupts. This interrupt function will be called immediately after the rising edge of the PPS signal has been detected by the serial port hardware.

The NTP software on the DSM runs a reference clock driver for a Generic NMEA GPS Receiver, with PPS. This driver reads the 1 second GPS RMC records from the serial port, and registers a function to be run on receipt of the PPS interrupt. NTP then uses these two sets of information to create a GPS reference clock. NTP then monitors the state of the GPS reference clock and the system clock, and makes gradual adjustments to the system clock to bring it to close agreement with the GPS clock.

The RMC records contain the current date and time, in addition to latitude, longitude, and other quantities. The transmission time of the RMC message is not tightly controlled within the GPS. It appears to be primarily affected by lags associated with internal GPS processing, and is likely also affected by which other NMEA messages are enabled for output on the GPS. The exact receipt time of the RMC message is not used for clock adjustments. NTP simply uses the time fields within the RMC message as an absolute time label for the previous PPS, whose timing is very precise.
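
To make the labelling step concrete, here is a minimal Python sketch of turning the time and date fields of an RMC sentence into a UTC timestamp. It is not the NTP driver's code. The function name rmc_to_datetime and the sample sentence are made up for illustration, and the sketch assumes the standard RMC layout (hhmmss in field 1, ddmmyy in field 9) and that two-digit years fall in 2000-2099.

```python
from datetime import datetime, timezone


def rmc_to_datetime(sentence: str) -> datetime:
    """Return the UTC time carried in the time/date fields of an RMC sentence."""
    # Drop the leading '$' and any trailing '*hh' checksum.
    body = sentence.strip().lstrip("$").split("*")[0]
    fields = body.split(",")
    if not fields[0].endswith("RMC"):
        raise ValueError("not an RMC sentence")
    hhmmss, ddmmyy = fields[1], fields[9]
    # Assumption: two-digit years belong to 2000-2099.
    return datetime(
        2000 + int(ddmmyy[4:6]), int(ddmmyy[2:4]), int(ddmmyy[0:2]),
        int(hhmmss[0:2]), int(hhmmss[2:4]), int(hhmmss[4:6]),
        tzinfo=timezone.utc,
    )


# Hypothetical sentence: position, speed and course fields left empty, no checksum.
sample = "$" + ",".join(
    ["GPRMC", "164139", "A", "", "", "", "", "", "", "120411", "", ""]
)
pps_label = rmc_to_datetime(sample)  # time label for the preceding PPS edge
print(pps_label.isoformat())
```

The point of the sketch is that only the fields inside the message matter. When the message arrives, within the second after the PPS, does not enter the calculation.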

Clock Variables

We monitor the following variables to keep track of the DSM timekeeping, and plot them on the daily web plots:

  • GPSdiff: The time difference, in seconds, between the time-tag that was assigned to an RMC message and the date and time contained within the message. The time-tag assigned to a message sample is the value of the system clock at the moment the first byte of the message was received. For example, a value of 0.6 sec means that the data system assigned a time-tag to the RMC message that was 0.6 seconds later than the time value contained within the message. As discussed above, GPSdiff is not a precise measurement of clock differences and is not used to adjust the system clock. It gives a crude indication of the agreement of the clocks and of possible effects of I/O latency and buffering in the data system. When 5 minute statistics are computed, the maximum and minimum values of GPSdiff for each 5 minute period are written to the output NetCDF files as GPSdiff_max and GPSdiff_min.
  • GPSnsat: number of satellites being tracked by the receiver, that is, the number of satellites whose signals are used in its time and location solution. GPSnsat in the NetCDF files and plots is a 5 minute mean.
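
The GPSdiff definition and its 5 minute extremes can be sketched as follows. This is only an illustration in Python, not the code that produces the NetCDF files. The function names and the sample values are made up, and it assumes that each 5 minute period is identified by its start time.

```python
from datetime import datetime, timedelta, timezone


def gpsdiff_seconds(time_tag: datetime, message_time: datetime) -> float:
    """GPSdiff: system time-tag minus the time contained in the RMC message."""
    return (time_tag - message_time).total_seconds()


def five_minute_extremes(samples):
    """Group (time_tag, gpsdiff) pairs by 5 minute period.

    Returns {period_start: (min, max)}.
    """
    periods = {}
    for time_tag, diff in samples:
        start = time_tag.replace(
            minute=time_tag.minute - time_tag.minute % 5, second=0, microsecond=0
        )
        lo, hi = periods.get(start, (diff, diff))
        periods[start] = (min(lo, diff), max(hi, diff))
    return periods


# Made-up lags, for illustration only.
t0 = datetime(2011, 4, 14, 12, 0, 0, tzinfo=timezone.utc)
samples = []
for i, lag in enumerate([0.6, 0.7, 0.5]):
    message_time = t0 + timedelta(seconds=i)         # time inside the RMC message
    time_tag = message_time + timedelta(seconds=lag)  # system clock at first byte
    samples.append((time_tag, gpsdiff_seconds(time_tag, message_time)))

extremes = five_minute_extremes(samples)
print(extremes[t0])  # (GPSdiff_min, GPSdiff_max) for the period starting at t0
```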

NTP on the DSM is configured to log its status in a "loopstats" file. See http://www.eecis.udel.edu/~mills/ntp/html/monopt.html for information on the NTP monitoring options. The loopstats file includes these variables, which have been merged into the Manitou data archive:

  • NTPClockOffset: the estimated offset of the GPS time from the data system time. A positive value indicates that NTP has determined that the GPS clock is ahead of the system clock, i.e. the GPS is showing a later time than the system clock. The maximum, minimum and mean values of NTPClockOffset in each 5 minute period are computed and written to the NetCDF files and plotted as NTPClockOffset_max, NTPClockOffset_min and NTPClockOffset.
  • NTPFreqOffset: the correction applied to the system clock frequency in parts-per-million. A positive value indicates that NTP has determined that the system clock oscillator is slow and the NTPFreqOffset PPM values are being added periodically to the system clock counter. The NetCDF files and plots contain 5 minute means of NTPFreqOffset.
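
To get a feel for the size of a frequency correction, parts-per-million can be converted to seconds of drift per day. The short Python sketch below uses a value of about 39.3 PPM, which is what the loopstats sample near the end of this post shows. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400.0


def ppm_to_seconds_per_day(ppm: float) -> float:
    """Drift that an uncorrected frequency error of `ppm` would accumulate in one day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


drift = ppm_to_seconds_per_day(39.3)
print(f"{drift:.2f} s/day")
```

In other words, without the correction a clock that is off by roughly 39 PPM would drift by a few seconds per day.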

The NTP logs have not been recorded consistently since the beginning of the project. Data from 2010 are available for May 3 to August 12 and October 14 to November 9, along with all data from April 9, 2011 onward.

Replacement of Garmin GPS

On April 12, 2011 the old Garmin GPS 25-HVS at the tower was replaced with a newer Garmin 18x-LVC model. The model numbers are shown in the $PGRMT messages in the archive, where the time is UTC:

data_dump -i 1,30 -A manitou_20110412_120000.bz2 | fgrep PGRMT
...
2011 04 12 16:41:39.6568    0.15      49 $PGRMT,GPS 25-HVS VER 2.50 ,P,P,R,R,P,,23,R*08\r\n
2011 04 12 16:42:50.4248  0.1249      51 $PGRMT,GPS 18x-LVC software ver. 3.10,,,,,,,,*6D\r\

Unexpectedly, the newer GPS provided much better time-keeping.

The following plot is for the old 25-HVS model for 3 days prior to the swap:

The NTPClockOffset shows spikes between -100000 and 50000 microseconds during this period, which is much worse than expected for a GPS/NTP reference clock. The spikes in NTPClockOffset are simultaneous with positive jumps in GPSdiff_max of up to 2.5 seconds. These events seem to happen when the number of tracked satellites changes, which suggests that internal processing lags in the 25-HVS cause it to report late, producing large values of GPSdiff. The extent of this effect on the timing of the PPS signal is unknown.

The following plot shows a close up of one of the clock offset spikes using un-averaged data:

The sudden downward jump in NTPClockOffset causes NTP to think that the GPS clock is earlier than the system clock. NTP starts to correct for the offset by slowing down the system clock, as seen in the negative values for NTPFreqOffset. When the GPS recovers from its delayed reporting, NTP then sees positive values for NTPClockOffset and adjusts the system clock ahead.

After installing the 18x-LVC, the NTPClockOffset is in a much improved range, from -70 to 25 microseconds. NTPFreqOffset is also in a much tighter range, indicating that NTP is applying smaller corrections to the system clock. GPSdiff is also much better behaved, ranging from 0.5 to 1.1 seconds. The number of satellites tracked by the new GPS is also generally higher.

Temperature Effects

The frequency offset shows a temperature dependence in the system clock oscillator. We do not measure the temperature inside the data system enclosure, which is at the base of the tower. The nearest temperature measurement is of the ambient air at 2 meters up the tower. The top panel in the plot below shows a time series of the air temperature, along with NTPFreqOffset, for a cool 3 day period in April, after the installation of the new GPS. It appears that when the air temperature is below 5 deg C, the system clock oscillator does not show an obvious temperature relation.

The bottom panel shows a close relationship between the NTPClockOffset and the time derivative of NTPFreqOffset, which, I believe, indicates how NTP adjusts the system clock based on the measured offset. It also reinforces the obvious conclusion that we could improve the time-keeping by insulating the CPU from temperature changes.
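
A comparison of this kind can be sketched in Python as a finite-difference derivative of the frequency series followed by a correlation with the offset series. The data below are synthetic: the frequency is built as a running sum of the offset, which is a simplified integrating loop chosen for illustration and not NTP's actual discipline algorithm. The gain value and all names are hypothetical.

```python
import math


def time_derivative(times, values):
    """Forward finite difference d(values)/dt, one value per consecutive pair."""
    return [
        (v1 - v0) / (t1 - t0)
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    ]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Synthetic series: 200 samples, 16 seconds apart.
dt = 16.0
times = [i * dt for i in range(200)]
offset_us = [20.0 * math.sin(2.0 * math.pi * t / 3600.0) for t in times]

gain = 1e-4  # hypothetical loop gain, PPM per (microsecond * second)
freq_ppm = [0.0]
for off in offset_us[:-1]:
    freq_ppm.append(freq_ppm[-1] + gain * off * dt)  # frequency integrates the offset

dfreq_dt = time_derivative(times, freq_ppm)
r = pearson(offset_us[:-1], dfreq_dt)
print(f"correlation = {r:.3f}")
```

With real data, the same two functions could be applied to the un-averaged NTPClockOffset and NTPFreqOffset series. A correlation well below 1 would suggest that the simple integrating picture is incomplete.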

During a warmer 3 day period in July, when the temperatures were all above 5 deg C, the temperature effect on the system clock oscillator is very evident.

Time Offsets During File Transfers

The periodic spikes in GPSdiff_max of up to 1 second, which occur at 23:00 local time and last about an hour, are simultaneous with the network transfer of the day's data files from the DSM to the RAL server. They suggest that increased sample buffering and latency occur at these times, which needs to be investigated and improved.

A close-up of the file transfer on April 14, 23:00 shows several events where NTPClockOffset first has a negative spike, indicating that NTP has determined that the GPS clock is behind the system clock and starts to slow down the system clock. These down spikes appear to be due to a delay in the response to a PPS interrupt. The interrupt latency appears to be short lived, because the NTPClockOffset becomes positive, and the system clock is re-adjusted. The April 14 transfer is shown in this plot:

In July, the clock behaviour during the file transfer is similar, but the initial increase in NTPClockOffset and a rising slope in NTPFreqOffset might be due to increased heating of the system clock oscillator, due to increased CPU load during the file transfers. Wild conjecture? After a quick scan of the web plots of 5 minute averages, I think these positive bumps in NTPFreqOffset seem to occur during file transfers when the outside air temperatures are above 0 C, and don't occur in colder conditions.

ppstest and ntpq

On the DSM, the ppstest program is helpful for gaining an understanding of the system and GPS clocks. It displays the system clock value at the moment the interrupt function is called, both on the assert and on the clear of the PPS signal. Press ctrl-C to terminate ppstest.

root@manitou root# ppstest /dev/ttyS3
trying PPS source "/dev/ttyS3"
found PPS source #3 "serial3" on "/dev/ttyS3"
ok, found 1 source(s), now start fetching data...
source 0 - assert 1315494544.999995675, sequence: 37249847 - clear  1315494544.099998000, sequence: 37249862
source 0 - assert 1315494544.999995675, sequence: 37249847 - clear  1315494545.099995000, sequence: 37249863
source 0 - assert 1315494545.999994675, sequence: 37249848 - clear  1315494545.099995000, sequence: 37249863
source 0 - assert 1315494545.999994675, sequence: 37249848 - clear  1315494546.099993000, sequence: 37249864
source 0 - assert 1315494546.999994675, sequence: 37249849 - clear  1315494546.099993000, sequence: 37249864
ctrl-C

The above sequence shows that the system clock is behind the GPS. The system time when the interrupt function is called on the PPS assert is about 5 microseconds before the exact second (0.999995). This corresponds to an NTPClockOffset of a positive 5 microseconds. This is confirmed with the ntpq program (which reports its offset in milliseconds):
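
The offset can also be read off a ppstest line programmatically. The Python sketch below is an illustration with made-up names. It takes the fractional part of the assert timestamp as an integer number of nanoseconds instead of converting the whole timestamp to a float, because a double-precision float cannot hold nanosecond resolution at values near 1.3e9 seconds. It assumes the fractional part always has 9 digits, as in the output above.

```python
import re

# Assumption: ppstest prints the assert time as seconds.nanoseconds with 9 digits.
ASSERT_RE = re.compile(r"assert (\d+)\.(\d{9})")


def assert_offset_ns(line: str) -> int:
    """Nanoseconds from the system time at PPS assert to the nearest whole second.

    Positive means the system clock read slightly before the second,
    i.e. the system clock is behind the GPS.
    """
    match = ASSERT_RE.search(line)
    if match is None:
        raise ValueError("no assert timestamp found")
    nsec = int(match.group(2))
    return 1_000_000_000 - nsec if nsec >= 500_000_000 else -nsec


line = (
    "source 0 - assert 1315494544.999995675, sequence: 37249847 - "
    "clear  1315494544.099998000, sequence: 37249862"
)
offset_ns = assert_offset_ns(line)
print(offset_ns, "ns")
```

For the first line of the output above this gives 4325 ns, which is consistent with the roughly 5 microsecond offset quoted in the text.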

root@manitou root# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
xral             38.229.71.1      3 u   34   64  377    0.320    3.804   0.031
 LOCAL(0)        .LOCL.          10 l  93d   64    0    0.000    0.000   0.000
oGPS_NMEA(0)     .GPS.            2 l    6   16  377    0.000    0.005   0.031

The ntpq output indicates (with the leading 'o') that NTP is using the GPS as the system's reference clock. It also displays the offset of the RAL server's clock, 3.804 milliseconds, and indicates with an 'x' that it is not using that clock as a reference. The RAL server uses NTP over a Wi-Fi connection to adjust its clock, so it is not as accurate as the DSM.
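
For scripted checks, the leading tally character and the offset column can be pulled out of the ntpq -p lines. The Python sketch below is an illustration with made-up names. It assumes whitespace-separated columns with offset as the second-to-last field, as in the output above, and the tally descriptions are short paraphrases of the ntpq documentation.

```python
# Tally codes printed by ntpq in the first column of each peer line.
TALLY = {
    " ": "rejected",
    "x": "falseticker",
    ".": "excess",
    "-": "outlier",
    "+": "candidate",
    "#": "backup",
    "*": "system peer",
    "o": "PPS peer",
}


def parse_peer(line: str) -> dict:
    """Split one ntpq -p peer line into tally code, remote name and offset."""
    fields = line[1:].split()
    return {
        "tally": line[0],
        "status": TALLY.get(line[0], "unknown"),
        "remote": fields[0],
        "offset_us": float(fields[-2]) * 1000.0,  # ntpq reports milliseconds
    }


lines = [
    "xral             38.229.71.1      3 u   34   64  377    0.320    3.804   0.031",
    " LOCAL(0)        .LOCL.          10 l  93d   64    0    0.000    0.000   0.000",
    "oGPS_NMEA(0)     .GPS.            2 l    6   16  377    0.000    0.005   0.031",
]
peers = [parse_peer(line) for line in lines]
for peer in peers:
    print(f"{peer['remote']:<12} {peer['status']:<12} {peer['offset_us']:8.1f} us")
```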

The loopstats file also shows the 5 usec offset at this time:

55812 54504.454 0.000005000 39.301 0.000030518 0.001408 4
55812 54520.455 0.000006000 39.302 0.000030518 0.001415 4
55812 54536.454 0.000005000 39.303 0.000030518 0.001392 4
55812 54552.454 0.000005000 39.305 0.000030518 0.001372 4

I do not believe I've seen a jitter value less than 31 microseconds. Not sure why that is. I believe the jitter is the standard deviation of the offset, but the NTP documentation is rather unclear to me.
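
The loopstats lines above can be decoded with a few lines of Python. Per the NTP monitoring documentation, the columns are the modified Julian day, the seconds past UTC midnight, the clock offset in seconds, the frequency offset in PPM, the RMS jitter in seconds, the RMS frequency jitter (wander) in PPM, and the loop time constant. The function and key names below are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Modified Julian Day 0 is 1858-11-17 00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def parse_loopstats(line: str) -> dict:
    """Decode one loopstats record into named fields."""
    mjd, sec, offset, freq, jitter, wander, tc = line.split()
    return {
        "time": MJD_EPOCH + timedelta(days=int(mjd), seconds=float(sec)),
        "offset_s": float(offset),
        "freq_ppm": float(freq),
        "jitter_s": float(jitter),
        "wander_ppm": float(wander),
        "time_constant": int(tc),
    }


rec = parse_loopstats("55812 54504.454 0.000005000 39.301 0.000030518 0.001408 4")
print(rec["time"].isoformat(), f"offset = {rec['offset_s'] * 1e6:.1f} us")

# The logged jitter matches 2**-15 seconds to the printed precision.
print(f"{2 ** -15:.9f}")
```

The first record decodes to 2011-09-08 15:08:24 UTC, about 40 seconds before the first ppstest timestamp above. The jitter value 0.000030518 equals 2^-15 seconds to nine decimal places. One possible explanation, offered here as an unverified assumption, is that NTP does not report a jitter below the system clock precision and that the precision on this system is 2^-15 seconds.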