CAM/CLM (GPTL-based) Timers
The CAM timing library interface is simple and flexible, but there are limitations (some related to the design; some simply a function of the implementation).
Basic Interface
[0] To use the library interface routines in a routine or a module, add
use perf_mod
[1] To enable the instrumentation,
call t_initf('NlFilename', LogPrint=.true., mpicom=mpicom, MasterTask=.true.)
where
- 'NlFilename' is the file containing the namelist prof_inparm
- LogPrint is an (optional) logical indicating whether the timing library parameters should be output to standard out
- mpicom is the MPI communicator for a subset of processes reading prof_inparm from 'NlFilename'.
- MasterTask is a logical indicating whether process 0 (for this communicator) will read and broadcast the namelist or whether all processes read the namelist individually.
If 'NlFilename' does not exist, or if prof_inparm is not
found in 'NlFilename', then the default values will be used.
(The job will not abort.) MasterTask and mpicom are optional
parameters. If they both exist, then process 0 (for this
communicator) will read the prof_inparm namelist and broadcast
these values to the rest of the processes associated with this
communicator. Otherwise, each process reads the namelist directly.
(Note that the value of MasterTask itself is ignored; only its presence is tested.)
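For example, since LogPrint, mpicom, and MasterTask are all optional, both of the following calls are valid (a sketch; 'timing_nl' is a placeholder filename):

! process 0 of mpicom reads prof_inparm and broadcasts the values
call t_initf('timing_nl', LogPrint=.true., mpicom=mpicom, MasterTask=.true.)

! every process reads prof_inparm from the file directly
call t_initf('timing_nl')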
[2] To define an 'eventX', surround the relevant code with calls of the form:
call t_startf('eventX')
...
call t_stopf('eventX')
[3] To print out the performance data, call
call t_prf('PrFilename', mpicom)
Depending on the namelist parameters, this either puts all of the timer data into one file (named 'PrFilename') or generates a separate file for each MPI process (named 'PrFilename.pid'). The information is the same - there is no reduction or sampling currently. Note that subsets of processes (defined by the mpicom argument) can call t_prf separately. In this case, each subset should use a unique filename.
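Putting steps [0] through [3] together, a minimal instrumented code might look like the following sketch ('timing_nl', 'timing_output', and the event names are placeholders, and the loop bodies stand in for real work):

subroutine run_and_profile(mpicom)
   use perf_mod        ! [0] timing library interface
   integer, intent(in) :: mpicom
   integer :: step

   ! [1] enable instrumentation, reading prof_inparm from 'timing_nl'
   call t_initf('timing_nl', LogPrint=.true., mpicom=mpicom, MasterTask=.true.)

   ! [2] bracket the code sections of interest; events may nest
   call t_startf('time_loop')
   do step = 1, 100
      call t_startf('dynamics')
      ! ... dynamics work ...
      call t_stopf('dynamics')
      call t_startf('physics')
      ! ... physics work ...
      call t_stopf('physics')
   end do
   call t_stopf('time_loop')

   ! [3] write out the collected performance data
   call t_prf('timing_output', mpicom)
end subroutine run_and_profile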
Performance data are recorded per event for each process, and for each thread within a process. Event data include number of occurrences, total time (inclusive), maximum time, minimum time, and total performance data collection overhead (estimated). Events are output in order of their occurrence, with indentation indicating the nesting of events.
In comparison with the POP timers, the events do not need to be defined during the initialization, nor do they have to be the same within all processes or threads (an advantage). Event statistics are also not generated across processes (a disadvantage). Note that writing out data for each process is not an option for the POP timers, though the data is collected internally.
Other Commands
[a] clean up and shut down the instrumentation:
call t_finalizef()
[b] add an instrumented barrier (that is enabled only when specified in the namelist):
call t_barrierf(event='sync_eventX', mpicom=mpicom)
Both arguments are optional. If the string is omitted, then an event is not generated for the barrier. If mpicom is omitted, then MPI_COMM_WORLD is used. If executed within a threaded region, the command is ignored.
[c] define the level of detail represented by subsequent events:
call t_adj_detail(detail_adjustment)
If executed within a threaded region, the command is ignored.
[d] disable event profiling for a section of code, then re-enable it:
call t_disablef()
...
call t_enablef()
If executed within a threaded region, these commands are ignored.
[e] query wallclock, user, and system times:
call t_stampf(wall, usr, sys)
This command does nothing on the Cray XT system (because of the Catamount OS).
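The sketch below exercises commands [a] through [e] in context (the event names and commented-out code are hypothetical placeholders, and the kind of the t_stampf arguments is assumed here to be real(8)):

subroutine end_of_run(mpicom)
   use perf_mod
   integer, intent(in) :: mpicom
   real(8) :: wall, usr, sys   ! kind assumed

   ! [e] query current wallclock, user, and system times
   call t_stampf(wall, usr, sys)

   ! [b] charge load imbalance to 'sync_exchange' rather than to the
   !     communication event itself (active only when profile_barrier
   !     is .true. in the namelist)
   call t_barrierf(event='sync_exchange', mpicom=mpicom)
   call t_startf('exchange')
   ! ... communication code ...
   call t_stopf('exchange')

   ! [c] mark subsequent events as higher detail; they are profiled
   !     only if profile_detail_limit is set high enough
   call t_adj_detail(+1)
   call t_startf('high_freq_event')
   ! ... short, frequently executed work ...
   call t_stopf('high_freq_event')
   call t_adj_detail(-1)

   ! [d] suppress profiling of an uninteresting section
   call t_disablef()
   ! ... events started and stopped here are not recorded ...
   call t_enablef()

   ! [a] at the very end of the run, shut the instrumentation down
   call t_finalizef()
end subroutine end_of_run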
Namelist Arguments
profile_disable
- logical indicating whether perf_mod routines should be disabled for the duration of the run. The default is .false. .
profile_barrier
- logical indicating whether calls to t_barrierf are enabled. The default is .false. .
profile_single_file
- logical indicating whether the performance timer output should be written to a single file (per component communicator) or to a separate file for each process. The default is .true. .
profile_depth_limit
- integer indicating maximum number of levels of timer nesting. When the nesting exceeds this maximum, further event profiling is disabled. This controls the detail and size of the profile output. It also (usually) controls the overhead of the profiling in that the higher frequency short events typically occur deeper in the event nesting. The default is 99999 .
profile_detail_limit
- integer indicating maximum detail level to profile. The command t_adj_detail allows the user to define the level of "detail" at a given point in the source code. The namelist parameter then specifies what levels of detail will be profiled, similar to the control on the nesting depth. In CAM, this is used to disable profiling during the initialization routines, and within the loops over chunks (which occur at a higher frequency when chunk sizes are small). The default is 0 .
profile_timer
- integer indicating which timer to use (as defined in gptl.inc). This does nothing yet, but will provide runtime control of the timer used for profiling if/when we move to Jim Rosinski's latest version of the GPTL timing library. The default is
#ifdef UNICOSMP
   integer, parameter :: def_perf_timer = GPTLrtc      ! default
#else
   integer, parameter :: def_perf_timer = GPTLmpiwtime ! default
#endif
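For reference, a prof_inparm namelist file (the 'NlFilename' passed to t_initf) collecting these settings might look like the following sketch (the values shown are illustrative, not recommendations):

&prof_inparm
 profile_disable      = .false.
 profile_barrier      = .true.
 profile_single_file  = .false.
 profile_depth_limit  = 16
 profile_detail_limit = 1
/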
- Note 1: these entry points are available only from Fortran. C routines could call the GPTL library directly, but this would not be identical to calling the perf_mod routines. Providing equivalent C entry points might be one generalization that we would want to consider.
- Note 2: The nesting indentation is approximate. In the original Rosinski timing library, events that did not occur the first time through a timestep were relegated to the end of the list. Jim Edwards modified this to put the events in their correct locations. However, if an event occurs in multiple places (within multiple other events), it is listed as occurring in only one location.
Future extensions
- events defined by character string and by nesting within other events.
- statistical summary over all processes (might be too difficult, especially if we add "call-site" profiling)