Goal

Understand how to run Multiple Program Multiple Data (MPMD) jobs on derecho, especially when the job consists of multiple single-CPU commands which we would like to run in heterogeneous combinations.  For example: start one long-running command on one CPU, then start a group of copies of a second command (with differing data among the members), possibly using a command-file algorithm.  When that group finishes, start a second group.  When the long-running command and the ensemble commands are all done, clean up and quit.

      > long-running-command on CPU0 ---------------------------------------------- |quit

      > ensemble command on CPUs1-80 ------- > ens.cmd on CPUs1-55 --------         |quit
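      As a rough sketch (in bash, with hypothetical script names, project code, and core counts; the placement problems described later are not addressed here), the desired job has this shape:

      #!/bin/bash
      #PBS -N mpmd_example
      #PBS -A <project_code>
      #PBS -q main
      #PBS -l walltime=01:00:00
      #PBS -l select=1:ncpus=81:mpiprocs=81

      # Long-running single-core command, started in the background.
      mpiexec -n 1 long-running-command &

      # First ensemble: 80 single-core copies; blocks until all finish.
      mpiexec -n 80 ensemble-command

      # Second ensemble: 55 single-core copies.
      mpiexec -n 55 ens.cmd

      # Wait for the background command, then clean up and quit.
      wait
      # ...cleanup...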

Job arrays won't work (easily?) for this because they are designed for SPMD work: they take a range of integers, each of which selects the input data to feed to the same command.
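      For example, a PBS job array hands each copy only an integer index, and every copy runs the same script (script and input names hypothetical):

      #PBS -J 1-80
      # Every array member runs this same script; the only thing
      # distinguishing the members is $PBS_ARRAY_INDEX (1..80 here).
      ./ensemble-command input.${PBS_ARRAY_INDEX}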

I originally (on yellowstone) wrote a script to use most of the processors on multiple nodes.  I had to adapt it to an earlier version of casper, which limited the number of MPI processes, so it processed only batches of commands as large as the ensemble size.  This scripting could be rewritten, but I'm looking for a set of minimal changes to make it work on derecho, on one node (128 cores).


Naming of Things

I'll try to avoid using "process", "processor", and "CPU" without a modifier, since they seem to be a major source of (my) confusion in documentation.

Computer parts

     In this document derecho has parts as follows, from the smallest up.  These choices are subjective and vary widely in documentation.

      (transistors: the tiniest hardware bits that can do a specialized computation.
          Not accessible to users.)

       thread (sometimes "logical core" or "data lane"): 2 per processor core.
          The smallest hardware unit accessible to users for general computational tasks.
          Daniel: the physical hardware "data lanes" on a CPU where programs run.
          On Derecho's AMD chips there are 2 threads per core, in what is often called
          hyperthreading, but performance does not necessarily improve, and can actually
          worsen, when running 2 threads per core (128 max threads per socket).

       processor core (sometimes "CPU"): a physical unit that can compute on 1 or 2 threads.
          128 per node.  I'll use "core" to mean a hardware processor core.

       socket (sometimes "processor" refers to the computational machinery on a socket):
          2 per node.  A hardware bundle of 64 cores.
          I'm not aware of any MPI implications.

       node: the largest computing unit.  Has 128 cores (2 sockets of 64).
          Some documentation (pbs_resources) refers to vnodes (virtual nodes),
          which are an abstraction of the idea of a node being a semi-independent
          set of resources.  I believe that the only vnodes defined on derecho
          are the physical nodes.


Instructions to Derecho


     job; the script which is submitted to the batch queue and sets up resources.
     command; the script which does the work we want done.
         May be MPI-enabled.  Knows what resources it needs,
         but doesn't know about available resources.
     mpiexec ("mpirun" "orterun"(casper only))
         the command which starts 1 or more MPI-enabled commands
     copy (sometimes "instance")
         One of several simultaneous executions of the same command,
         with differing inputs and/or active sections of code (based on the copy number,
         array index, PMI_RANK, OMPI_COMM_WORLD_RANK, ...); see the sketch after this list.
     rank (often "process" or "task")
         a single MPI computational sequence, which runs on a single core
         (assuming no OpenMP threading).
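
     For example, a copy can discover which copy it is from the rank variables mentioned above.  A minimal sketch (which variable is set depends on the MPI library; the work and file names are hypothetical):

     #!/bin/bash
     # Pick up this copy's rank from whichever variable the launcher set.
     rank=${PMI_RANK:-${OMPI_COMM_WORLD_RANK:-0}}

     # Use the rank to select this copy's input and output.
     ./do-the-work < input.${rank} > output.${rank}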


Resources and Using Them 

   mpiexec

     :
         Separates the requests to run different commands, which require different resources.

     -n, -np, --np
         Number of processes to start (default: 1).
         e.g. "Launch <application> with four processes."
!        This is inconsistent with the examples; they start -n COPIES of <application>,
         and each copy is assigned -ppn processes (unless "processes" is used differently
         in the 2 contexts, which is very common.  Then what does it mean in either place?)
         Rory says that derecho may choose to put each of these copies on a separate dedicated node,
         even if each copy requests only a few CPUs (and doesn't need much memory),
         so the job may be 100x as expensive as expected.
         Some Derecho docs say that 'develop' queue jobs are limited to 2 nodes per user,
         but that's enforced only when others are trying to use the develop queue.
         This can be controlled by the PBS "-l place" directive (see below).

     -ppn, --ppn
         Processes (ranks) per compute node.
         "Assign --ppn ranks to each node in host list; each MPMD command starts on a new node."
!        This is inconsistent with the examples; there -ppn assigns some number of processes
         to each COPY, not to each node (except that derecho defaults to putting each copy
         on a separate node).
         And if a copy is a "process", which might be implied by -n, then this is assigning
         -ppn processes to each process.  Less than helpful.

     --cpu-bind
         Controls how ranks are bound to cores.
         Apparently not directly relevant to this issue.  But the issue isn't solved, so maybe it is.

   PBS, qsub

     One strategy is to use PBS's "chunk" architecture.
     A chunk is a set of resources available to an MPI-enabled command.

     ncpus      = Number of processors (cores) to be used by a chunk.
                  = mpiprocs * ompthreads (if not oversubscribing)
     mpiprocs   = Number of MPI processes for this chunk.
     ompthreads = Number of OpenMP threads per MPI process.  Default 1.
     mem        = How much memory this chunk needs.

     Chunks are used 1-by-1 in the order they are listed in the #PBS -select directive.
     So the chunk list needs to match the commands' needs.

     #PBS -l select=1:ncpus=1:mpiprocs=1:mem=10GB+3:ncpus=2:mpiprocs=2:mem=8GB
     + requests 4 chunks:
        the first one is for a single-core command and
        the last 3 are for commands which can run 2 MPI processes each, each process on its own core.
     + These chunks could be used in a single mpiexec statement:
         > mpiexec --cpu-bind core -n 1 <application>  :  -n 3 --ppn 2 <app2>

       It apparently gets tricky if I want to start 2 mpiexec commands separately
       and have them run at the same time.  This might be done by running the first command
       in the background, followed by the subsequent commands:
       > mpiexec -n 1 --ppn 1 single-core.csh &
       > mpiexec -n 3 --ppn 2 dual-core.csh     (runs 3 copies of dual-core.csh, using 2 cores for each)
       But derecho puts all 4 of these copies on separate nodes, even though they easily fit on 1.
       It does run them correctly, but at many times the cost it should.

       It's possible to make derecho run all of the commands on 1 node using

            #PBS -l place=pack:shared

            pack:    all chunks are taken from one host.
            shared:  this job can share the vnodes chosen.

       But then mpiexec is unable to assign the ranks to the cores for the multi-core commands,
       regardless of any --cpu-bind arguments.

       Rory Kelly, Daniel Howard and I could not find a combination of PBS directives and mpiexec arguments to make this work.
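
       For reference, a sketch (in bash, names hypothetical) of the kind of job script we were attempting; as just noted, the binding does not come out right:

       #!/bin/bash
       #PBS -A <project_code>
       #PBS -q main
       #PBS -l walltime=00:30:00
       #PBS -l select=1:ncpus=1:mpiprocs=1:mem=10GB+3:ncpus=2:mpiprocs=2:mem=8GB
       #PBS -l place=pack:shared

       # Single-core command in the background, 3 dual-core copies in the
       # foreground, all intended to share one node.
       mpiexec -n 1 --ppn 1 single-core.csh &
       mpiexec -n 3 --ppn 2 --cpu-bind core dual-core.csh
       wait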

       `numactl -s | grep physcpubind` is useful for seeing the cores that are assigned to this job,
       but only on the head node (unless it is called inside the commands which are run by mpiexec).
             physcpubind also lists the 2nd set of threads, which are associated with the first set of cores.
             E.g. for 3 cores:  physcpubind = 0 1 2 128 129 130
       It's essential when using --cpu-bind list:$cpulist without using all of the cores on a node,
       because any cores might be assigned (e.g. 2,10,11) and they must be explicitly named.
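
       A sketch of one way to build such a list inside the job script, assuming the comma-separated form implied above is what --cpu-bind list expects (the per-rank separator conventions should be checked against the mpiexec man page; the command name is hypothetical):

       #!/bin/bash
       # Cores assigned to this job; the second half of physcpubind is the
       # set of hyperthread twins (e.g. core 2 pairs with 130), so keep the first half.
       cores=$(numactl -s | awk -F: '/physcpubind/ {print $2}')
       ncores=$(( $(echo $cores | wc -w) / 2 ))
       cpulist=$(echo $cores | tr ' ' '\n' | head -n $ncores | paste -sd, -)

       mpiexec -n $ncores --cpu-bind list:$cpulist my-command.csh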

        $PBS_NODEFILE is also useful:
            "Name of file containing the list of vnodes assigned to the job.  Created when qsub is run."
            KR: Actually it's a list containing the vnode name of each MPI rank requested for the job:
            sum(mpiprocs from all chunks) = 1*1 + 3*2 = 7  in the example above.
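
            For that example, $PBS_NODEFILE therefore has 7 lines, one per rank; with the default placement the copies can land on different nodes, so it might look something like this (hostnames hypothetical):

            dec0123
            dec0456
            dec0456
            dec0789
            dec0789
            dec1011
            dec1011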

             The pbs_resources man page writes about vnodes, but doesn't define them.
             Argonne describes them:
                  vnode: A virtual node, or vnode, is an abstract object representing a host
                  or a set of resources which form a usable part of an execution host.
                  PBS operates on vnodes.
                  This could be an entire host, or a nodeboard or a blade.
                  A single host can be made up of multiple vnodes.
                  Each vnode can be managed and scheduled independently.
                  Each vnode in a complex must have a unique name.
                  Vnodes on a host can share resources, such as node-locked licenses.
!            Searching for "vnode" in CISL's ARC returns nothing.
             This implies that vnode = "node" on derecho.

        "execution host": I believe this is also a node on derecho.

Alternatives

     Our failure to get this to work on derecho led to trying it on casper,
     by changing the requested queue to casper@casper-pbs and submitting from derecho.
     This fails because the mpiexec commands are quite different on the 2 computers:
     casper's doesn't have the -ppn argument, plus an unknown number of other differences.
     At this point I'm giving up.  I'll continue to run the single- and multi-core commands in separate jobs.

     Also on casper: the docs say to limit jobs to 1 node (36 cores), but Rory says it's okay to use up to 255 (not 256) for short jobs.

     Another strategy is to rewrite the scripts to make them fit the PBS job array architecture.
     The system default launch_cf uses this.
     It requires each copy of the script being run to fetch its MPI rank from the environment
     and execute commands based on that.  This is awkward for a heterogeneous set of commands.
     And it's designed to run each command on a separate node, so that when it's done
     it can free the node for others to use.  That's expensive for commands that use a small
     number of cores.
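
     For the record, a sketch of that rewrite under the job array approach (command-file name hypothetical): each array member looks up its own line in a command file and runs it, which is where the heterogeneous commands would go.

     #!/bin/bash
     #PBS -J 1-80
     # Each array member runs the line of the command file selected
     # by its array index.
     cmd=$(sed -n "${PBS_ARRAY_INDEX}p" command_file.txt)
     eval "$cmd"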
