Executing many serial or small jobs at once

There are numerous ways to execute many small/serial independent tasks at once. The ultimate choice depends upon specific needs of the tasks, the HPC system(s) involved, and personal preference.

Problem: The user has many independent, serial tasks to execute at once. Or, the tasks are very small, either MPI or OpenMP, but will utilize few enough cores that multiple tasks should be packed into each node.

1. Scenario 1a: Serial Tasks on Crays with only aprun

Aprun can be used to launch scripts, not just executables. It will launch n instances of the script on the appropriate nodes, and the script can determine its own instance number by examining the environment variable $ALPS_APP_PE . Consider the following batch script:

#PBS -l select=2:ncpus=32:mpiprocs=32
aprun -n 64 myscript.sh

Then, here is an example of myscript.sh that uses:

filename=inputfile_`printf %03d $ALPS_APP_PE`
# ^^^ note: filename is inputfile_### where ### is file number starting at zero
./myexecutable $filename

The result will be that 64 instances of myexecutable run across 2 nodes (assuming 32 cores each), each loading a separate file that ranges from inputfile_000 to inputfile_063.

2. Scenario 1b: Serial Tasks on SGIs and IBMs with only Intel MPI

This works exactly the same as Scenario 1a. However, the differences are:

  • Swap out the SGI mpt module for Intel MPI, e.g., module swap mpt intel-mpi-15 (on Thunder) [SGI's mpt does not appear to support executing serial code]
    • On IBM iDataPlexes with Intel MPI, if it is not the default, swap out the default MPI for Intel MPI
  • In the batch script, replace aprun with mpiexec.hydra
  • Replace $ALPS_APP_PE with $PMI_RANK

3. Scenario 2: Serial Tasks with Python and MPI

The Common Open Source Toolkit (COST), which is provided on all HPCMP systems, includes the mpi4py module. This should work on any system, and has been tested with aprun on Cray, mpiexec_mpt on SGI, and mpiexec.hydra on IBM iDataPlex with Intel MPI. Consider the following batch script for a Cray:

#PBS -l select=2:ncpus=32:mpiprocs=32
module load costinit python mpi4py   # Required for python & mpi (add other required python modules also)
aprun -n 64 python myscript.py

Then the following python script, myscript.py:

from mpi4py import MPI
file="inputfile_%03d" % (comm.rank)

do_something_with_a_file() could launch an external serial executable, do some processing on its own in Python, or a little of both. The advantage of using Python with MPI is that communication is possible, just like any MPI code in another language. More information is here: http://pythonhosted.org/mpi4py/.

4. Scenario 3: Serial Tasks on Crays with CCM

Cray's provide Cluster Compatibility Mode (CCM), which sets up a series of nodes in a batch job without Cray's usual special environment. The advantage of this approach is maximum flexibility on how tasks are farmed out to nodes, and an environment that is similar to a traditional cluster. However, node selection and other setup must be handled manually.

CCM is enabled with the PBS resource statement: -l ccm=1. The user uses ccmrun instead of aprun and can use ssh to access all nodes available to that running batch job. The list of nodes is available in the node file: $PBS_NODEFILE. However, the node file is usually not visible from a compute node, so it must be copied before ccmrun (or passed to the compute nodes some other way). Starting with the following batch script for 2 nodes with 32 cores each:

#PBS -l select=2:ncpus=32:mpiprocs=32
#PBS -l ccm=1
cd $PBS_O_WORKDIR       # go to where I'm working, which is where I was submitted
cp $PBS_NODEFILE ./mynodefile
ccmrun myscript.sh 64   # run myscript.sh on 1st compute node and tell it to use 64 tasks

The script, myscript.sh will be launched on the 1st node of the requested nodes for the batch job, e.g.:

for (( i=0; i<$tasks; i++ )); do
   let line=$i+1                               # Task i is line i+1 of nodefile
   node=`head -$line mynodefile | tail -1`     # Grab line# $line from nodefile
   ssh $node mytask.sh $i &                    # Note the '&' to run i the background
wait   # wait for all the background ssh's above to finish

myscript.sh will execute 64 instances of mytask.sh, 32 per node. Note that $i was passed to mytask.sh, so mytask.sh can query its first argument $1 to determine its instance/rank number.