mpiprocs Hook

In the current batch job scheduling environment, a job submitted on a Cray system is rejected at run time if its number of MPI processes (mpiprocs) is not valid with respect to the number of processors required to run the job.

Instead of letting such a job fail at run time, a tool called "mpiprocs hook" was developed to anticipate the failure at submission time. Rejecting the job at submission time saves users effort, saves system resources, and eliminates the time wasted when a job waits in the queue only to fail at run time.

The "mpiprocs hook" rejects a job if it exceeds the maximum total mpiprocs per job request or if its mpiprocs value is not a factor of the number of requested CPUs. The validation queries the job's mpiprocs on Cray systems before the job is accepted into the queues, so users know immediately whether their mpiprocs setting is valid. If it is not, the user receives a warning message that explains why the setting was rejected and recommends a valid mpiprocs value.
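The factor check and the recommendation can be illustrated with a minimal Python sketch. This is not the hook's actual code: the function names, the message wording, and the choice to recommend the largest divisor not exceeding the requested value are all assumptions made for illustration, as is reading "requested CPUs" as a single ncpus count.

```python
from typing import List, Optional, Tuple


def valid_mpiprocs_values(ncpus: int) -> List[int]:
    """Return every divisor of ncpus in ascending order (hypothetical helper)."""
    return [d for d in range(1, ncpus + 1) if ncpus % d == 0]


def check_mpiprocs(ncpus: int, mpiprocs: Optional[int]) -> Tuple[bool, str]:
    """Accept or reject an mpiprocs value for a given CPU count.

    Returns (accepted, message). The message format is an assumption.
    """
    # An unset or zero mpiprocs is accepted so the system default can apply.
    if not mpiprocs:
        return True, "mpiprocs not set; system default applies"

    # Accept when mpiprocs divides the requested CPU count evenly.
    if mpiprocs > 0 and ncpus % mpiprocs == 0:
        return True, "mpiprocs is valid"

    # Otherwise suggest the largest divisor that does not exceed the request
    # (an assumed recommendation policy, not necessarily the hook's).
    candidates = [d for d in valid_mpiprocs_values(ncpus) if d <= mpiprocs]
    suggestion = max(candidates, default=1)
    return False, (
        f"mpiprocs={mpiprocs} is not a factor of ncpus={ncpus}; "
        f"try mpiprocs={suggestion}"
    )


if __name__ == "__main__":
    for value in (16, 5, 0, None):
        print(value, check_mpiprocs(32, value))
```

With 32 CPUs, a request of 16 passes, while a request of 5 is rejected with a suggestion of 4 under the assumed policy.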

Key features are:

  • Reject jobs that exceed the maximum total mpiprocs.
  • Reject a job if its mpiprocs value is not a factor of the number of requested CPUs.
  • Support accelerator nodes (nmics), configurable per machine.
  • Support multiple chunks within a select statement.
  • Support system default values; the hook also accepts a job if mpiprocs is not set or is set to zero.
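As a rough illustration of how these checks might combine across a multi-chunk select statement, the Python sketch below parses a select value such as 2:ncpus=32:mpiprocs=16+1:ncpus=32:mpiprocs=8 and validates each chunk. It does not use the scheduler's hook interface and is not the production hook. The function names, the error text, the per-job limit, and the rule that the total is the sum of chunk count times mpiprocs are assumptions. Accelerator (nmics) handling is not modeled.

```python
from typing import Dict, List, Tuple


def parse_select(select: str) -> List[Tuple[int, Dict[str, str]]]:
    """Split a select value into (chunk_count, resources) pairs.

    Expects the text after "select=", with chunks joined by "+".
    """
    chunks = []
    for spec in select.split("+"):
        parts = spec.split(":")
        count = 1
        # A leading integer is the number of copies of this chunk.
        if parts and parts[0].isdigit():
            count = int(parts[0])
            parts = parts[1:]
        resources = dict(p.split("=", 1) for p in parts if "=" in p)
        chunks.append((count, resources))
    return chunks


def validate_select(select: str, max_total_mpiprocs: int) -> List[str]:
    """Return a list of rejection reasons; an empty list means accept."""
    errors = []
    total = 0
    for idx, (count, res) in enumerate(parse_select(select), start=1):
        mpiprocs = int(res.get("mpiprocs", 0))
        # Unset or zero mpiprocs is left to the system default.
        if mpiprocs == 0:
            continue
        total += count * mpiprocs
        ncpus = res.get("ncpus")
        # Only apply the factor check when the chunk states ncpus explicitly.
        if ncpus is not None and int(ncpus) % mpiprocs != 0:
            errors.append(
                f"chunk {idx}: mpiprocs={mpiprocs} is not a factor of ncpus={ncpus}"
            )
    if total > max_total_mpiprocs:
        errors.append(
            f"total mpiprocs {total} exceeds the limit of {max_total_mpiprocs}"
        )
    return errors


if __name__ == "__main__":
    # The limit of 64 is a made-up value for demonstration only.
    print(validate_select("2:ncpus=32:mpiprocs=16+1:ncpus=32:mpiprocs=8", 64))
    print(validate_select("2:ncpus=32:mpiprocs=5", 64))
```

Collecting every reason before rejecting, rather than stopping at the first failure, lets a single message report all problems in the request at once.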

The newly developed "mpiprocs hook" is now in production on all HPCMP-allocated HPC systems.

An article by the BC Team on the "mpiprocs hook" was published in the April 2017 edition of the What's New @ HPCMP quarterly newsletter.