Nautilus Slurm Guide

1. Introduction

On large-scale computers, many users must share available resources. Because of this, you can't just log on to one of these systems, upload your programs, and start running them. Essentially, your programs must "get in line" and wait their turn, and there is more than one of these lines, or queues, from which to choose. Some queues have a higher priority than others (like the express checkout at the grocery store). The queues available to you are determined by the projects you are involved with.

To perform any task on the compute cluster, you must submit it as a "job" to a special piece of software called the scheduler or batch queueing system. At its most basic, a job can be a single command run non-interactively, but any command (or series of commands) you want to run on the system is called a job.

Before you can submit your job to the scheduler, you must describe it, usually in the form of a batch script. The batch script specifies the computing resources needed, identifies an application to be run (along with its input data and environment variables), and describes how best to deliver the output data.

The process of using a scheduler to run the job is called batch job submission. When you submit a job, it is placed in a queue with jobs from other users. The scheduler then manages which jobs run, where, and when. Without the scheduler, users could overload the system, resulting in tremendous performance degradation for everyone. The queuing system runs your job as soon as it can do so while still honoring the following:

  • Meeting your resource requests
  • Not overloading the system
  • Running higher priority jobs first
  • Maximizing overall throughput

The process can be summarized as:

  1. Create a batch script.
  2. Submit a job.
  3. Monitor a job.
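
As a rough sketch of these three steps, the following commands write a trivial batch script, submit it with sbatch, and check on it with squeue. The file name hello_job.sh and the queue, walltime, and Project_ID values are placeholders chosen for illustration, and the submission step is skipped when sbatch is not available (for example, when you are not logged in to a Slurm system).

```shell
#!/bin/bash
# Step 1: create a batch script (hello_job.sh is a hypothetical name).
# Replace Project_ID and the other placeholder values before real use.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH -q debug
#SBATCH --time=00:05:00
#SBATCH --account=Project_ID
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
echo "Hello from $(uname -n)"
EOF

if command -v sbatch >/dev/null 2>&1; then
    # Step 2: submit the job; --parsable makes sbatch print only the job ID.
    if job_id=$(sbatch --parsable hello_job.sh); then
        # Step 3: monitor the job (strip any ";cluster" suffix from the ID).
        squeue -j "${job_id%%;*}"
    fi
else
    echo "sbatch not found; run this on a login node to submit hello_job.sh"
fi
```

The details of writing the script and choosing its directives are covered in the sections that follow.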

1.1. Document Scope

This document provides an overview and introduction to the use of the Slurm batch scheduler on the Penguin Computing TrueHPC (Nautilus) located at the Navy DSRC. The intent of this guide is to provide information to enable the average user to submit jobs on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the Linux operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote use of computer systems via network
  • A selected programming language and its related tools and libraries

We suggest you review the Nautilus User Guide before using this guide.

2. Resources and Queue Information

2.1. Resource Summary

When working on an HPC system you must specify the resources your job needs to run. This lets the scheduler find the right time and place to schedule your job. Strict adherence to resource requests allows Slurm to find the best possible place for your jobs and ensures no user can use more resources than they've been given. You should always try to specify resource limits that are close to but greater than your requirements so your job can be scheduled more quickly. This is because Slurm must wait until the requested resources are available before it can run your job. You cannot request more resources than are available on the system, and you cannot use more resources than you request. If you do, your job may be rejected, fail, or remain indefinitely in the queue.

Nautilus is a batch-scheduled HPC system with numerous nodes. On a batch-scheduled system, users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available. All jobs that require large amounts of system resources must be submitted as a batch script, which is a script that provides the resource requirements and commands for the job. As discussed in Section 3, scripts are used to submit a series of directives that define the resources required by your job. The most basic resources include time, nodes, and memory.

2.2. Node Information

Below is a summary of node types available on Nautilus. Refer to the Nautilus User Guide for in-depth information.

  • Login nodes - Access points for submitting jobs on Nautilus. Login nodes are intended for basic tasks such as uploading data, managing files, compiling software, editing scripts, and checking on or managing your jobs. DO NOT run your computations on the login nodes.
  • Compute nodes - Node types such as "Standard", "Large-Memory", "GPU", etc. are considered compute nodes. Compute nodes can include:
    • Standard nodes - The compute node type that is standard on Nautilus.
    • Large-Memory nodes - Large-memory nodes have more memory than standard nodes and are intended for jobs that require a large amount of memory.
    • GPU nodes - GPU nodes are specialized accelerated compute nodes with additional hardware to speed up work, often with parallel processing that bundles frequently occurring tasks.
    • AI/ML nodes - AI/ML nodes are specialized GPU nodes intended for machine learning and other compute-intensive applications. There is no significant difference between AI/ML and GPU nodes.
    • Visualization nodes - Visualization nodes are GPU nodes intended for visualization applications.
    • High Core Performance nodes - These nodes have the benefit of higher clock speeds but the drawback of lower thread size and cache size.
    • Transfer nodes - These nodes exist to help users conserve allocation when transferring data.

A summary of the node configuration on Nautilus is presented in the following table.

Node Configuration
Node Type | Login | Standard | Large-Memory | Visualization | AI/ML | High Core Performance
Total Nodes | 14 | 1,304 | 16 | 16 | 32 | 32
Processor | AMD 7713 Milan | AMD 7713 Milan | AMD 7713 Milan | AMD 7713 Milan | AMD 7713 Milan | AMD 73F3 Milan
Processor Speed | 2 GHz | 2 GHz | 2 GHz | 2 GHz | 2 GHz | 3.4 GHz
Sockets / Node | 2 | 2 | 2 | 2 | 2 | 2
Cores / Node | 128 | 128 | 128 | 128 | 128 | 32
Total CPU Cores | 1,792 | 166,912 | 2,048 | 2,048 / 16 | 4,096 / 128 | 1,024
Usable Memory / Node | 433 GB | 237 GB | 998 GB | 491 GB | 491 GB | 491 GB
Accelerators / Node | None | None | None | 1 | 4 | None
Accelerator | n/a | n/a | n/a | NVIDIA A40 PCIe 4 | NVIDIA A100 SXM 4 | n/a
Memory / Accelerator | n/a | n/a | n/a | 48 GB | 40 GB | n/a
Storage on Node | 1.92 TB NVMe SSD | None | 1.92 TB NVMe SSD | None | 1.92 TB NVMe SSD | None
Interconnect | HDR InfiniBand | HDR InfiniBand | HDR InfiniBand | HDR InfiniBand | HDR InfiniBand | HDR InfiniBand
Operating System | RHEL | RHEL | RHEL | RHEL | RHEL | RHEL

2.3. Queue Information

Queues are where your jobs run. Think of queues as a resource used to control how your job is placed on the available hardware. Queues address hardware considerations and define policies such as what type of jobs can run in the queues, how long your job can run, how much memory your job can use, etc. Every queue has its own limits, behavior, and default values.

On a first-come, first-served basis, the scheduler checks whether the resources are available for the first job in the queue. If so, the job is executed without further delay. If not, the scheduler goes through the rest of the queue to check whether another job can be executed without extending the waiting time of the first job in the queue. If it finds such a job, the scheduler backfills the job. Backfill scheduling allows out-of-order jobs to use the reserved job slots if these jobs do not delay the start of another job. Therefore, smaller jobs (i.e., jobs needing only a few resources) usually encounter short queue times.

On Nautilus, quality of service is used to define scheduling priority and job limits. Your queue options are determined by your projects. Most users have access to the debug, standard, background, and HPC Interactive Environment (HIE) queues. Other queues exist, but access to these queues is restricted to projects that are granted special privileges due to urgency or importance, and they are not discussed here. To see the list of queues available on the system, use the show_queues command. Use the --wide option to see additional details.

Standard Queue
As its name suggests, the standard queue is the most common queue and should be used for normal day-to-day jobs.

Debug Queue
When determining why your job is failing, it is very helpful to use the debug queue. It is restricted to user testing and debugging jobs and has a short maximum walltime (see the queue table at the end of this section). Because of the resource and time limits, jobs progress through the debug queue more quickly, so you don't have to wait many hours to get results.

Background Queue
The background queue is a bit special. Although it has the lowest priority, jobs in this queue are not charged against your project allocation. You may choose to run in the background queue for several reasons:

  • You don't care how long it takes for your job to begin running.
  • You are trying to conserve your allocation.
  • You have used up your allocation.

HPC Interactive Environment (HIE) Queue
The HIE is both a queue configuration and a computing environment intended to deliver rapid response and high availability to support the following services:

  • Remote visualization
  • Application development for GPU-accelerated applications
  • Application development for other non-standard processors on a particular system

There is a very limited number of nodes available to the HIE queue, and they should be reserved for appropriate use cases. The use of the HIE queue for regular batch processing is considered abuse and is closely monitored. The HIE queue should not be used simply as a mechanism to give your regular batch jobs higher priority. Refer to the HIE User Guide for more information.

Priority Queues
The HPCMP has designated three restricted queues that require special permission for job submission. If your project is not authorized to submit jobs to these queues, your submission will fail. These queues include:

  • Urgent queue - Jobs belonging to DoD HPCMP Urgent Projects
  • High queue - Jobs belonging to DoD HPCMP High Priority Projects
  • Frontier queue - Jobs belonging to DoD HPCMP Frontier Projects

The following table describes the Slurm queues available on Nautilus:

Queue Descriptions and Limits on Nautilus (rows are listed in order of decreasing priority)
Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Description
Highest | urgent | 24 Hours | 16,384 | Jobs belonging to DoD HPCMP Urgent Projects
↓ | debug | 30 Minutes | 10,752 | Time/resource-limited for user testing and debug purposes
↓ | HIE | 24 Hours | 3,072 | Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide.
↓ | frontier | 168 Hours | 65,536 | Jobs belonging to DoD HPCMP Frontier Projects
↓ | high | 168 Hours | 65,536 | Jobs belonging to DoD HPCMP High Priority Projects
↓ | serial | 168 Hours | 1 | Single-core serial jobs
↓ | standard | 168 Hours | 16,384 | Standard jobs
↓ | transfer | 48 Hours | 1 | Data transfer for user jobs. See the Navy DSRC Archive Guide, section 5.2.
Lowest | background | 4 Hours | 4,096 | User jobs that are not charged against the project allocation
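
Before submitting, it can help to compare a planned request against the limits in the table above. The following is a minimal sketch of such a check. The function names queue_limits and check_request are made up for this example, and the limits are simply copied from the table, so treat them as a snapshot and use show_queues for the current values.

```shell
#!/bin/bash
# Limits copied from the queue table above, as "max_minutes max_cores".
# These values are a snapshot; show_queues reports the current limits.
queue_limits() {
    case "$1" in
        urgent)     echo "1440 16384" ;;
        debug)      echo "30 10752" ;;
        HIE)        echo "1440 3072" ;;
        frontier)   echo "10080 65536" ;;
        high)       echo "10080 65536" ;;
        serial)     echo "10080 1" ;;
        standard)   echo "10080 16384" ;;
        transfer)   echo "2880 1" ;;
        background) echo "240 4096" ;;
        *)          return 1 ;;
    esac
}

# check_request QUEUE MINUTES CORES
# Prints "ok" or the reason the request does not fit the queue.
check_request() {
    local queue=$1 minutes=$2 cores=$3 limits max_minutes max_cores
    if ! limits=$(queue_limits "$queue"); then
        echo "unknown queue: $queue"
        return 1
    fi
    read -r max_minutes max_cores <<< "$limits"
    if (( minutes > max_minutes )); then
        echo "walltime exceeds the $queue limit of $max_minutes minutes"
        return 1
    fi
    if (( cores > max_cores )); then
        echo "core count exceeds the $queue limit of $max_cores cores"
        return 1
    fi
    echo "ok"
}

check_request debug 20 256
check_request background 480 128 || true
```

The first call fits within the debug queue limits. The second asks for 8 hours in the background queue, which allows only 4, so the function reports the walltime problem instead.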

3. Anatomy of a Batch Script

The Slurm scheduler is currently running on Nautilus. It schedules jobs, manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch script. Slurm can manage both single-processor and multiprocessor jobs. The appropriate module is automatically loaded for you when you log in. This section is a brief introduction to Slurm. More advanced topics are discussed later in this document.

Batch Script Life Cycle
Let's start with what happens in the typical life cycle of a batch script, where an application is run in a batch submission:

  1. The user submits a batch script, which is put into the queue.
  2. Once the resources are allocated, the scheduler executes the batch script on one node, and the script has access to the typical environment variables the scheduler defines.
  3. The executable command in the script is encountered and executed. If using a launch command, the launch command examines the scheduler environment variables to determine the node list in the allocation, as well as parameters, such as the number of total processes, and launches the required number of processes.
  4. Once the executing process(es) have terminated, the batch script moves to the next line of execution or terminates if there are no more lines.

Batch Script Anatomy
A batch script is a small text file created with a text editor such as vi or notepad. Although the specifics of batch scripts may differ slightly from system to system, a basic set of components is always required, and a few others are always good ideas. The basic components of a simple batch script must appear in the following order:

  • Specify Your Shell
  • Scheduler Directives
    • Required Directives
    • Optional Directives
  • The Execution Block

To simplify things, several template scripts are included in Section 7, where you can fill in required commands and resources.
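
The ordering described above can be sketched as a minimal skeleton. Every directive value here (queue, walltime, Project_ID, node and task counts) is a placeholder, not a recommendation, and the execution block only prints a message.

```shell
#!/bin/bash
# 1. Shell specification: the first line above.

# 2. Scheduler directives. All values below are placeholders.
#SBATCH -q standard
#SBATCH --time=01:00:00
#SBATCH --account=Project_ID
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128

# 3. Execution block: everything from the first executable line onward.
start_msg="Job started on $(uname -n) at $(date)"
echo "$start_msg"
```

Because the directives are comments as far as the shell is concerned, this file also runs as an ordinary shell script, which is a convenient way to catch syntax errors before submitting.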

Cautions About Special Characters

Some special characters are not handled well by schedulers. This is especially true of the following:

  • ^M characters - Scripts created on an MS Windows system, which usually contain ^M characters, should be converted with dos2unix before use.
  • Smart quotes - MS Word autocorrects normal straight single and double quotation marks into "smart quotes." Ensure your script only uses normal straight quotation marks.
  • Em dash, en dash, and hyphens - MS Word often autocorrects regular hyphens into em dash or en dash characters. Ensure your script only uses normal hyphens.
  • Tab characters - Many editors insert tabs instead of spaces for various reasons. Ensure your script does not contain tabs.
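
The cautions above can be checked mechanically before you submit. The following is a minimal sketch using grep. The function name check_script is made up for this example, and the non-ASCII test is deliberately broad: it flags any byte outside printable ASCII, which includes smart quotes and em or en dashes but also any other non-ASCII text.

```shell
#!/bin/bash
# check_script FILE: report characters that commonly break batch scripts.
# Returns 0 when the file looks clean and 1 otherwise.
check_script() {
    local file=$1 status=0
    if grep -q $'\r' "$file"; then
        echo "$file: contains carriage returns (^M); convert with dos2unix"
        status=1
    fi
    if grep -q $'\t' "$file"; then
        echo "$file: contains tab characters"
        status=1
    fi
    # In the C locale, bytes above 127 are not printable, so this catches
    # smart quotes, em dashes, and en dashes (all multi-byte in UTF-8).
    if LC_ALL=C grep -q '[^[:print:][:space:]]' "$file"; then
        echo "$file: contains non-printable or non-ASCII characters"
        status=1
    fi
    return "$status"
}

# Demonstration on a throwaway file with Windows line endings.
demo=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$demo"
check_script "$demo" || true
rm -f "$demo"
```

In practice you would call check_script on your own batch script and fix anything it reports before submitting.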

3.1. Specify Your Shell

Your batch script is a shell script, so it's good practice to specify which shell your script is written in. If you do not specify your shell within the script, the scheduler uses your default login shell. To tell the scheduler which shell to use, the first line of your script should be:

    #!/bin/shell

where shell is either bash (Bourne-Again Shell), sh (Bourne Shell), ksh (Korn shell), csh (C shell), tcsh (enhanced C shell), or zsh (Z shell).

3.2. Required Scheduler Directives

After specifying the script shell, the next section of the script sets the scheduler directives, which define your resource requests to the scheduler. These include how many nodes are needed, how many cores per node, what queue the job will run in, and how long these resources are required (walltime).

Directives are a special form of comment, beginning with #SBATCH. As you might suspect, the # character tells the shell to ignore the line, but the scheduler reads these lines and uses the directives to set various values. IMPORTANT!! All directives MUST come before the first line of executable code in your script, otherwise they are ignored.
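
Because a misplaced directive is ignored without any error, it is easy to miss. The following is a minimal sketch of a check that lists #SBATCH lines found after the first executable line. The function name find_late_directives is made up for this example, and the rule it applies (comments and blank lines do not count as executable code) is a simplification of how the scheduler reads a script.

```shell
#!/bin/bash
# find_late_directives FILE: list #SBATCH lines that appear after the first
# executable line, where the scheduler would ignore them.
find_late_directives() {
    awk '
        /^#SBATCH/     { if (seen_code) print FNR ": " $0; next }
        /^[ \t]*(#|$)/ { next }          # comments and blank lines
                       { seen_code = 1 } # first executable line reached
    ' "$1"
}

# Demonstration: the --nodes directive below comes too late.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --time=00:10:00
cd "$WORKDIR"
#SBATCH --nodes=2
EOF
find_late_directives "$demo"
rm -f "$demo"
```

Each line of output gives the line number and text of a directive that should be moved above the first command in the script.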

The scheduler has numerous directives to assist you in setting up how your job will run on the system. Some directives are required. Others are optional. Required directives specify resources needed to run the application. If your script does not define these directives, your job will be rejected by the scheduler or use center-defined defaults. Caution: default values may not be in line with your job requirements and may vary by center. Optional directives are discussed in Section 5.

To schedule your job, the scheduler must know:

  • The queue to run your job in.
  • The maximum time needed for your job.
  • The Project ID to charge for your job.
  • The number of nodes you are requesting.
  • The number of processes per node you are requesting.
  • The number of cores per node.
  • The total number of cores.
  • How nodes should/can be allocated.

3.2.1. Specifying the Queue

You must choose which queue you want your job to run in. Each queue has different priorities and limits and may target different node types with different hardware resources. To specify the queue, include the following directive:

    #SBATCH -q queue_name

3.2.2. How Long to Run

Next, the scheduler needs the maximum time you expect your job to run. This is referred to as walltime, as in clock on the wall. The walltime helps the scheduler identify appropriate run windows for your job. For accounting purposes, your allocation is charged for how long your job actually runs, which is typically less than the requested walltime.

In estimating your walltime, there are three things to keep in mind.

  • Your estimate is a limit. If your job hasn't completed within your estimate, it is terminated. So, you should always add a buffer to account for variability in run time because you don't want your job to be killed when it is 99.9% complete. And, if your job is terminated, your account is still charged for the time.
  • Your estimate affects how long your job waits in the queue. In general, shorter jobs run before longer jobs. If you specify a time that is too long, your job will likely sit in the queue longer than it should.
  • Each queue has a maximum time limit. You cannot request more time than the queue allows.

To specify your walltime, include the following directive:

    #SBATCH --time=DD-HH:MM:SS

or

    #SBATCH -t DD-HH:MM:SS
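
As an illustration of adding a buffer to a walltime estimate, the following sketch pads an estimate given in minutes by a percentage and prints it in the DD-HH:MM:SS form. The function name walltime_with_buffer and the 20 percent buffer are assumptions made for this example. The right buffer depends on how much your run times vary.

```shell
#!/bin/bash
# walltime_with_buffer MINUTES PERCENT: add PERCENT extra time to an estimate
# of MINUTES and print the result as DD-HH:MM:SS for use with --time.
walltime_with_buffer() {
    local minutes=$1 percent=$2
    # Integer arithmetic, rounding the padded estimate up to a whole minute.
    local total=$(( (minutes * (100 + percent) + 99) / 100 ))
    printf '%02d-%02d:%02d:00\n' \
        $(( total / 1440 )) $(( total % 1440 / 60 )) $(( total % 60 ))
}

# A run expected to take about 10 hours, padded by 20 percent.
echo "#SBATCH --time=$(walltime_with_buffer 600 20)"
```

Remember that the padded value must still fit within the maximum time limit of the queue you choose.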

3.2.3. Your Project ID

The scheduler needs to know which project ID to charge for your job. You can use the show_usage command to find the projects available to you and their associated project IDs. In the show_usage output, project IDs appear in the column labeled "Subproject."

Note: If you have access to multiple projects, remember the project you specify may limit your choice of queues.

To specify the project ID for your job, include the following directive:

    #SBATCH --account=Project_ID

or

    #SBATCH -A Project_ID

3.2.4. Number of Nodes, Processes, and Cores

There are two types of computational resources: hardware (compute nodes and cores) and virtual (processes). A node is a computer system with a single operating system image, a unified memory space, and one or more cores. Every script must include directives for the node, process, and task selection. Nodes are allocated exclusively to your job and not shared with other users.

Before Slurm can run your job, it needs to know how many nodes you want, the total number of tasks (processes), and the number of tasks per node. In general, you would specify one task per core, but you might want fewer tasks depending on the programming model you are using. See Example Scripts (below) for alternate use cases.

The number of nodes, the number of tasks, and the number of tasks per node are specified using the directives:

    #SBATCH --nodes=N1
    #SBATCH --ntasks=N2
    #SBATCH --ntasks-per-node=N3

or

    #SBATCH -N N1
    #SBATCH -n N2
    #SBATCH --ntasks-per-node=N3

where N1 specifies the number of nodes you are requesting, N2 is the number of tasks, and N3 is the number of tasks per node.

Generally, you only need to use any two of these three directives. For example, you could specify the total number of nodes and total tasks and let Slurm decide the number of tasks per node. In this case the directives would be:

    #SBATCH --nodes=N1
    #SBATCH --ntasks=N2

where N1 is the number of nodes you are requesting and N2 is the total number of tasks.

In general, the --ntasks-per-node default is the total number of cores on the node, but there may be situations where you might want to specify a lower value. If you are porting a PBS script to Slurm, using --nodes and --ntasks-per-node is the simplest conversion for the select and mpiprocs values.
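
The relationship between these three values is simple arithmetic, sketched below. The function name nodes_needed and the task count of 300 are made up for this example. The figure of 128 tasks per node assumes one task per core on a standard node, as listed in the node table in Section 2.2.

```shell
#!/bin/bash
# nodes_needed NTASKS TASKS_PER_NODE: smallest node count that fits NTASKS.
nodes_needed() {
    local ntasks=$1 per_node=$2
    echo $(( (ntasks + per_node - 1) / per_node ))
}

ntasks=300
tasks_per_node=128   # cores per standard node, from the table in Section 2.2
nodes=$(nodes_needed "$ntasks" "$tasks_per_node")

# Any two of the three directives are enough; all three are shown for clarity.
echo "#SBATCH --nodes=$nodes"
echo "#SBATCH --ntasks=$ntasks"
echo "#SBATCH --ntasks-per-node=$tasks_per_node"
```

Here 300 tasks at up to 128 per node need three nodes, and the last node is only partly filled. Since nodes are allocated exclusively to your job, choosing a task count that is a multiple of the tasks per node avoids leaving cores idle.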

3.2.5. SLB Directives

The Shared License Buffer (SLB) regulates shared license usage across all HPC systems by granting and enforcing license reservations for certain commercial software packages. If your job requires enterprise licenses controlled by SLB, you must enter the software and requested number of licenses, using the following directive:

    #SBATCH --licenses=software:number_of_licenses

or

    #SBATCH -L software:number_of_licenses

To request licenses for multiple applications, separate them by commas, as follows:

    #SBATCH -L software:number_of_licenses,software:number_of_licenses

For more information about SLB, please see the SLB User Guide or the SLB Quick Reference Guide.

3.3. The Execution Block

After the directives have been supplied, the execution block begins. The execution block is the section of your script containing the actual work to be done. This includes any modules to be loaded and commands to be executed. This could also include executing or sourcing other scripts.

3.3.1. Basic Execution Scheme

The following describes the most basic scheme for a batch script. PLEASE ADOPT THIS BASIC EXECUTION SCHEME IN YOUR OWN BATCH SCRIPTS.

Setup

  • Set environment variables, load modules, create directories, transfer input files.
  • Changing to the right directory - By default Slurm runs your job in the directory from which it is submitted, which can cause problems. To avoid this, cd into your $WORKDIR directory to run it on the local high-speed disk.

Launching the executable

  • Launch your executable using the launch command on Nautilus specific to your programming model.

Cleaning up

  • Archive your results and remove temporary files and directories.
  • Copy any necessary files to your home directory.

3.3.1.1. Setup

Using the batch script to set up your environment ensures your script runs in an automatic and consistent manner, but not all environment-setup tasks can be accomplished via scheduler directives, so you may have to set some environment variables yourself. Remember that commands to set up the environment must come after the scheduler directives. For MPI jobs, each MPI process is separate and inherits the environment set up by the batch script.

As part of the Baseline Configuration (BC) initiative, there is a common set of environment variables on all HPCMP allocated systems. These variables are predefined in your login, batch, and compute environments, making them automatically available at each center. We encourage you to use these variables in your scripts where possible. Doing so helps to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems within the HPCMP. Some BC environment variables are shown in the table below.

HPCMP Baseline Configuration Initiative Common Variables
Variable Description
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk). $WORKDIR is visible to both the login and compute nodes and should be used for temporary storage of active data related to your batch jobs.
$CENTER Your directory on the Center-Wide File System (CWFS).
$ARCHIVE_HOME This is your directory on the archival file system that serves a given compute platform.

The complete list of BC environment variables is available in BC Policy FY05-04.

Setup considerations in customizing your batch job may include:

  • Creating a directory in $WORKDIR for your job to run in.
      NEWDIR=$WORKDIR/MyDir
      mkdir -p $NEWDIR
  • Changing to the directory from which the job will run.
      cd $NEWDIR
  • Copying required input files to the job directory.
      cp From_directory/file $NEWDIR
  • Ensuring required modules are loaded.
      module load module_name
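
Put together, the setup steps above might look like the following sketch. The directory name MyDir, the input file input.dat, and the module name are placeholders. The fallback to a temporary directory when $WORKDIR is undefined is an assumption added only so the sketch can be tried outside the HPCMP environment.

```shell
#!/bin/bash
# Fall back to a temporary directory when $WORKDIR is not defined, so the
# sketch can also be tried outside the HPCMP environment.
WORKDIR=${WORKDIR:-$(mktemp -d)}

# Include the job ID when running under Slurm so each job gets its own
# directory; "test" is used when no job ID is set.
NEWDIR=$WORKDIR/MyDir.${SLURM_JOB_ID:-test}
mkdir -p "$NEWDIR"
cd "$NEWDIR" || exit 1

# Stand-in for "cp From_directory/file $NEWDIR"; input.dat is hypothetical.
echo "example input" > input.dat

# Load the modules your application needs (module_name is a placeholder):
# module load module_name

echo "Running in $(pwd)"
```

Quoting "$NEWDIR" protects against spaces in the path, and stopping when cd fails prevents the rest of the script from running in the wrong directory.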

3.3.1.2. Launching an Executable

The command you'll use to launch a parallel executable within a batch script depends on the parallel library loaded at compile and execution time, the programming model, and the machine used to launch the application. It does not depend on the scheduler. Launch commands on Nautilus are discussed in detail in the Nautilus User Guide.

On Nautilus, the mpirun command is used to launch a parallel executable. The basic syntax for launching an MPI executable is:

    mpirun args executable pgmargs

where args are command-line arguments for mpirun, executable is the name of an executable program, and pgmargs are command-line arguments for the executable.
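
As an illustration of this syntax, the following sketch builds a launch command from the task count that Slurm provides. The executable ./my_app and its argument input.dat are hypothetical. The -np option is the process-count option in common MPI implementations, but the options accepted by mpirun depend on the MPI library you have loaded, so check the Nautilus User Guide before relying on it.

```shell
#!/bin/bash
# args = -np N, executable = ./my_app, pgmargs = input.dat (all illustrative).
# Outside a job, SLURM_NTASKS is unset and 4 is used as a stand-in.
launch_cmd=(mpirun -np "${SLURM_NTASKS:-4}" ./my_app input.dat)

if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
    "${launch_cmd[@]}"
else
    # Without mpirun or the executable, just show the command.
    echo "Would run: ${launch_cmd[*]}"
fi
```

Storing the command in an array keeps each argument intact even if a path or argument contains spaces.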

3.3.1.3. Cleaning Up

You are responsible for cleaning up and monitoring your workspace. The clean-up process generally entails deleting unneeded files and transferring important data left in the job directory after the job is completed. It is important to remember that $WORKDIR is a "scratch" file system and is not backed up. Currently, $WORKDIR files older than 21 days are subject to being purged. If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. Similarly, files transferred to $CENTER are not backed up, and files older than 180 days are subject to being purged. To prevent automatic deletion by the purge scripts, important data should be archived. See the Navy DSRC Archive Guide for more information on archiving data.
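
A clean-up step along these lines can be sketched as follows. The file names and the temporary stand-in for the job directory are assumptions made so the sketch can run anywhere. In a real job, the job directory would be the one you created under $WORKDIR, and the bundled results would then be copied or archived as described in the Navy DSRC Archive Guide.

```shell
#!/bin/bash
# Stand-in job directory so the sketch is runnable anywhere; in a real job
# this would be the directory under $WORKDIR where the job ran.
JOBDIR=$(mktemp -d)
echo "result data" > "$JOBDIR/output.dat"
echo "scratch"     > "$JOBDIR/tmp.scratch"

# Remove temporary files first so they are not bundled with the results.
rm -f "$JOBDIR"/*.scratch

# Bundle what is left into one tar file next to the job directory.
RESULTS=$(dirname "$JOBDIR")/results_${SLURM_JOB_ID:-test}.tar.gz
tar -czf "$RESULTS" -C "$JOBDIR" .

# Delete the job directory only if the bundle was written successfully.
if [ -s "$RESULTS" ]; then
    rm -rf "$JOBDIR"
fi
echo "Results saved to $RESULTS"
```

Checking that the tar file exists and is not empty before deleting the job directory guards against losing results when the bundling step fails.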

3.3.2. Advanced Execution Methods

A batch script is a text file containing directives and execution steps you "submit" to Slurm. This script can be as simple as the basic execution scheme discussed above or include more complex customizations, such as compiling within the script or loading a file with a list of modules required for the executable. Below are additional considerations for the execution block of the batch script.

3.3.2.1. Environment Variables set by the Scheduler

In addition to environment variables inherited from your user environment (see Section 3.3.1.1), Slurm sets other environment variables for batch jobs. The following table contains commonly used Slurm environment variables.

Slurm Environment Variables
Variable | Description
$SLURM_JOB_ACCOUNT | The Project ID charged for the job
$SLURM_JOB_ID | The job identifier assigned to a job or job array by the batch system
$SLURM_JOBID | (Deprecated) Identical to $SLURM_JOB_ID. Included for backwards compatibility
$SLURM_SUBMIT_DIR | The absolute path of the directory from which the job was submitted
$SLURM_JOB_NAME | The job name supplied by the user
$SLURM_JOB_PARTITION | The partition in which the job executes
$SLURM_JOB_QOS | The Quality of Service (QOS), i.e., job queue, of the job
$SLURM_SUBMIT_HOST | The hostname of the node from which sbatch was executed
$SLURM_NTASKS | The total number of tasks in the job
$SLURM_JOB_NODELIST | The list of nodes allocated to the job
$SLURM_JOB_NUM_NODES | The total number of nodes allocated to the job
$SLURM_ARRAY_JOB_ID | The job ID for a job array
$SLURM_ARRAY_TASK_ID | The index number for a sub job in a job array
$SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array
$SLURM_ARRAY_TASK_MAX | A job array's maximum index number
$SLURM_ARRAY_TASK_MIN | A job array's minimum index number
$SLURM_ARRAY_TASK_STEP | A job array's index step size
See the sbatch man page for a complete list of environment variables set by Slurm.
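
One common use of these variables is to record what a job was given at the top of its output file. The following is a minimal sketch. The function name job_summary is made up for this example, and each variable falls back to the text "unset" so the function can also be run outside a Slurm job.

```shell
#!/bin/bash
# Print a short job summary. Outside a Slurm job the variables are not
# defined, so each one falls back to the text "unset".
job_summary() {
    echo "Job ID:       ${SLURM_JOB_ID:-unset}"
    echo "Job name:     ${SLURM_JOB_NAME:-unset}"
    echo "Project ID:   ${SLURM_JOB_ACCOUNT:-unset}"
    echo "Queue (QOS):  ${SLURM_JOB_QOS:-unset}"
    echo "Nodes:        ${SLURM_JOB_NUM_NODES:-unset}"
    echo "Tasks:        ${SLURM_NTASKS:-unset}"
    echo "Submitted in: ${SLURM_SUBMIT_DIR:-unset}"
}

job_summary
```

Calling job_summary near the start of the execution block makes it easier to match an output file to the job that produced it later on.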

Baseline Configuration Policy (BC Policy FY05-04) defines an additional set of environment variables with related functionality available on all systems. These variables can also be found in the Nautilus User Guide.

3.3.2.2. Loading Modules

Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so commands for applications can be found. For a full discussion on modules see the Nautilus User Guide and the Navy DSRC Modules Guide.

To ensure required modules are loaded at runtime, you can load them within the batch script before the executable code by using the command:

    module load module_name

3.3.2.3. Compiling on the Compute Nodes

You can compile on the compute nodes, either interactively or within a job script. On most systems this is the same as compiling on the login nodes, though in some cases there are differences between the login and compute nodes. See the Nautilus User Guide for more information.

3.3.2.4. Using the Transfer Nodes

Unlike PBS and LSF, Slurm uses transfer nodes rather than a transfer queue. Before a job can run, the input data needs to be copied into a directory accessible by the job script. This can be done in a separate job script using the transfer node. Because jobs on a transfer node cost no allocation, the transfer node is advantageous for large file transfers such as during data staging or cleanup to move data left in your $WORKDIR after your application completes.

When using a transfer node, keep in mind:

  • The transfer node may have additional bandwidth for data transfers.
  • You share the node with other users, so your available compute and memory are likely lower.
  • Your allocation is not charged when using a transfer node.

See Example Scripts for an example for using the transfer nodes.

3.4. Requesting Specialized Nodes

Node types are selected by specifying one of the following node features: standard, viz, mla, xfer, bigmem, highclock. This is done using either the --prefer option (soft) or the --constraint or -C option (hard). The soft option asks Slurm to provide the node type if available but allows for another node type if not available. For example:

    --prefer=viz

The hard option requires your job to wait in the queue until the exact request can be met. Examples of the hard option are provided below.

The standard node is selected by default. There is no need to specify a standard node unless as part of a heterogeneous request.

You may also specify GPU hardware through the --gres option as discussed below.

Note: Slurm offers short versions (-C) and long versions (--constraint) of many options.

3.4.1. GPU Nodes

The graphics processing unit, or GPU, has become one of the most important types of computing technology. The GPU is made up of many synchronized cores working together for specialized tasks.

GPU nodes can be requested as viz or mla nodes (see respective sections below) or by specifying GPU hardware (i.e., a100 or a40) using the --gres directive, as follows:

    #SBATCH --gres=gpu:a100:4

or

    #SBATCH --gres=gpu:a40:1

If you don't have a preference, simply ask for a number of GPUs:

    #SBATCH --gres=gpu:2

3.4.2. Visualization Nodes

Visualization nodes are GPU nodes with specialized hardware or software to support visualization tasks. To request a visualization node, add the --constraint=viz directive, as follows:

    #SBATCH --constraint=viz

or

    #SBATCH -C viz

3.4.3. AI/ML Nodes

AI/ML nodes are GPU nodes with specialized hardware and software to support machine-learning tasks. To request an AI/ML node, use the directive:

    #SBATCH --constraint=mla

or

    #SBATCH -C mla

3.4.4. High Core Performance Nodes

High Core Performance nodes have the benefit of higher clock speeds but the drawback of lower thread size and cache size. To request High Core Performance nodes, use the following directive:

    #SBATCH --constraint=highclock

or

    #SBATCH -C highclock

3.4.5. Transfer Nodes

Transfer nodes exist to help users conserve allocation when transferring data. To request transfer nodes, use the following directive:

    #SBATCH --constraint=xfer

or

    #SBATCH -C xfer

3.5. Advanced Considerations

3.5.1. Heterogeneous Computing (Using Multiple Node Types) and Node Distribution

Heterogeneous computing refers to using more than one type of node, such as CPU, GPU, or large-memory nodes. By assigning different workloads to specialized nodes suited for diverse purposes, performance and energy efficiency can be vastly improved. Node distribution refers to assigning tasks to groups or chunks of nodes. This section discusses how to schedule different node types and organize groups of nodes (heterogeneous or homogeneous) so they can be assigned different tasks.

On Nautilus, heterogeneous nodes are selected using the --constraint or -C directive, followed by the constraint format: "[type*number&type*number...]". For example, to select three mla nodes, one viz node, two bigmem nodes, and 94 standard nodes, use the following:

    #SBATCH --nodes=100 --constraint="[mla*3&viz*1&bigmem*2&standard*94]"
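
In a request like this, the per-type counts must add up to the value given to --nodes. The following sketch builds the directive from a list of type and count pairs so the two cannot drift apart. The function name build_constraint is made up for this example, and it does not check that the node type names are valid on Nautilus.

```shell
#!/bin/bash
# build_constraint TYPE COUNT [TYPE COUNT ...]: print the matching
# --nodes and --constraint directive for a heterogeneous request.
build_constraint() {
    local spec="" total=0
    while (( $# >= 2 )); do
        spec+="${spec:+&}$1*$2"       # join the pairs with "&"
        total=$(( total + $2 ))       # keep a running node total
        shift 2
    done
    echo "#SBATCH --nodes=$total --constraint=\"[$spec]\""
}

build_constraint mla 3 viz 1 bigmem 2 standard 94
```

The call shown reproduces the 100-node request from the example above.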

The distribution of tasks to the nodes and cores on those nodes can be controlled using the --distribution or -m directive, which has the following options:

--distribution=*|block|cyclic|arbitrary|plane=size
                   [:*|block|cyclic|fcyclic[:*|block|cyclic|fcyclic]]
                   [,Pack|NoPack]

The first distribution method (before the first ":") controls the distribution of tasks to nodes. The second distribution method controls the distribution of tasks across sockets. The third controls the distribution of tasks across cores. The second and third distributions apply only if task pinning is enabled. Pinning threads (for shared-memory parallelism) or binding processes (for distributed-memory parallelism) is an advanced way to control how your system distributes the threads or processes across the available cores.

The following table describes the distribution options:

Slurm Distribution Methods
Variable Description
block Distributes tasks to a node such that consecutive tasks share a node. This is the default distribution method.
cyclic Distributes tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion).
plane The tasks are distributed in blocks of size size. The size must be given or $SLURM_DIST_PLANESIZE must be set. The number of tasks distributed to each node is the same as for cyclic distribution, but the task IDs assigned to each node depend on the plane size.
arbitrary Processes are allocated in the order listed in the file designated by the environment variable $SLURM_HOSTFILE. If this variable is set, it overrides any other method specified. If not set, the method defaults to block.
fcyclic Distributes the tasks to consecutive sockets in a round-robin fashion across the sockets. Tasks requiring more than one core have each core allocated in a cyclic fashion across sockets.
Pack Rather than distributing a job step's tasks evenly across its allocated nodes, pack them as tightly as possible on the nodes. This only applies when the "block" task distribution method is used.
NoPack Rather than packing a job step's tasks as tightly as possible on the nodes, distribute them evenly.
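To make the difference between block and cyclic concrete, the sketch below prints which node each task would land on under the two descriptions in the table. It is an illustration only, assuming 8 tasks that divide evenly across 4 nodes; it does not call Slurm and does not reproduce Slurm's own placement code.

```shell
#!/bin/bash
# Illustration only: mimics the node-level mapping described in the table
# for a task count that divides evenly across nodes.
ntasks=8
nnodes=4
per_node=$(( ntasks / nnodes ))

# block: consecutive tasks share a node
block_node() { echo $(( $1 / per_node )); }
# cyclic: consecutive tasks go to consecutive nodes (round-robin)
cyclic_node() { echo $(( $1 % nnodes )); }

for (( task = 0; task < ntasks; task++ )); do
    printf 'task %d -> block: node %d, cyclic: node %d\n' \
        "$task" "$(block_node "$task")" "$(cyclic_node "$task")"
done
```

With block, tasks 0 and 1 share node 0; with cyclic, tasks 0 and 4 share node 0.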

3.6. Advance Reservation System Jobs

The Advance Reservation Service (ARS) provides a web-based interface to batch schedulers on most allocated HPC resources in the HPCMP. This service allows allocated users to reserve resources for use at specific times and for specific durations. It works in tandem with selected schedulers to allow restricted access to those reserved resources.

For Advance Reservation System (ARS) jobs you must submit a reservation request. Upon successful completion of the reservation request, a confirmation page is presented, and an email is sent notifying you of all pertinent data concerning the reservation, including the ARS_ID. It is your responsibility to either use or cancel your reservation. Unless you cancel it, your allocation is charged for the full time on the reserved nodes whether you use them or not. For information, such as how to cancel a reservation, see the ARS User Guide.

To use the reserved nodes, you must log onto the selected system and submit a job specifying the ARS_ID, as follows: #SBATCH --reservation=ARS_ID

4. Submitting and Managing Your Job

Once your batch script is ready, you need to submit it to the scheduler for execution, and the scheduler will generate a job according to the parameters set in the script. Submitting a batch script can be done with the sbatch command: sbatch batch-script-name

Because batch scripts specify the resources for your job, you won't need to specify any resources on the command line. However, you can override or add any job parameter by providing the specific resource as a flag directly on the sbatch command line. Directives supplied in this way override the same directives if they are already included in your script. The syntax to supply directives on the command line is the same as within a script except #SBATCH is not used. For example, to override the time limit use: sbatch --time=days-hh:mm:ss batch-script-name
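The days-hh:mm:ss value is easy to mistype by hand. As a minimal sketch, the helper below (minutes_to_walltime is a hypothetical name) converts a whole number of minutes into that form and then prints the resulting sbatch command as a dry run instead of submitting anything.

```shell
#!/bin/bash
# Convert a whole number of minutes to the days-hh:mm:ss form used by --time.
minutes_to_walltime() {
    local total=$1
    printf '%d-%02d:%02d:00\n' \
        $(( total / 1440 )) $(( (total % 1440) / 60 )) $(( total % 60 ))
}

walltime=$(minutes_to_walltime 1530)   # 1 day, 1 hour, 30 minutes
# Dry run: print the command instead of submitting it.
echo "sbatch --time=${walltime} batch-script-name"
```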

4.1. Scheduler Dos and Don'ts

When submitting your job, it's important to keep in mind these general guidelines:

  • Request only the resources you need.
  • Be aware of limits. If you request more resources than the hardware can offer, the scheduler might not reject the job, and it may be stuck in the queue forever.
  • Be aware of the available memory limit. In general, the available memory per core is (memory_per_node)/(cores_in_use_on_the_node).
  • The scheduler might not support pinning, so you might want to do this manually.
  • There may be per-user quotas on the system.
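The memory-per-core guideline above is simple integer arithmetic. The sketch below uses placeholder numbers (240000 MB and 48 cores), which are assumptions for illustration and not Nautilus specifications; inside a job, $BC_MEM_PER_NODE (reported in integer MB) could supply the first argument.

```shell
#!/bin/bash
# memory per core = memory_per_node / cores_in_use_on_the_node
# Arguments: memory per node in MB, number of cores in use on the node.
mem_per_core_mb() { echo $(( $1 / $2 )); }

# Placeholder values, not Nautilus hardware figures.
echo "All 48 cores in use:  $(mem_per_core_mb 240000 48) MB per core"
echo "Only 24 cores in use: $(mem_per_core_mb 240000 24) MB per core"
```

Using fewer cores per node leaves more memory for each one, which is the trade-off the guideline describes.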

You should also keep in mind that Nautilus is a shared resource. Behavior that negatively impacts other users or stresses the system administrators is not desirable. Below are some suggestions to be followed for a happy HPC community.

  • Submitting 1000 jobs to perform 1000 tasks is naïve and can overload the scheduler. If these tasks are serial, it also wastes your allocation hours across 1000 nodes. Job arrays are strongly encouraged; see Section 6.
  • If you expect your job to run for several days, split it into smaller jobs. You'll get reduced queue time and increased stability (e.g., against node failure). You can either split your job manually and submit as separate jobs or submit your jobs sequentially within a single script as described in the Navy DSRC Archive Guide.
  • Send your job to the right queue. It is important to understand in which queue the scheduler will run your job as most queues have core and walltime limits.
  • Do not run compute-intensive tasks from a login node. Doing so slows the login nodes, causes login delays for other users, and may prompt administrators to terminate your tasks, often without notice.

4.2. Job Management Commands

Once you submit your job, there are commands available to check and manage your job submission. For example:

  • Determining the status of your job
  • Cancelling your job
  • Putting your job on hold
  • Releasing a job from hold

The table below contains commands for managing your jobs. Use man command or the command --help option to get more information about a command.

Slurm Job Management Commands
Command Description
sacct -j job_id -l Display job accounting data from a completed job.
sbatch script_file Submit a job.
srun --gres=help
or
srun --gres=list
Display a list of available resources.
scancel job_id Delete a job.
scontrol hold job_id Place a job on hold.
scontrol release job_id Release a job from hold.
sinfo Reports the state of queues and nodes managed by Slurm.
sinfo -N or --Node Display a list of nodes
scontrol show nodes Display a list of nodes with detailed node information
squeue Display the list of all jobs across all queues.
sacctmgr show qos format="name%-14,Description%-20,priority,maxwall" Display a neatly formatted list of queues
squeue -j job_id Check the status of a job.
squeue -u user_name
or
--user=user_name
Check the status of all jobs submitted by the user. Can use $USER in place of user_name.
squeue --format "specs" Display custom job information. For example, to display an output similar to PBS qstat output (i.e., account, user, queue, job name, job status, job execution time) use:
squeue --format="%.10a %.10u %.10q %.12j %.4t %M"
sstat job_id Display information about the resources utilized by a running job or job step. Can only sstat your jobs.

4.3. Job States

When checking the status of a job, the state of a job is listed. Jobs typically pass through several states during their execution. The main job cycle states are QUEUED/PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of each state follows.

Slurm Job States
State Description
PD The job is queued, eligible to run, or routed.
R The job is running.
S Job has an allocation, but execution has been suspended, and CPUs have been released for other jobs.
CG The job is in the process of completing.
CD The job is completed with an exit code of zero.
CA The job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
Slurm has an extensive list of states. See man squeue for a complete list.

4.4. Baseline Configuration Common Commands

The Baseline Configuration Team (BCT) has established the following set of common commands that are consistent across all systems. Most are custom and not inherent in the scheduler.

BCT Common Commands
Command Description
bcmodule Executes like the standard module command but has numerous improvements and new features.
check_license Checks the status of HPCMP shared applications grouped into two distinct categories: Software License Buffer (SLB) applications and non-SLB applications.
cqstat Displays information about jobs in the batch queueing system.
node_use Displays memory-use and load-average information for all login nodes of the system on which it is executed.
qflag Collects information about the user and the user's jobs and sends a message about the reported problem without any need to leave the HPC system.
qhist Prints a full report on a running or completed batch job with an option to include chronological log file entries for the job from the batch queueing system. The command can also list all completed jobs for a given user over a specified number of days in the past.
qpeek Returns the standard output (stdout) and standard error (stderr) messages for any submitted batch job from the start of execution until the job run is complete.
qview Displays various reports about jobs in the batch queuing system.
show_queues Displays current batch queuing system information.
show_storage Produces two reports on quota and usage information.
show_usage Produces two reports on the allocation and usage status of each subproject under which a user may compute.

5. Optional Directives

In addition to the required directives mentioned above, Slurm has many other directives, but most users only use a few of them. Some of the more useful optional directives are summarized below.

5.1. Job Application Directive (Unsupported)

The application directive allows you to identify the application being used by your job. This directive is used for HPCMP accountability and administrative purposes and helps the HPCMP accurately assess application usage and ensure adequate software licenses and appropriate software are purchased. While not required, using this directive is strongly encouraged as it provides valuable data to the HPCMP regarding application use.

Slurm currently does not support an application directive.

5.2. Job Name Directive

The job_name directive allows you to give your job a name that's easier to remember than a numeric job ID. The Slurm environment variable, $SLURM_JOB_NAME, inherits this value and can be used instead of the job ID to create job-specific output directories. To use this directive, add the following to your batch script: #SBATCH --job-name=job_name or #SBATCH -J job_name or add the option to your sbatch command: sbatch --job-name=job_name

5.3. Job Reporting Directives

Job reporting directives allow you to control what happens to standard output and standard error messages generated by your script. They also allow you to specify e-mail options to be executed at the beginning and end of your job. The following table and sections describe the job reporting directives:

Slurm Job Reporting Directives
Directive Description
-e filename, or
--error=filename
Redirect standard error (stderr) to the named file.
-o filename, or
--output=filename
Redirect standard output (stdout) to the named file.
--open-mode=mode_to_open Specifies whether output and error files are appended or overwritten. A value of append adds the output to the file. A value of truncate overwrites the file if it exists.
--mail-user=user_name Linux user name to notify about state changes as defined by --mail-type. The default value is the submitting user.
--mail-type=event Send email when the specified event occurs. Valid events include BEGIN, END, FAIL, ALL, and TIME_LIMIT.
--mail-type=END Send email when the job ends.
--mail-type=BEGIN,END Send email when the job begins and ends.

5.3.1. Redirecting stderr and stdout

By default, messages written to stdout and stderr are combined and written to a single file named slurm-%j.out, where %j is replaced by the job ID. To change this behavior, use the -o or --output and -e or --error directives to redirect stdout and stderr messages to different named files. To instruct Slurm to write stdout to a specific file, use the directive: #SBATCH --output filename.out or #SBATCH -o filename.out To instruct Slurm to write stderr to a specific file, use the directive: #SBATCH --error filename.err or #SBATCH -e filename.err To write both streams to a single named file, specify only the --output directive.

By default, Slurm overwrites output and error files. To append instead of overwriting, use the following directive: #SBATCH --open-mode=append
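The fragment below sketches how these reporting directives might sit together at the top of a batch script. It is not a complete job script (the required directives are omitted), and the report function and its messages are hypothetical. Run outside Slurm, the #SBATCH lines are ordinary comments and both messages simply go to the terminal.

```shell
#!/bin/bash
#SBATCH --output=filename.out
#SBATCH --error=filename.err
#SBATCH --open-mode=append

# Hypothetical function that writes one line to each stream.
report() {
    echo "progress: step finished"            # stdout: filename.out under Slurm
    echo "warning: something looks off" >&2   # stderr: filename.err under Slurm
}

report
```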

5.3.2. Setting up E-mail Alerts

Mail is sent to the email address associated with your pIE account. Many users want to be notified when their jobs begin and end. The --mail-user and --mail-type directives make this possible. If you use the --mail-user directive, you must supply the directive with one or more e-mail addresses to be used. For example: #SBATCH --mail-user=user

If you use the --mail-type directive, you must supply the directive with the event type, which can be BEGIN, END, FAIL, ALL, TIME_LIMIT, TIME_LIMIT_90 (90% of time limit), TIME_LIMIT_80, TIME_LIMIT_50, or ARRAY_TASKS (mail for each array task). For example: #SBATCH --mail-type=BEGIN,END

5.4. Job Environment Directives

Job environment directives allow you to control the environment in which your script will operate. This section describes some useful variables in setting up the script environment.

Slurm Job Environment Directives
Directive Description
srun --pty slurm_options --x11 bash -i Use srun to request an interactive job on compute nodes that opens a pseudoterminal (--pty) and runs an interactive shell (bash -i) within that terminal.
salloc slurm_options --x11 bash Request an interactive job that runs on a login node to allow multiple srun commands to launch executables within that single interactive job.
#SBATCH --export=ALL Export all environment variables from your login environment into your batch environment.
#SBATCH --export=NONE Export no environment variables from your login environment into your batch environment.
#SBATCH --export=variable1,variable2 Export specific environment variables from your login environment into your batch environment.
#SBATCH --mem=size[K|M|G|T] Memory size per node.

5.4.1. Set Working Directory

By default, Slurm executes your job from the directory where you submit the job. To change the working directory, cd to it in the script. You can also set the working directory of the batch script to path before it is executed using the -D or --chdir directive: #SBATCH --chdir=path or #SBATCH -D path

The path can be a full or relative path to the directory where the command is executed. The advantage of using the --chdir directive (instead of cd in the script) is that stdout/stderr output files are also output into the new directory.

5.4.2. Interactive Batch Shell

When you log into Nautilus, you will be running in an interactive shell on a login node. The login nodes provide login access for Nautilus and support such activities as compiling, editing, and general interactive use by all users. Please note the Navy DSRC Login Node Abuse policy.

The preferred method to run resource intensive interactive executions is to use an interactive batch session. An interactive batch session allows you to run interactively (in a command shell) on a compute node after waiting in the batch queue.

Note: Once an interactive session starts, it uses the entire requested block of CPU time and other resources unless you exit from it early, even if you don't use it. To avoid unnecessary charges to your project, don't forget to exit an interactive session once finished.

A Slurm interactive session reserves resources on compute nodes, allowing you to use them interactively as you would the login node. Two main commands can be used to make a session, srun and salloc, both of which use most of the same options available to sbatch.

To use the interactive batch environment, you must first acquire an interactive batch shell. This is done by adding the --pty directive to your srun command. For example: srun --pty your_slurm_options --x11 /bin/bash

The Slurm options for your job are described in Required Scheduler Directives above. Note the command requires you to specify your shell, which in this case is the bash shell. The --x11 directive enables X-Windows access, so it may be omitted if your interactive job does not use a GUI.

salloc functions similarly to srun --pty bash in that it adds your job to the queue. However, salloc starts a new bash session on the login node. Any subsequent srun commands in that session occur within the running job on the compute nodes. This is useful for running multiple executions with srun within a single job. For example:

salloc your_slurm_options --x11 /bin/bash
srun executable1          # Run non-interactive executable
srun executable2          # Run another non-interactive executable
srun --pty --x11 bash -i  # Run interactive session (only one at a time)

Interactive batch sessions are scheduled just like normal batch jobs, so depending on how many other batch jobs are queued, it may take some time. Once your interactive batch shell starts, you will be logged into the first compute node of those assigned to your job. At this point, you can run or debug interactive applications, execute job scripts, post-process data, etc. You can launch parallel applications on your assigned compute nodes by using an MPI or other parallel launch command.

The HPC Interactive Environment (HIE) provides an HIE queue specifically for interactive jobs. It offers longer job times and has nodes reserved only for HIE, so queue wait times are sometimes much shorter. However, HIE has limitations, such as only allowing the use of a single node at a time. See the HIE User Guide for more information before using the HIE queue.

5.4.3. Export Environment Variables

Batch jobs run with their own environment, separate from the login environment from which the batch job is launched. If your application is dependent on environment variables set in the login environment, you need to export these variables from the login environment to the batch environment.

The --export=ALL directive tells Slurm to export all environment variables from your login environment to your batch environment. SLURM_* variables are always propagated. To use this directive, add the following line to your batch script: #SBATCH --export=ALL

To export none of your own environment variables and only SLURM_* variables from your environment use the directive: #SBATCH --export=NONE

To export all SLURM_* environment variables along with explicitly defined variables use the directive: #SBATCH --export=my_variable1,my_variable2,...

It is also possible to set values for variables exported in this way, as follows: #SBATCH --export=my_variable=my_value,my_variable2=my_value2...
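When a job depends on exported variables, a guard near the top of the batch script can fail early with a clear message instead of partway through a run. This is a minimal sketch: require_vars, INPUT_DIR, and MODEL_NAME are hypothetical names, and the export line only provides stand-in values so the example runs anywhere.

```shell
#!/bin/bash
# Return non-zero and name each variable that is unset or empty.
require_vars() {
    local name missing=0
    for name in "$@"; do
        if [ -z "${!name:-}" ]; then          # indirect lookup by variable name
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Stand-in values; in a real job these would arrive through --export.
export INPUT_DIR=/tmp MODEL_NAME=demo

if require_vars INPUT_DIR MODEL_NAME; then
    echo "environment looks complete"
fi
```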

5.4.4. Memory Size

The --mem=size directive specifies the maximum amount of memory required per node in the job. The size is interpreted in megabytes unless a unit suffix (K, M, G, or T) is given. For example, if the job needs up to 2 GB of memory, then the directive would read: #SBATCH --mem=2G

5.5. Job Dependency Directives


Job dependency directives allow you to specify dependencies your job may have on other jobs. This allows you to control the order in which jobs run. These directives generally take the following form: #SBATCH --dependency=dependency_expression or #SBATCH -d dependency_expression where dependency_expression is a delimited list of one or more dependencies of the form: type:job_id[:job_id][,type:job_id[:job_id]] or type:job_id[:job_id][?type:job_id[:job_id]] where type is one of the directives listed in the table below, and job_id is a colon-delimited list of one or more job IDs your job is dependent upon. All dependencies must be satisfied if the "," separator is used. Any dependency may be satisfied if the "?" separator is used. For more information about job dependencies, see the sbatch man page.

Slurm Job Dependency Directives
Directive Description
after:job_id[[+time][:job_id[+time]...]] This job may begin time minutes after the specified jobs start or are cancelled. If no time is given there is no delay after start or cancellation.
afterany:job_id[:job_id...] This job may begin after the specified jobs terminate. This is the default dependency type.
aftercorr:job_id[:job_id...] A task of this job array may begin after the corresponding task ID in the specified job completes successfully (i.e., runs to completion with an exit code of zero).
afternotok:job_id[:job_id...] This job may begin after the specified jobs terminate in a failed state (non-zero exit code, node failure, timed out, etc.).
afterok:job_id[:job_id...] This job may begin after the specified jobs run successfully (i.e., to completion with a zero exit code).
singleton This job may begin after the termination of any previously launched jobs that have the same job name and user.
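The dependency expression syntax in this section can be assembled from job IDs with a small helper. The sketch below only builds and prints the directive text; make_dependency is a hypothetical name and the job IDs are invented for illustration. Nothing is submitted.

```shell
#!/bin/bash
# Join job IDs into one dependency of the form type:job_id[:job_id...]
make_dependency() {
    local type=$1; shift
    local IFS=':'               # "$*" joins arguments with the first IFS char
    echo "${type}:$*"
}

# Hypothetical job IDs for illustration.
dep_ok=$(make_dependency afterok 1001 1002)
dep_fail=$(make_dependency afternotok 1003)

echo "#SBATCH --dependency=${dep_ok},${dep_fail}"   # all must be satisfied
echo "#SBATCH --dependency=${dep_ok}?${dep_fail}"   # any may be satisfied
```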

5.6. Slurm Input Environment Variables

In addition to environment variables inherited from your user environment and environment variables set by the scheduler, Slurm has environment variables that can be set by the user. Upon startup, sbatch reads and handles the options set in the following environment variables. While there are many environment variables, you only need to know a few important ones to get started. Commonly used environment variables you can set include:

Slurm User Environment Variables
Variable Description
$SBATCH_ACCOUNT Account name associated with the job allocation. Same as -A or --account.
$SBATCH_ARRAY_INX Submit a job array. Same as -a or --array.
$SBATCH_CONSTRAINT Specify which constraints are required by the job. Same as -C or --constraint.
$SBATCH_DISTRIBUTION Same as -m or --distribution
$SBATCH_EXPORT Same as --export
$SBATCH_GET_USER_ENV Retrieve the login environment variables. Same as --get-user-env
$SBATCH_GRES Specifies a comma-delimited list of node types. Same as --gres.
$SBATCH_GRES_FLAGS Specify node task pinning options. Same as --gres-flags.
$SBATCH_MEM_PER_NODE Specify the real memory required per node. Same as --mem
$SBATCH_QOS Request a quality of service for the job. Same as --qos or -q.
$SBATCH_RESERVATION Allocate resources from the named reservation. Same as --reservation
$SBATCH_TIMELIMIT Set a limit on the total run time. Same as -t or --time
See the sbatch man page for a complete list of environment variables.

6. Job Arrays

Imagine you have several hundred jobs that are all identical except for two or three parameters whose values vary across a range of input values. Submitting all these jobs individually would not only be tedious but would also incur a lot of overhead, which would impose a significant strain on the scheduler, negatively impacting all scheduled jobs. This example is not an uncommon use case, and it is the reason why job arrays were invented.

Job arrays let you submit and manage collections of similar jobs quickly and easily within a single script, which can significantly relieve the strain on the queueing system. Resource directives are specified once at the top of a job array script and are applied to each array task. As a result, each task has the same initial options (e.g., size, wall time, etc.) but may have different input values.

If your use case includes 200 or more similar jobs that vary by only a few parameters, job arrays are highly recommended.

To implement a Slurm job array in your job script, include the directives: #SBATCH --array=n-m[:step],... or #SBATCH -a n-m[:step],... where n is the starting index, m is the ending index, and the optional step is the increment. Slurm then queues this script in FLOOR[(m-n)/step+1] instances, each of which receives its index in the $SLURM_ARRAY_TASK_ID environment variable. You can use the command echo $SLURM_ARRAY_TASK_ID to output the unique index of a job instance.

A step function is specified with a suffix containing a colon and number. For example, #SBATCH --array=0-15:4 which is equivalent to: #SBATCH --array=0,4,8,12
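The instance count FLOOR[(m-n)/step+1] and the resulting index list can be checked with a few lines of shell arithmetic before submitting. This is a minimal sketch; the function names array_task_count and array_indices are hypothetical.

```shell
#!/bin/bash
# Number of array tasks for --array=n-m:step, i.e. FLOOR[(m-n)/step+1].
array_task_count() {
    local n=$1 m=$2 step=${3:-1}
    echo $(( (m - n) / step + 1 ))   # integer division floors when m >= n
}

# Comma-separated list of the indices that n-m:step expands to.
array_indices() {
    local n=$1 m=$2 step=${3:-1} i out=""
    for (( i = n; i <= m; i += step )); do
        out+="${out:+,}$i"
    done
    echo "$out"
}

array_task_count 0 15 4   # prints 4
array_indices 0 15 4      # prints 0,4,8,12
```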

When working with Slurm job arrays, other Slurm environment variables that come into play include:

  • SLURM_ARRAY_JOB_ID - Job array's master job ID number.
  • SLURM_ARRAY_TASK_COUNT - Total number of tasks in a job array.
  • SLURM_ARRAY_TASK_MAX - Job array's maximum ID (index) number.
  • SLURM_ARRAY_TASK_MIN - Job array's minimum ID (index) number.
  • SLURM_ARRAY_TASK_STEP - Job array's index step size.

See the sbatch man page for more information.
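A common pattern is to use $SLURM_ARRAY_TASK_ID as an index into a list of inputs so that each array task works on a different value. The sketch below assumes a four-element array; the parameter values and the param_for_task helper are hypothetical, and the fallback to index 0 exists only so the script can be tried outside of Slurm.

```shell
#!/bin/bash
#SBATCH --array=0-3

# Hypothetical parameter values; one array task handles each entry.
params=(0.1 0.5 1.0 2.0)

# Look up the parameter that belongs to a given array index.
param_for_task() { echo "${params[$1]}"; }

# Outside Slurm the variable is unset, so fall back to index 0 for testing.
task_id=${SLURM_ARRAY_TASK_ID:-0}

echo "array task ${task_id} uses parameter $(param_for_task "$task_id")"
# ./executable "$(param_for_task "$task_id")"   # placeholder for the real command
```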

7. Example Scripts

This section provides sample scripts you may copy and use. All scripts follow the anatomy presented in Section 3 and have been tested in their respective scheduler environment. When you use any of these examples, remember to substitute your own Project_ID, job name, output and error files, executable, and clean up. More advanced scripts can be found under the $SAMPLES_HOME directory on the system. Assorted flavors of Hello World are provided in Section 8. These simple programs can be used to test these scripts.

The following Baseline Configuration variables are used in the scripts below.

Baseline Configuration Batch Variables
Variable Description
$BC_CORES_PER_NODE The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE The approximate maximum memory per node available to an end user program (in integer MB) for the compute node type to which a job is being submitted.
$BC_MPI_TASKS_ALLOC Intended to be referenced from inside a job script, contains the number of MPI tasks/ranks allocated for a particular job.
$BC_NODE_ALLOC Intended to be referenced from inside a job script, contains the number of nodes allocated for a particular job.

7.1. Simple Batch Script

The following is a very basic script to demonstrate requesting resources (including all required directives), setting up the environment, specifying the execution block (i.e., the commands to be executed), and cleaning up after your job completes. Save this as a regular text file using your editor of choice.

#!/bin/bash
#################################################################
# Description:  This is a basic bash shell script for a simple job.
#               The job can be submitted to the standard queue
#               with the sbatch command.
# Use the "show_usage" command to get your PROJECT_ID(s).
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID
## or
##SBATCH -A Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4
## or
##SBATCH -N 4

# Total tasks count
#SBATCH --ntasks=8
## or
##SBATCH -n 8

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00
## or
##SBATCH -t 00:10:00

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName
## or
##SBATCH -J jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
## or
##SBATCH -o filename.out

#SBATCH --error=filename.err
## or
##SBATCH -e filename.err

##################################################################
# EXECUTION BLOCK -------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"

# Run the job from the default working directory
echo
echo "-----------------------"
echo "-- Executable Output --"
echo "-----------------------"
mpirun ./executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

exit

7.2. Job Information Batch Script

The following examples can be included in the Execution block of any job script. The first example shows Baseline Configuration environment variables available on all HPCMP systems. The second example shows scheduler-specific variables.

#################################################################
# Job information set by Baseline Configuration variables
#################################################################
echo ----------------------------------------------------------
echo "Type of node                    " $BC_NODE_TYPE
echo "CPU cores per node              " $BC_CORES_PER_NODE
echo "CPU cores per standard node     " $BC_STANDARD_NODE_CORES
echo "CPU cores per accelerator node  " $BC_ACCELERATOR_NODE_CORES
echo "CPU cores per big memory node   " $BC_BIGMEM_NODE_CORES
echo "Hostname                        " $BC_HOST
echo "Maximum memory per node         " $BC_MEM_PER_NODE
echo "Number of tasks allocated       " $BC_MPI_TASKS_ALLOC
echo "Number of nodes allocated       " $BC_NODE_ALLOC
echo "Working directory               " $WORKDIR
echo ----------------------------------------------------------

##############################################################
# Output some useful job information.  
##############################################################
echo "-------------------------------------------------------"
echo "Project ID                      " $SLURM_JOB_ACCOUNT
echo "Job submission directory        " $SLURM_SUBMIT_DIR
echo "Submit host                     " $SLURM_SUBMIT_HOST
echo "Job name                        " $SLURM_JOB_NAME
echo "Job identifier (SLURM_JOB_ID)   " $SLURM_JOB_ID
echo "Job identifier (SLURM_JOBID)    " $SLURM_JOBID
echo "Working directory               " $WORKDIR
echo "Job partition                   " $SLURM_JOB_PARTITION
echo "Job queue (QOS)                 " $SLURM_JOB_QOS
echo "Job number of nodes             " $SLURM_JOB_NUM_NODES
echo "Job node list                   " $SLURM_JOB_NODELIST
echo "Number of nodes                 " $SLURM_NNODES
echo "Number of tasks                 " $SLURM_NTASKS
echo "Node list                       " $SLURM_NODELIST
echo "-------------------------------------------------------"
echo

7.3. OpenMP Script

To run a pure OpenMP job, specify the number of cores you want from the node (ncpus). Also specify the number of threads (ompthreads); otherwise, $OMP_NUM_THREADS defaults to the value of ncpus, possibly resulting in poor performance. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4

# Total tasks count
#SBATCH --ntasks=8

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK -------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"

export OMP_NUM_THREADS=${BC_CORES_PER_NODE}

# Run the job from the default working directory
./openMP_executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

exit
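The thread-count setting in the script above can also be derived from what Slurm allocated. The following is a minimal sketch, not a site-verified recipe. It assumes the job was submitted with --cpus-per-task, in which case Slurm sets SLURM_CPUS_PER_TASK. If that variable is absent, it falls back to BC_CORES_PER_NODE and then to 1. The function name pick_omp_threads is hypothetical.

```shell
#!/bin/bash
# Sketch: choose an OpenMP thread count for the job.
# Assumption: SLURM_CPUS_PER_TASK is set only when --cpus-per-task was requested.
# BC_CORES_PER_NODE is the variable already used in the script above.
pick_omp_threads() {
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        echo "${SLURM_CPUS_PER_TASK}"
    elif [ -n "${BC_CORES_PER_NODE:-}" ]; then
        echo "${BC_CORES_PER_NODE}"
    else
        # Neither variable is available: run single-threaded
        echo 1
    fi
}

export OMP_NUM_THREADS="$(pick_omp_threads)"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Preferring the per-task value keeps the thread count tied to the cores the scheduler granted rather than to the full node size.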

7.4. Hybrid (MPI/OpenMP) Script

Hybrid MPI/OpenMP scripts are required for executables that use MPI between nodes and OpenMP threads within each node. The following script is an example of a hybrid MPI/OpenMP job that runs one MPI task per node. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4

# Total tasks count, 1 per node
#SBATCH --ntasks=4

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK -------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"

# One thread per core on each node
export OMP_NUM_THREADS=$BC_CORES_PER_NODE

# Run the job from the default working directory
mpirun ./hybrid_executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.5. Accessing More Memory per Process

By default, an MPI job runs one process per core, with all processes sharing the available memory on the node. On Nautilus, each compute node has 128 cores and 237 GB of memory. Assuming one process per core, the memory per process is: memory per process = 237 GB / 128 ≈ 1.85 GB

If you need more memory per process, your job must run fewer MPI processes per node, that is, number_of_processes_per_node < 128. For example, if you request 4 nodes and use only 16 of the 128 cores on each node, the job runs a total of 4 * 16 = 64 MPI processes. Each of the 16 MPI processes on a node then has access to approximately 237 GB / 16 ≈ 14.8 GB of memory.
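The arithmetic above can be reproduced with a short shell calculation. This is only an illustrative sketch: the node figures are the ones quoted in this section, and the function name mem_per_process_gb is hypothetical.

```shell
#!/bin/bash
# Sketch: memory available to each MPI process on a standard node.
# Node figures are the ones quoted in this section.
NODE_MEM_GB=237
CORES_PER_NODE=128

# $1 = MPI processes per node; prints GB per process to two decimals
mem_per_process_gb() {
    awk -v mem="${NODE_MEM_GB}" -v ppn="$1" 'BEGIN { printf "%.2f\n", mem / ppn }'
}

NODES=4
PROCS_PER_NODE=16

# 4 nodes * 16 processes per node = 64 processes in total
echo "Total MPI processes:        $(( NODES * PROCS_PER_NODE ))"
# One process per core: about 1.85 GB each
echo "GB per process (128/node):  $(mem_per_process_gb "${CORES_PER_NODE}")"
# 16 processes per node: about 14.8 GB each
echo "GB per process (16/node):   $(mem_per_process_gb "${PROCS_PER_NODE}")"
```

Changing PROCS_PER_NODE shows how many processes per node fit a given per-process memory requirement.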

Another way to get more memory per process is to run on bigmem nodes, which is discussed in Section 7.9. However, because there are few bigmem nodes on the system, bigmem nodes may not be an option if you need many cores.

The following script demonstrates the fewer-processes-per-node approach by requesting 4 nodes and running 16 processes per node. The job runs for 2 hours in the standard queue. For more information, refer to the Samples section in the Nautilus User Guide. Note: Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Start 64 MPI processes; only 16 processes on each node
# This results in each process having a memory size of 237 GB / 16
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16

# Set max wall time to 2 hours
#SBATCH --time=02:00:00

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK -------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"


# Run the job from the default working directory
mpirun ./executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

exit

7.6. GPU Script

Here is a short example of a script for submitting jobs to a GPU node. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4

# Total tasks count
#SBATCH --ntasks=8

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00

# Request GPU nodes - select one of the options below by removing
# one leading '#' from that line
##SBATCH --gres=gpu:a100:2  # Same as mla
##SBATCH --gres=gpu:a40:2   # Same as viz
##SBATCH --gres=gpu:2
##SBATCH --constraint="viz"
##SBATCH --constraint="mla"

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"

# Run the job from the default working directory
./GPU_executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

exit

7.7. Data Transfer Script

The transfer queue is a special-purpose queue for transferring or archiving files. It has access to $HOME, $ARCHIVE_HOME, $WORKDIR, and $CENTER. Jobs running in the transfer queue are charged for a single core against your allocation. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Set max wall time to ten minutes
#SBATCH --time=00:10:00

# Request transfer nodes
#SBATCH --constraint="xfer"

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
cd $WORKDIR
echo "New directory = ${WORKDIR}"

# Assume all files to be transferred from are in $WORKDIR/from_dir
export FROM_DIR=$WORKDIR/from_dir

# Assume all files are to be transferred to $ARCHIVE_HOME
export TO_DIR=$ARCHIVE_HOME

# Create a gzip-compressed tar file of $FROM_DIR to reduce data transfer time
tar -czf ${FROM_DIR}.tar.gz -C ${FROM_DIR} .

# If needed, uncomment to create a directory on the archive
# archive mkdir -C $TO_DIR

# Use archive commands to transfer the data
archive put -C $TO_DIR ${FROM_DIR}.tar.gz

# List the archive directory contents to verify data transfer
archive ls -al -C $TO_DIR

echo "Transfer job ended"
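Before a transfer, it can be worth confirming that the compressed file is readable. The following is a minimal sketch under stated assumptions: the helper name pack_dir and the .tar.gz suffix are hypothetical choices. It uses only standard tar options and does not call the site-specific archive command.

```shell
#!/bin/bash
# Sketch: pack the contents of a directory into <dir>.tar.gz and verify it.
# pack_dir is a hypothetical helper, not part of the archive tooling.
pack_dir() {
    local src="$1"
    local tarball="${src%/}.tar.gz"
    # -C enters the source directory so paths stored in the tar file are relative
    tar -czf "${tarball}" -C "${src}" . || return 1
    # Listing reads the whole file, which detects a truncated or corrupt tar file
    tar -tzf "${tarball}" > /dev/null || return 1
    echo "${tarball}"
}

# Only runs when the directory assumed by the script above exists
if [ -n "${WORKDIR:-}" ] && [ -d "${WORKDIR}/from_dir" ]; then
    pack_dir "${WORKDIR}/from_dir"
fi
```

Because the tar file is written next to the source directory rather than inside it, the archive never tries to include itself.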

7.8. Job Array Script

As discussed in Section 6, job arrays allow you to leverage a scheduler's ability to create multiple jobs from one script. Situations where this is useful include:

  • Establishing a list of commands to run and having a job created from each command in the list.
  • Running many parameters against one set of data or analysis program.
  • Running the same program multiple times with different sets of data.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4

# Total tasks count
#SBATCH --ntasks=8

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
cd $WORKDIR/SCRIPTS
export JA_ID=$SLURM_ARRAY_JOB_ID
export JA_DIR=$WORKDIR/Job_Array.o${JA_ID}

# Output Job ID and Job array index information
echo "Slurm Job Array ID is $SLURM_ARRAY_JOB_ID"
echo "Slurm Job array index is $SLURM_ARRAY_TASK_ID"
echo "Slurm Job ID is $SLURM_JOB_ID"
echo "Job array directory is $JA_DIR"
#
# Make a directory for the job array; all tasks in the array share it,
# so use -p to avoid an error if another task already created it
mkdir -p $JA_DIR
#
# Change into the job array directory to run each task
cd $JA_DIR
#
# Retrieve the job's binary
cp $WORKDIR/executable $JA_DIR/executable_$SLURM_ARRAY_TASK_ID
#
# Run job and redirect output
export outfile=$JA_DIR/${JA_ID}_${SLURM_ARRAY_TASK_ID}
mpirun $JA_DIR/executable_$SLURM_ARRAY_TASK_ID &> $outfile
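The first use case listed for this section, creating one job per command in a list, can be sketched as follows. This is an illustration under stated assumptions, not a site-provided script. The file name commands.txt and the function name command_for_task are hypothetical, and the array is assumed to be submitted with an index range that matches the number of lines in the file (for example, --array=1-3 for three commands).

```shell
#!/bin/bash
# Sketch: run line N of a command list in array task N.
# Assumption: commands.txt (hypothetical) holds one shell command per line.
CMD_FILE="${CMD_FILE:-commands.txt}"

# $1 = 1-based line number, $2 = command file; prints that line
command_for_task() {
    sed -n "${1}p" "${2}"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 1 so
# the sketch can also be tried outside the scheduler.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

if [ -f "${CMD_FILE}" ]; then
    CMD="$(command_for_task "${TASK_ID}" "${CMD_FILE}")"
    echo "Task ${TASK_ID} runs: ${CMD}"
    # eval executes the line as written, so the list must be trusted input
    eval "${CMD}"
else
    echo "Command file ${CMD_FILE} not found" >&2
fi
```

Keeping the commands in a separate file means the number of tasks can change without editing the batch script; only the array range given at submission changes.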

7.9. Large-Memory Node Script

The standard compute nodes on Nautilus each contain 237 GB of RAM and 128 cores, which works out to about 1.85 GB per core. This is sufficient for most jobs running on the system. However, some jobs require more memory per core. To accommodate these jobs, Nautilus has 16 large-memory nodes, each with 998 GB of memory. You can run a job on the large-memory nodes by submitting a large-memory job script. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
# Account to be charged
#SBATCH --account=Project_ID

# Run the job in the standard queue
#SBATCH -q standard

# Select 4 nodes
#SBATCH --nodes=4

# Total tasks count
#SBATCH --ntasks=8

# Set max wall time to 10 minutes
#SBATCH --time=00:10:00

# Request big memory nodes
#SBATCH --constraint=bigmem

##################################################################
# OPTIONAL DIRECTIVES  -------------------------------------------
##################################################################
# Name the job 'jobName'
#SBATCH --job-name=jobName

# Change stdout and stderr filenames
#SBATCH --output=filename.out
#SBATCH --error=filename.err

##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
# Change to the default working directory
cd ${WORKDIR}
echo "working directory is ${WORKDIR}"

# Run the job from the default working directory
mpirun ./executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and
# move data to non-scratch directory (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

8. Hello World Examples

This section provides source code for several versions of the basic hello.c program. Refer to the Nautilus User Guide for information about compiling.

8.1. C Program - hello.c

/**************************************************************
* A simple program to demonstrate an MPI executable
***************************************************************/
#include <mpi.h>
#include <stdio.h>

int rank;
int numRanks;
char processorName[MPI_MAX_PROCESSOR_NAME];
int nameLen;

int main(int argc, char** argv) {

    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of ranks (MPI processes)
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

    // Get the rank of this process
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the name of the processor
    MPI_Get_processor_name(processorName, &nameLen);

    // Print messages from each processor
    printf("Hello from processor %s - ", processorName);
    printf("I am rank %d out of %d ranks\n", rank, numRanks);

    // Finalize the MPI environment
    MPI_Finalize();
} // end main

8.2. OpenMP - hello-OpenMP.c

/*************************************************************
* A simple program to demonstrate a pure OpenMP executable
***************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>           // needed for OpenMP
#include <unistd.h>        // only needed for definition of gethostname
#include <sys/param.h>     // only needed for definition of MAXHOSTNAMELEN

int main (int argc, char *argv[]) {
  int th_id, nthreads;
  char foo[] = "Hello";
  char bar[] = "World";
  char hostname[MAXHOSTNAMELEN];
  gethostname(hostname, MAXHOSTNAMELEN);
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("%s %s from thread %d on %s!\n", foo, bar, th_id, hostname);
    #pragma omp barrier
    if ( th_id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There were %d threads on %s!\n", nthreads, hostname);
    }
  }
  return EXIT_SUCCESS;
}

8.3. Hybrid MPI/OpenMP - hello-hybrid.c

/*****************************************************************
* A simple program to demonstrate a Hybrid MPI/OpenMP executable
*****************************************************************/
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        // A single printf keeps each thread's message on one line
        printf("Hello from thread %d of %d, from process %d out of %d on %s\n",
                   iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
}

8.4. CUDA - hello-cuda.cu

/*****************************************************************
* A simple program to demonstrate a CUDA/GPU executable
* May require a module swap: 
*		module swap PrgEnv-cray PrgEnv-nvidia
* Check User manual for compiling GPU code
******************************************************************/
#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>

void cuda_device_init(void)
 {
    int ndev;
    cudaGetDeviceCount(&ndev);
    cudaDeviceSynchronize();
    if (ndev == 1)
       printf("There is %d GPU.\n",ndev);
    else
       printf("There are %d GPUs\n",ndev);

    for(int i=0;i<ndev;i++) {
       cudaDeviceProp pdev;
       cudaGetDeviceProperties(&pdev,i);
       cudaDeviceSynchronize();
       printf("Hello from GPU %d\n",i);
       printf("GPU type  : %s\n",pdev.name);
       // These fields are of type size_t, so print them with %zu
       printf("Memory Global: %zu MB\n",\
                       (pdev.totalGlobalMem+1024*1024)/1024/1024);
       printf("Memory Const : %zu KB\n",pdev.totalConstMem/1024);
       printf("Memory Shared: %zu KB\n",pdev.sharedMemPerBlock/1024);
       printf("Clock Rate  : %.3f GHz\n",pdev.clockRate/1000000.0);
       printf("Number of Processors  : %d\n",pdev.multiProcessorCount);
       printf("Number of Cores  : %d\n",8*pdev.multiProcessorCount);
       printf("Warp Size : %d\n",pdev.warpSize);
       printf("Max Thr/Blk  : %d\n",pdev.maxThreadsPerBlock);
       printf("Max Blk Size : %d %d %d\n",\
                       pdev.maxThreadsDim[0],pdev.maxThreadsDim[1],\
                       pdev.maxThreadsDim[2]);
       printf("Max Grid Size: %d %d %d\n",\
                       pdev.maxGridSize[0],pdev.maxGridSize[1],\
                       pdev.maxGridSize[2]);
    }
}

int main(int argc, char * argv[]) {

   cuda_device_init();
   return 0;
}

/**************************************************************
* Compile Script for hello-cuda on Nautilus
***************************************************************/
#!/bin/bash
#
# Source the module initialization script so the module command is available
. $MODULESHOME/init/bash
module unload amd/aocc amd/aocl
module load nvidia/nvhpc
#
set -x
#
nvcc -o hello-cuda.exe hello-cuda.cu

9. Batch Scheduler Rosetta


User Commands

Job Submission
    PBS:    qsub Script_File
    Slurm:  sbatch Script_File
    LSF:    bsub < Script_File
Job Deletion
    PBS:    qdel Job_ID
    Slurm:  scancel Job_ID
    LSF:    bkill Job_ID
Job status (by job)
    PBS:    qstat Job_ID
    Slurm:  squeue -j Job_ID
    LSF:    bjobs Job_ID
Job status (by user)
    PBS:    qstat -u User_Name
    Slurm:  squeue -u User_Name
    LSF:    bjobs -u User_Name
Job hold
    PBS:    qhold Job_ID
    Slurm:  scontrol hold Job_ID
    LSF:    bstop Job_ID
Job release
    PBS:    qrls Job_ID
    Slurm:  scontrol release Job_ID
    LSF:    bresume Job_ID
Queue list
    PBS:    qstat -Q
    Slurm:  squeue
    LSF:    bqueues
Node list
    PBS:    pbsnodes -l
    Slurm:  sinfo -N OR scontrol show nodes
    LSF:    bhosts
Cluster status
    PBS:    qstat -a
    Slurm:  sinfo
    LSF:    bqueues
GUI
    PBS:    xpbsmon
    Slurm:  sview
    LSF:    xlsf OR xlsbatch

Environment

Job ID
    PBS:    $PBS_JOBID
    Slurm:  $SLURM_JOBID
    LSF:    $LSB_JOBID
Submit Directory
    PBS:    $PBS_O_WORKDIR
    Slurm:  $SLURM_SUBMIT_DIR
    LSF:    $LSB_SUBCWD
Submit Host
    PBS:    $PBS_O_HOST
    Slurm:  $SLURM_SUBMIT_HOST
    LSF:    $LSB_SUB_HOST
Node List
    PBS:    $PBS_NODEFILE
    Slurm:  $SLURM_JOB_NODELIST
    LSF:    $LSB_HOSTS or $LSB_MCPU_HOSTS
Job Array Index
    PBS:    $PBS_ARRAYID
    Slurm:  $SLURM_ARRAY_TASK_ID
    LSF:    $LSB_JOBINDEX

Job Specification

Script Directive
    PBS:    #PBS
    Slurm:  #SBATCH
    LSF:    #BSUB
Queue
    PBS:    -q Queue_Name
    Slurm:  ARL: -p Queue_Name
            AFRL and Navy: -q Queue_Name
    LSF:    -q Queue_Name
Node Count
    PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
            (N1 = Node count, N2 = Max cores per node, N3 = Cores to use per node)
    Slurm:  -N min[-max]
    LSF:    -n CoreCount -R "span[ptile=CoresPerNode]"
            (NodeCount = CoreCount / Cores Per Node)
Core Count
    PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
            (N1 = Node count, N2 = Max cores per node, N3 = Cores to use per node,
            Core Count = N1 x N3)
    Slurm:  --ntasks=total_cores_in_run
    LSF:    -n Core_Count
Wall Clock Limit
    PBS:    -l walltime=hh:mm:ss
    Slurm:  -t min OR -t days-hh:mm:ss
    LSF:    -W hh:mm
Standard Output File
    PBS:    -o File_Name
    Slurm:  -o File_Name
    LSF:    -o File_Name
Standard Error File
    PBS:    -e File_Name
    Slurm:  -e File_Name
    LSF:    -e File_Name
Combine stdout/err
    PBS:    -j oe (both to stdout) OR -j eo (both to stderr)
    Slurm:  (use -o without -e)
    LSF:    (use -o without -e)
Copy Environment
    PBS:    -V
    Slurm:  --export=ALL|NONE|Variable_List
    LSF:    (not listed)
Event Notification
    PBS:    -m [a][b][e]
    Slurm:  --mail-type=[BEGIN],[END],[FAIL]
    LSF:    -B or -N
Email Address
    PBS:    -M Email_Address
    Slurm:  --mail-user=Email_Address
    LSF:    -u Email_Address
Job Name
    PBS:    -N Job_Name
    Slurm:  --job-name=Job_Name
    LSF:    -J Job_Name
Job Restart
    PBS:    -r y|n
    Slurm:  --requeue OR --no-requeue (NOTE: configurable default)
    LSF:    -r
Working Directory
    PBS:    No option; defaults to home directory
    Slurm:  --chdir=/Directory/Path
    LSF:    No option; defaults to submission directory
Resource Sharing
    PBS:    -l place=scatter:excl
    Slurm:  --exclusive OR --oversubscribe
    LSF:    -x
Account to charge
    PBS:    -A Project_ID
    Slurm:  --account=Project_ID
    LSF:    -P Project_ID
Tasks per Node
    PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
            (N1 = Node count, N2 = Max cores per node, N3 = Cores to use per node)
    Slurm:  --ntasks-per-node=count
    LSF:    (not listed)
Job Dependency
    PBS:    -W depend=state:Job_ID[:Job_ID...][,state:Job_ID[:Job_ID...]]
    Slurm:  --dependency=state:Job_ID
    LSF:    -w done|exit|finish
Job host preference
    PBS:    (not listed)
    Slurm:  --nodelist=nodes AND/OR --exclude=nodes
    LSF:    -m Node_List (e.g., "inf001" or inf[001-128]) OR
            -m node_type (e.g., "inference", "training", or "visualization")
Job Arrays
    PBS:    -J N-M[:step][%Max_Jobs]
    Slurm:  --array=N-M[:step]
    LSF:    -J "Array_Name[N-M[:step]][%Max_Jobs]"
            (Note: the brackets around the index range are literal)
Generic Resources
    PBS:    -l other=Resource_Spec
    Slurm:  --gres=Resource_Spec
    LSF:    (not listed)
Licenses
    PBS:    -l app=number (Example: -l abaqus=21)
            (Note: license resource allocation)
    Slurm:  -L app:number (Example: -L abaqus:21)
    LSF:    -R "rusage[License_Spec]" (Note: brackets are literal)
Begin Time
    PBS:    -a [[[YYYY]MM]DD]hhmm[.ss] (Note: no delimiters)
    Slurm:  --begin=YYYY-MM-DD[Thh:mm[:ss]]
    LSF:    -b [[YYYY:][MM:]DD:]hh:mm

10. Glossary

Batch-scheduled: Users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available.

Batch Script: A script that provides resource requirements and commands for the job.

Pinning: Pinning threads for shared-memory parallelism or binding processes for distributed-memory parallelism is an advanced way to control how your system distributes the threads or processes across the available cores.