Narwhal PBS Guide

1. Introduction

On large-scale computers, many users must share available resources. Because of this, you can't just log on to one of these systems, upload your programs, and start running them. Essentially, your programs must "get in line" and wait their turn, and there is more than one of these lines, or queues, from which to choose. Some queues have a higher priority than others (like the express checkout at the grocery store). The queues available to you are determined by the projects you are involved with.

To perform any task on the compute cluster, you must submit it as a "job" to a special piece of software called the scheduler or batch queueing system. At its most basic, a job is a command run non-interactively; any command (or series of commands) you want to run on the system is called a job.

Before you can submit your job to the scheduler, you must describe it, usually in the form of a batch script. The batch script specifies the computing resources needed, identifies an application to be run (along with its input data and environment variables), and describes how best to deliver the output data.

The process of using a scheduler to run the job is called batch job submission. When you submit a job, it is placed in a queue with jobs from other users. The scheduler then manages which jobs run, where, and when. Without the scheduler, users could overload the system, resulting in tremendous performance degradation for everyone. The queuing system runs your job as soon as it can do so while still honoring the following:

  • Meeting your resource requests
  • Not overloading the system
  • Running higher priority jobs first
  • Maximizing overall throughput

The process can be summarized as:

  1. Create a batch script.
  2. Submit a job.
  3. Monitor a job.
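
In PBS terms, these three steps might look like the following (the script name run.pbs is illustrative; qsub and qstat are covered in detail later in this guide):

```shell
# 1. Create a batch script (e.g., run.pbs) with your editor.
# 2. Submit it; qsub prints the assigned job ID.
qsub run.pbs
# 3. Monitor your jobs in the queue.
qstat -u $USER
```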

1.1. Document Scope

This document provides an overview and introduction to the use of the PBS batch scheduler on the HPE Cray EX (Narwhal) located at the Navy DSRC. The intent of this guide is to provide information to enable the average user to submit jobs on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the Linux operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote use of computer systems via network
  • A selected programming language and its related tools and libraries

We suggest you review the Narwhal User Guide before using this guide.

2. Resources and Queue Information

2.1. Resource Summary

When working on an HPC system you must specify the resources your job needs to run. This lets the scheduler find the right time and place to schedule your job. Strict adherence to resource requests allows PBS to find the best possible place for your jobs and ensures no user can use more resources than they've been given. You should always try to specify resource limits that are close to but greater than your requirements so your job can be scheduled more quickly. This is because PBS must wait until the requested resources are available before it can run your job. You cannot request more resources than are available on the system, and you cannot use more resources than you request. If you do, your job may be rejected, fail, or remain indefinitely in the queue.

Narwhal is a batch-scheduled HPC system with numerous nodes, meaning users request compute nodes via commands to the batch scheduler software and wait in a queue until the requested nodes become available. All jobs that require large amounts of system resources must be submitted as a batch script, a script that provides resource requirements and commands for the job. As discussed in Section 3, scripts are used to submit a series of directives that define the resources required by your job. The most basic resources include time, nodes, and memory.

2.2. Node Information

Below is a summary of node types available on Narwhal. Refer to the Narwhal User Guide for in-depth information.

  • Login nodes - Access points for submitting jobs on Narwhal. Login nodes are intended for basic tasks such as uploading data, managing files, compiling software, editing scripts, and checking on or managing your jobs. DO NOT run your computations on the login nodes.
  • Compute nodes - Node types such as "Standard", "Large-Memory", "GPU", etc. are considered compute nodes. Compute nodes can include:
    • Standard nodes - The default compute node type on Narwhal.
    • Large-Memory Nodes - Large-memory nodes have more memory than standard nodes and are intended for jobs that require a large amount of memory.
    • Single-GPU Machine Learning Accelerator (Single-GPU MLA) nodes - Specialized GPU nodes with one GPU per node, intended for machine learning and other compute-intensive applications.
    • Dual-GPU Machine Learning Accelerator (Dual-GPU MLA) nodes - Specialized GPU nodes with two GPUs per node, intended for machine learning and other compute-intensive applications.
    • Visualization nodes - Visualization nodes are GPU nodes intended for visualization applications.

A summary of the node configuration on Narwhal is presented in the following table.

Node Configuration
                     | Login         | Standard      | Large-Memory  | Visualization      | Single-GPU MLA     | Dual-GPU MLA
Total Nodes          | 11            | 2,304         | 26            | 16                 | 32                 | 32
Processor            | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome      | AMD 7H12 Rome      | AMD 7H12 Rome
Processor Speed      | 2.6 GHz       | 2.6 GHz       | 2.6 GHz       | 2.6 GHz            | 2.6 GHz            | 2.6 GHz
Sockets / Node       | 2             | 2             | 2             | 2                  | 2                  | 2
Cores / Node         | 128           | 128           | 128           | 128                | 128                | 128
Total CPU Cores      | 1,408         | 294,912       | 3,328         | 2,048              | 4,096              | 4,096
Usable Memory / Node | 226 GB        | 238 GB        | 995 GB        | 234 GB             | 239 GB             | 239 GB
Accelerators / Node  | None          | None          | None          | 1                  | 1                  | 2
Accelerator          | N/A           | N/A           | N/A           | NVIDIA V100 PCIe 3 | NVIDIA V100 PCIe 3 | NVIDIA V100 PCIe 3
Memory / Accelerator | N/A           | N/A           | N/A           | 32 GB              | 32 GB              | 32 GB
Storage on Node      | 880 GB SSD    | None          | 1.8 TB SSD    | None               | 880 GB SSD         | 880 GB SSD
Interconnect         | HPE Slingshot | HPE Slingshot | HPE Slingshot | HPE Slingshot      | HPE Slingshot      | HPE Slingshot
Operating System     | SLES          | SLES          | SLES          | SLES               | SLES               | SLES

2.3. Queue Information

Queues are where your jobs run. Think of queues as a resource used to control how your job is placed on the available hardware. Queues address hardware considerations and define policies such as what type of jobs can run in the queues, how long your job can run, how much memory your job can use, etc. Every queue has its own limits, behavior, and default values.

On a first-come first-served basis, the scheduler checks whether the resources are available for the first job in the queue. If so, the job is executed without further delay. But if not, the scheduler goes through the rest of the queue to check whether another job can be executed without extending the waiting time of the first job in queue. If it finds such a job, the scheduler backfills the job. Backfill scheduling allows out-of-order jobs to use the reserved job slots if these jobs do not delay the start of another job. Therefore, smaller jobs (i.e., jobs needing only a few resources) usually encounter short queue times.

Your queue options are determined by your projects. Most users have access to the debug, standard, background, transfer, and HPC Interactive Environment (HIE) queues. Other queues exist, but access to these queues is restricted to projects that are granted special privileges due to urgency or importance, and they are not discussed here. To see the list of queues available on the system, use either the qstat -Q or show_queues command. Use the qstat -Qf queue command to get full details about a specific queue.

Standard Queue
As its name suggests, the standard queue is the most common queue and should be used for normal day-to-day jobs.

Debug Queue
When determining why your job is failing, it is very helpful to use the debug queue. It is restricted to user testing and debugging jobs and has a maximum walltime of thirty minutes. Because of the resource and time limits, jobs progress through the debug queue more quickly, so you don't have to wait many hours to get results.

Background Queue
The background queue is a bit special. Although it has the lowest priority, jobs in this queue are not charged against your project allocation. You may choose to run in the background queue for several reasons:

  • You don't care how long it takes for your job to begin running.
  • You are trying to conserve your allocation.
  • You have used up your allocation.

Transfer Queue
The transfer queue exists to help users conserve allocation when transferring data to and from Narwhal from within batch scripts. It has a wall clock limit of 48 hours, and jobs run in this queue are not charged against a user's allocation. It supports all environment variables defined by BC policy FY05-04 (Environment Variables), including those referring to storage locations.

Users can submit batch scripts in this queue to move data between various storage areas, file systems, or other systems. The following storage areas are accessible from the transfer queue:

  • $WORKDIR - Your temporary work directory on Narwhal
  • $CENTER - Your directory on the Center Wide File System (CWFS)
  • $ARCHIVE_HOME - Your directory on the mass storage archival system (MSAS) at the Navy DSRC
  • $HOME - Your home directory

HPC Interactive Environment (HIE) Queue
The HIE is both a queue configuration and a computing environment intended to deliver rapid response and high availability to support the following services:

  • Remote visualization
  • Application development for GPU-accelerated applications
  • Application development for other non-standard processors on a particular system

There is a very limited number of nodes available to the HIE queue, and they should be reserved for appropriate use cases. The use of the HIE queue for regular batch processing is considered abuse and is closely monitored. The HIE queue should not be used simply as a mechanism to give your regular batch jobs higher priority. Refer to the HIE User Guide for more information.

Priority Queues
The HPCMP has designated three restricted queues that require special permission for job submission. If your project is not authorized to submit jobs to these queues, your submission will fail. These queues include:

  • Urgent queue - Jobs belonging to DoD HPCMP Urgent Projects
  • High queue - Jobs belonging to DoD HPCMP High Priority Projects
  • Frontier queue - Jobs belonging to DoD HPCMP Frontier Projects

The following table describes the PBS queues available on Narwhal:

Queue Descriptions and Limits on Narwhal
(Priority decreases from top to bottom.)

Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Max Queued Per User | Max Running Per User | Description
Highest  | urgent     | 24 Hours   | 16,384 | N/A | 100 | Jobs belonging to DoD HPCMP Urgent Projects
         | frontier   | 168 Hours  | 65,536 | N/A | 100 | Jobs belonging to DoD HPCMP Frontier Projects
         | high       | 168 Hours  | 32,768 | N/A | 100 | Jobs belonging to DoD HPCMP High Priority Projects
         | debug      | 30 Minutes | 8,192  | N/A | 4   | Time/resource-limited for user testing and debug purposes
         | HIE        | 24 Hours   | 3,072  | N/A | 1   | Rapid response for interactive work; see the HPC Interactive Environment (HIE) User Guide
         | viz        | 24 Hours   | 128    | N/A | 8   | Visualization jobs
         | standard   | 168 Hours  | 32,768 | N/A | 100 | Standard jobs
         | mla        | 24 Hours   | 128    | N/A | 8   | Machine Learning Accelerated jobs that require a GPU node; PBS assigns the next available smla (1-GPU) or dmla (2-GPU) node
         | smla       | 24 Hours   | 128    | N/A | 8   | Machine Learning Accelerated jobs that require an smla (Single-GPU MLA) node
         | dmla       | 24 Hours   | 128    | N/A | 8   | Machine Learning Accelerated jobs that require a dmla (Dual-GPU MLA) node
         | serial     | 168 Hours  | 1      | N/A | 26  | Single-core serial jobs; charged to the project allocation at 1 core per hour
         | bigmem     | 96 Hours   | 1,280  | N/A | 2   | Large-memory jobs
         | transfer   | 48 Hours   | 1      | N/A | 10  | Data transfer for user jobs; not charged against project allocation (see the Navy DSRC Archive Guide, section 5.2)
Lowest   | background | 4 Hours    | 1,024  | N/A | 10  | User jobs that are not charged against the project allocation

3. Anatomy of a Batch Script

The PBS scheduler is currently running on Narwhal. It schedules jobs, manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch script. PBS can manage both single-processor and multiprocessor jobs. The appropriate module is automatically loaded for you when you log in. This section is a brief introduction to PBS. More advanced topics are discussed later in this document.

Batch Script Life Cycle
Let's start with what happens in the typical life cycle of a batch script, where an application is run in a batch submission:

  1. The user submits a batch script, which is put into the queue.
  2. Once the resources are allocated, the scheduler executes the batch script on one node, and the script has access to the typical environment variables the scheduler defines.
  3. The executable command in the script is encountered and executed. If using a launch command, the launch command examines the scheduler environment variables to determine the node list in the allocation, as well as parameters, such as the number of total processes, and launches the required number of processes.
  4. Once the executing process(es) have terminated, the batch script moves to the next line of execution or terminates if there are no more lines.

Batch Script Anatomy
A batch script is a small text file created with a text editor such as vi or emacs. Although the specifics of batch scripts may differ slightly from system to system, a basic set of components are always required, and a few components are just always good ideas. The basic components of a simple batch script must appear in the following order:

  • Specify Your Shell
  • Scheduler Directives
    • Required Directives
    • Optional Directives
  • The Execution Block

To simplify things, several template scripts are included in Section 7, where you can fill in required commands and resources.

Cautions About Special Characters

Some special characters are not handled well by schedulers. This is especially true of the following:

  • ^M characters - Scripts created on a MS Windows system, which usually contain ^M characters, should be converted with dos2unix before use.
  • Smart quotes - MS Word autocorrects normal straight single and double quotation marks into "smart quotes." Ensure your script only uses normal straight quotation marks.
  • Em dash, en dash, and hyphens - MS Word often autocorrects regular hyphens into em dash or en dash characters. Ensure your script only uses normal hyphens.
  • Tab characters - many editors insert tabs instead of spaces for various reasons. Ensure your script does not contain tabs.

3.1. Specify Your Shell

Your batch script is a shell script, so it's good practice to specify which shell your script is written in. If you do not specify your shell within the script, the scheduler uses your default login shell. To tell the scheduler which shell to use, the first line of your script should be: #!/bin/shell where shell is either bash (Bourne-Again shell), sh (Bourne shell), ksh (Korn shell), csh (C shell), tcsh (enhanced C shell), or zsh (Z shell).
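
For example, to have your script interpreted by bash, the first line of the script would be:

```shell
#!/bin/bash
```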

3.2. Required Scheduler Directives

After specifying the script shell, the next section of the script sets the scheduler directives, which define your resource requests to the scheduler. These include how many nodes are needed, how many cores per node, what queue the job will run in, and how long these resources are required (walltime).

Directives are a special form of comment, beginning with #PBS. As you might suspect, the # character tells the shell to ignore the line, but the scheduler reads these lines and uses the directives to set various values. IMPORTANT!! All directives MUST come before the first line of executable code in your script, otherwise they are ignored.

The scheduler has numerous directives to assist you in setting up how your job will run on the system. Some directives are required. Others are optional. Required directives specify resources needed to run the application. If your script does not define these directives, your job will be rejected by the scheduler or use center-defined defaults. Caution: default values may not be in line with your job requirements and may vary by center. Optional directives are discussed in Section 5.

To schedule your job, the scheduler must know:

  • The queue to run your job in.
  • The maximum time needed for your job.
  • The Project ID to charge for your job.
  • The number of nodes you are requesting.
  • The number of processes per node you are requesting.
  • The number of cores per node.
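
Putting the required directives together, the top of a batch script might look like the following sketch. The queue, walltime, Project ID, and node counts are placeholder values; substitute values appropriate for your job and project.

```shell
#!/bin/bash
# Required scheduler directives (placeholder values).
# All directives must come before the first line of executable code.
#PBS -q standard                          # queue to run in
#PBS -l walltime=012:00:00                # maximum run time (HHH:MM:SS)
#PBS -A Project_ID                        # Project ID to charge (see show_usage)
#PBS -l select=2:ncpus=128:mpiprocs=128   # 2 nodes, 128 cores/node, 128 MPI processes/node
```
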

3.2.1. Specifying the Queue

You must choose which queue you want your job to run in. Each queue has different priorities and limits and may target different node types with different hardware resources. To specify the queue, include the following directive: #PBS -q queue_name

3.2.2. How Long to Run

Next, the scheduler needs the maximum time you expect your job to run. This is referred to as walltime, as in clock on the wall. The walltime helps the scheduler identify appropriate run windows for your job. For accounting purposes, your allocation is charged for how long your job actually runs, which is typically less than the requested walltime.

In estimating your walltime, there are three things to keep in mind.

  • Your estimate is a limit. If your job hasn't completed within your estimate, it is terminated. So, you should always add a buffer to account for variability in run time because you don't want your job to be killed when it is 99.9% complete. And, if your job is terminated, your account is still charged for the time.
  • Your estimate affects how long your job waits in the queue. In general, shorter jobs run before longer jobs. If you specify a time that is too long, your job will likely sit in the queue longer than it should.
  • Each queue has a maximum time limit. You cannot request more time than the queue allows.

To specify your walltime, include the following directive: #PBS -l walltime=HHH:MM:SS

3.2.3. Your Project ID

The scheduler needs to know which project ID to charge for your job. You can use the show_usage command to find the projects available to you and their associated project IDs. In the show_usage output, project IDs appear in the column labeled "Subproject."

Note: If you have access to multiple projects, remember the project you specify may limit your choice of queues.

To specify the project ID for your job, include the following directive: #PBS -A Project_ID

3.2.4. Number of Nodes, Cores, and Processes

There are two types of computational resources: hardware (compute nodes and cores) and virtual (processes). A node is a computer system with a single operating system image, a unified memory space, and one or more cores. Every script must include directives for the node, core, and process selection. Nodes are allocated exclusively to your job and not shared with other users.

Important: Historically, PBS reserved CPUs via the ncpus directive. However, with the advent of multicore processors, the ncpus directive now actually selects the node type as identified by the number of cores. The ncpus directive must always be set to the number of physical cores on the targeted node type. For example, for standard nodes on Narwhal, ncpus=128. See the Node Configuration Table in Section 2.2 for core counts associated with all node types on Narwhal.

Before PBS can run your job, it needs to know how many nodes you want, how many processes to run per node, and the total number of cores on each node. In general, you would specify one process per core, but you might want fewer processes depending on the programming model you are using. See Example Scripts (below) for alternate use cases. For simple cases, the number of nodes, cores per node, and processes per node are specified using the directive: #PBS -l select=N1:ncpus=N2:mpiprocs=N3 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node type, and N3 is the number of MPI processes per node.

3.2.5. SLB Directives

The Shared License Buffer (SLB) regulates shared license usage across all HPC systems by granting and enforcing license reservations for certain commercial software packages. If your job requires enterprise licenses controlled by SLB, you must enter the software and requested number of licenses, using the following directive: #PBS -l software=number_of_licenses

For more information about SLB, please see the SLB User Guide.

3.3. The Execution Block

After the directives have been supplied, the execution block begins. The execution block is the section of your script containing the actual work to be done. This includes any modules to be loaded and commands to be executed. This could also include executing or sourcing other scripts.

3.3.1. Basic Execution Scheme

The following describes the most basic scheme for a batch script. PLEASE ADOPT THIS BASIC EXECUTION SCHEME IN YOUR OWN BATCH SCRIPTS.

Setup

  • Set environment variables, load modules, create directories, transfer input files.
  • Changing to the right directory - By default PBS runs your job in your home directory, which can cause problems. To avoid this, cd into your $WORKDIR directory so your job runs on the local high-speed disk.

Launching the executable

  • Launch your executable using the launch command on Narwhal specific to your programming model.

Cleaning up

  • Archive your results and remove temporary files and directories.
  • Copy any necessary files to your home directory.

3.3.1.1. Setup

Using the batch script to set up your environment ensures your script runs in an automatic and consistent manner, but not all environment-setup tasks can be accomplished via scheduler directives, so you may have to set some environment variables yourself. Remember that commands to set up the environment must come after the scheduler directives. For MPI jobs, each MPI process is separate and inherits the environment set up by the batch script.

As part of the Baseline Configuration (BC) initiative, there is a common set of environment variables on all HPCMP allocated systems. These variables are predefined in your login, batch, and compute environments, making them automatically available at each center. We encourage you to use these variables in your scripts where possible. Doing so helps to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems within the HPCMP. Some BC environment variables are shown in the table below.

HPCMP Baseline Configuration Initiative Common Variables
Variable Description
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk). $WORKDIR is visible to both the login and compute nodes and should be used for temporary storage of active data related to your batch jobs.
$CENTER Your directory on the Center-Wide File System (CWFS).
$ARCHIVE_HOME Your directory on the archival file system that serves a given compute platform.

The complete list of BC environment variables is available in BC Policy FY05-04.

Setup considerations in customizing your batch job may include:

  • Creating a directory in $WORKDIR for your job to run in: NEWDIR=$WORKDIR/MyDir; mkdir -p $NEWDIR
  • Changing to the directory from which the job will run: cd $NEWDIR
  • Copying required input files to the job directory: cp From_directory/file $NEWDIR
  • Ensuring required modules are loaded: module load module_name
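
The setup steps above can be sketched as a single block. MyDir and the input file are hypothetical names; for illustration only, the block falls back to a temporary directory when $WORKDIR is not set:

```shell
# Use the center-provided $WORKDIR when available; otherwise a temp dir (illustration only)
WORKDIR=${WORKDIR:-$(mktemp -d)}

NEWDIR=$WORKDIR/MyDir        # hypothetical job-specific directory
mkdir -p "$NEWDIR"           # create the job directory
cd "$NEWDIR" || exit 1       # run from the job directory, not $HOME

# cp From_directory/file "$NEWDIR"   # copy required input files
# module load module_name            # load required modules
```
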

3.3.1.2. Launching an Executable

The command you'll use to launch a parallel executable within a batch script depends on the parallel library loaded at compile and execution time, the programming model, and the machine used to launch the application. It does not depend on the scheduler. Launch commands on Narwhal are discussed in detail in the Narwhal User Guide.

On Narwhal, the mpiexec command is used to launch a parallel executable. The basic syntax for launching an MPI executable is: mpiexec args executable pgmargs

where args are command-line arguments for mpiexec, executable is the name of an executable program, and pgmargs are command-line arguments for the executable.
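
For example, to launch 256 MPI processes (2 nodes with 128 processes each) of a hypothetical solver, the launch line inside the batch script might be:

```shell
# my_solver and input.dat are hypothetical names; the exact mpiexec options
# supported on Narwhal are described in the Narwhal User Guide
mpiexec -n 256 ./my_solver input.dat
```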

3.3.1.3. Cleaning Up

You are responsible for cleaning up and monitoring your workspace. The clean-up process generally entails deleting unneeded files and transferring important data left in the job directory after the job is completed. It is important to remember that $WORKDIR is a "scratch" file system and is not backed up. Currently, $WORKDIR files older than 21 days are subject to being purged. If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. Similarly, files transferred to $CENTER are not backed up, and files older than 180 days are subject to being purged. To prevent automatic deletion by the purge scripts, important data should be archived. See the Navy DSRC Archive Guide for more information on archiving data.

3.3.2. Advanced Execution Methods

A batch script is a text file containing directives and execution steps you "submit" to PBS. This script can be as simple as the basic execution scheme discussed above or include more complex customizations, such as compiling within the script or loading a file with a list of modules required for the executable. Below are additional considerations for the execution block of the batch script.

3.3.2.1. Environment Variables set by the Scheduler

In addition to environment variables inherited from your user environment (see Section 3.3.1.1), PBS sets other environment variables for batch jobs. The following table contains commonly used PBS environment variables.

PBS Environment Variables
Variable Description
$PBS_JOBID The job identifier assigned to a job or job array by the batch system
$PBS_O_WORKDIR The absolute path of the directory where the job was submitted
$PBS_JOBDIR The absolute path of the directory where the job runs
$PBS_JOBNAME The job name supplied by the user
$PBS_QUEUE The name of the queue in which the job executes
$PBS_ENVIRONMENT Indicates whether the job is batch (PBS_BATCH) or interactive (PBS_INTERACTIVE)
$PBS_O_PATH Copy of $PATH from your submission environment
$PBS_O_HOST Copy of $HOST from your submission environment
$PBS_O_SHELL Copy of $SHELL from your submission environment
$PBS_NODEFILE The name of a file containing a list of nodes assigned to the job
$PBS_ARRAY_INDEX The index number of a sub-job in a job array
See the qsub man page for a complete list of environment variables set by PBS.

Baseline Configuration Policy (BC Policy FY05-04) defines an additional set of environment variables with related functionality available on all systems. These variables can also be found in the Narwhal User Guide.
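
These variables can be used directly in the execution block of a batch script. A brief sketch (my_app and the output naming are illustrative):

```shell
# Return to the directory the job was submitted from
cd $PBS_O_WORKDIR

# List the nodes assigned to this job
cat $PBS_NODEFILE

# Tag output with the job ID so reruns do not overwrite each other
mpiexec ./my_app > run.${PBS_JOBID}.log 2>&1
```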

3.3.2.2. Loading Modules

Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so commands for applications can be found. For a full discussion on modules see the Narwhal User Guide and the Navy DSRC Modules Guide.

To ensure required modules are loaded at runtime, you can load them within the batch script before the executable code by using the command: module load module_name

3.3.2.3. Compiling on the Compute Nodes

You can compile on the compute nodes, either interactively or within a job script. On most systems this is the same as compiling on the login nodes, though in some cases there are differences between the login and compute nodes. See the Narwhal User Guide for more information.

3.3.2.4. Using the Transfer Queue

Before a job can run, the input data needs to be copied into a directory accessible by the job script. This can be done in a separate job script using the transfer queue. Because jobs in the transfer queue are not charged against your allocation, the transfer queue is advantageous for large file transfers, such as staging input data or moving data left in your $WORKDIR after your application completes.

When using the transfer queue, keep in mind:

  • The transfer queue may have additional bandwidth for data transfers.
  • Nodes in the transfer queue are shared with other users, so available compute and memory are likely lower.
  • Your allocation is not charged when using the transfer queue.

See Example Scripts for an example of using the transfer queue.
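
A data-staging script for the transfer queue might look like the following sketch. The archive command shown is the HPCMP archive utility described in the Navy DSRC Archive Guide; the directory and file names are placeholders:

```shell
#!/bin/bash
#PBS -q transfer              # transfer queue; not charged against allocation
#PBS -l select=1:ncpus=1      # a single core is sufficient
#PBS -l walltime=12:00:00
#PBS -A Project_ID

cd $WORKDIR/MyDir                    # hypothetical job directory
tar -czf results.tar.gz output/      # bundle results before archiving
archive put results.tar.gz           # copy the bundle to $ARCHIVE_HOME
```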

3.4. Requesting Specialized Nodes

3.4.1. GPU Nodes

The graphics processing unit, or GPU, has become one of the most important types of computing technology. A GPU is made up of many small cores working in parallel, making it well suited to specialized, highly parallel tasks.

To request GPU-accelerated nodes, add the ngpus option to the select directive, as follows: #PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=N4 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node type, N3 is the number of MPI processes per node, and N4 is the number of GPUs required per node.
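
For example, using the core count from the Node Configuration table in Section 2.2, a request for one GPU node with one GPU per node might look like:

```shell
#PBS -l select=1:ncpus=128:mpiprocs=128:ngpus=1
```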

3.4.2. Visualization Nodes

Visualization nodes are GPU-accelerated nodes with specialized hardware or software to support visualization tasks. Therefore, requesting a visualization node is the same as requesting a single-GPU node. Simply add the ngpus=1 option to the select directive, as follows, and submit your job to the viz or HIE queue: #PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=1 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node type, N3 is the number of MPI processes per node, and one GPU is requested.

3.4.3. Large-Memory Nodes

Large-Memory nodes are standard compute nodes with additional memory installed to support applications that require larger amounts of memory. To request a large-memory node, simply add the bigmem=1 option to the select directive, as follows: #PBS -l select=N1:ncpus=N2:mpiprocs=N3:bigmem=1 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node type, and N3 is the number of MPI processes per node. Note that bigmem=1 is simply a flag to indicate that large-memory nodes are being requested; it is not the number of nodes.

3.4.4. Single-GPU Machine Learning Accelerated (MLA) Nodes

Single-GPU MLA nodes are GPU-accelerated nodes with specialized hardware and software to support machine-learning tasks. To request Single-GPU MLA nodes, simply add the ngpus=1 option to the select directive, as follows, and submit your job to a non-viz, non-HIE queue: #PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=1 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node, and N3 is the number of MPI processes per node. Note that ngpus=1 is simply a flag to indicate that Single-GPU MLA nodes are being requested; it is not the number of nodes.

3.4.5. Dual-GPU Machine Learning Accelerated (MLA) Nodes

Dual-GPU MLA nodes are GPU-accelerated nodes with specialized hardware and software to support machine-learning tasks. To request Dual-GPU MLA nodes, simply add the ngpus=2 option to the select directive, as follows: #PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=2 where N1 is the number of nodes you are requesting, N2 is the number of cores on the targeted node, and N3 is the number of MPI processes per node. Note that ngpus=2 is simply a flag to indicate Dual-GPU MLA nodes are being requested; it is not the number of nodes.

3.5. Advanced Considerations

3.5.1. Heterogeneous Computing (Using Multiple Node Types) and Node Distribution

Heterogeneous computing refers to using more than one type of node, such as CPU, GPU, or large-memory nodes. By assigning different workloads to specialized nodes suited for diverse purposes, performance and energy efficiency can be vastly improved. Node distribution refers to assigning tasks to groups or chunks of nodes. This section discusses how to schedule different node types and organize groups of nodes (heterogeneous or homogeneous) so they can be assigned different tasks.

A chunk is a set of resources allocated as a unit to a job. All parts of a chunk come from the same node. A chunk is often referred to interchangeably as a node, but technically a chunk can be smaller (yet never larger) than a node. The distribution of tasks across chunks can be refined using a form of the PBS select statement as follows:

#PBS -l select=N1:ncpus=N2:mpiprocs=N3[+N4:ncpus=N5:mpiprocs=N6[+...]]

where N1 and N4 are the number of "chunks" you are requesting, N2 and N5 are the number of cores in each chunk, and N3 and N6 are the number of MPI processes per chunk.

The following example selects one chunk of 128 cores and two chunks of 64 cores:

#PBS -l select=1:ncpus=128:mpiprocs=128+2:ncpus=64:mpiprocs=64

How the three chunks get placed onto nodes depends upon the placement setting, which can be "pack" or "scatter." Pack fits the chunks on as few nodes as possible. Scatter places each chunk on its own node, if possible. Because there are 128 cores per compute node, the following placement statement would "pack" all three chunks onto a total of two nodes because they will fit:

#PBS -l place=pack

The following placement statement would "scatter" one chunk per node for a total of three nodes, with the last two nodes each using half of their cores:

#PBS -l place=scatter

It is also possible to request heterogeneous resources as part of a single select statement. The following example requests two standard compute nodes, one Large-Memory node, and one GPU node:

#PBS -l select=2:ncpus=128:mpiprocs=128+1:bigmem=1+1:ngpus=1

Select statements have numerous other options. See the qsub and pbs_resources man pages for more information.

3.5.2. Hyper-Threading

On Narwhal, hyper-threading is enabled by default. This allows users to run two tasks or threads per core instead of just one. For example, users can set mpiprocs to two times the cores per node (256) even though there are only 128 physical cores on each node:

#PBS -l select=4:ncpus=128:mpiprocs=256

The number of nodes requested for the job is available in the environment variable $BC_NODE_ALLOC. To determine the number of physical cores requested for the job, multiply $BC_NODE_ALLOC by the number of cores per node ($BC_CORES_PER_NODE); with hyper-threading enabled, up to twice that many tasks can run.
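The arithmetic can be sketched in the shell (the node and core counts are assumptions matching the four-node example above):

```shell
# Hyper-threaded task capacity for the 4-node example request:
# 4 nodes x 128 physical cores/node x 2 hardware threads per core.
NODES=4
CORES_PER_NODE=128
TASKS=$(( NODES * CORES_PER_NODE * 2 ))
echo "$TASKS hyper-threaded tasks"
```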

3.6. Advance Reservation Service Jobs

The Advance Reservation Service (ARS) provides a web-based interface to batch schedulers on most allocated HPC resources in the HPCMP. This service allows allocated users to reserve resources for use at specific times and for specific durations. It works in tandem with selected schedulers to allow restricted access to those reserved resources.

For Advance Reservation Service (ARS) jobs, you must submit a reservation request. When your reservation is made, you receive a confirmation page and an email with the pertinent details of your reservation, including the ARS_ID. It is your responsibility to either use or cancel your reservation. Unless you cancel it, your allocation is charged for the full time on the reserved nodes whether you use them or not. For more information, such as how to cancel a reservation, see the ARS User Guide.

To use the reserved nodes, you must log onto the selected system and submit a job specifying the ARS_ID as the queue, as follows:

#PBS -q ARS_ID

4. Submitting and Managing Your Job

Once your batch script is ready, submit it to the scheduler for execution, and the scheduler will generate a job according to the parameters set in the script. Submitting a batch script can be done with the qsub command: qsub batch-script-name

Because batch scripts specify the resources for your job, you won't need to specify any resources on the command line. However, you can override or add any job parameter by providing the specific resource as a flag directly on the qsub command line. Directives supplied in this way override the same directives if they are already included in your script. The syntax to supply directives on the command line is the same as within a script except #PBS is not used. For example: qsub -l walltime=HHH:MM:SS batch-script-name

4.1. Scheduler Dos and Don'ts

When submitting your job, it's important to keep in mind these general guidelines:

  • Request only the resources you need.
  • Be aware of limits. If you request more resources than the hardware can offer, the scheduler may not reject the job outright, but the job may remain stuck in the queue forever.
  • Be aware of the available memory limit. In general, the available memory per core is (memory_per_node)/(cores_in_use_on_the_node).
  • The scheduler might not support pinning, so you might want to do this manually.
  • There may be per-user quotas on the system.

You should also keep in mind that Narwhal is a shared resource. Behavior that negatively impacts other users or stresses the system administrators is not desirable. Below are some suggestions to be followed for a happy HPC community.

  • Submitting 1000 jobs to perform 1000 tasks is naïve and can overload the scheduler. If these tasks are serial, it also wastes your allocation hours across 1000 nodes. Job arrays (see Section 6) are strongly encouraged.
  • If you expect your job to run for several days, split it into smaller jobs. You'll get reduced queue time and increased stability (e.g., against node failure). You can either split your job manually and submit as separate jobs or submit your jobs sequentially within a single script as described in the Navy DSRC Archive Guide.
  • Send your job to the right queue. It is important to understand in which queue the scheduler will run your job as most queues have core and walltime limits.
  • Do not run compute-intensive tasks from a login node. Doing so slows the login nodes, causing login delays for other users and may prompt administrators to terminate your tasks, often without notice.

4.2. Job Management Commands

Once you submit your job, there are commands available to check and manage your job submission. For example:

  • Determining the status of your job
  • Cancelling your job
  • Putting your job on hold
  • Releasing a job from hold

The table below contains commands for managing your jobs. Use man command or the command --help option to get more information about a command.

PBS Job Management Commands
Command Description
cqstat* Display running and pending jobs, including estimated start times.
pbsnodes Display status of all PBS batch nodes.
qdel job_id Delete a job.
qhold job_id Place a job on hold.
qrls job_id Release a job from hold.
qstat Display the status of all jobs.
qstat job_id Check the status of a job.
qstat -q Display the status of all queues.
qsub script_file Submit a job.

Warning: qstat -f can produce a significant amount of data since the output also contains the full list of job environment variables. Avoid using it with the -a flag or for simple monitoring of job states.

* Estimated start times with cqstat are only available for a small number of jobs and are subject to frequent change as new jobs are queued.
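A typical management sequence might look like this (the job ID shown is hypothetical; qsub prints the actual ID at submission time):

```shell
qsub run.pbs      # submit the script; prints a job ID such as 384294.narwhal
qstat 384294      # check the status of that job
qhold 384294      # place the job on hold
qrls 384294       # release the job from hold
qdel 384294       # delete the job if it is no longer needed
```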

4.3. Job States

When checking the status of a job, the state of the job is listed. Jobs typically pass through several states during their lifetime. The PBS state codes you will most often see are Q (queued), R (running), H (held), E (exiting), W (waiting), and T (in transit). An explanation of each state follows.

PBS Job States
State Description
Q The job is queued, eligible to run, or routed.
R The job is running.
H The job is held.
E The job is exiting after having run.
W The job is waiting for its execution time (applies only if a start time was declared).
T The job is being moved.

4.4. Baseline Configuration Common Commands

The Baseline Configuration Team (BCT) has established the following set of common commands that are consistent across all systems. Most are custom and not inherent in the scheduler.

BCT Common Commands
Command Description
bcmodule Executes like the standard module command but has numerous improvements and new features.
check_license Checks the status of HPCMP shared applications grouped into two distinct categories: Software License Buffer (SLB) applications and non-SLB applications.
cqstat Displays information about jobs in the batch queueing system.
node_use Displays memory-use and load-average information for all login nodes of the system on which it is executed.
qflag Collects information about the user and the user's jobs and sends a message about the reported problem without any need to leave the HPC system.
qhist Prints a full report on a running or completed batch job with an option to include chronological log file entries for the job from the batch queueing system. The command can also list all completed jobs for a given user over a specified number of days in the past.
qpeek Returns the standard output (stdout) and standard error (stderr) messages for any submitted batch job from the start of execution until the job run is complete.
qview Displays various reports about jobs in the batch queuing system.
show_queues Displays current batch queuing system information.
show_storage Produces two reports on quota and usage information.
show_usage Produces two reports on the allocation and usage status of each subproject under which a user may compute.

5. Optional Directives

In addition to the required directives mentioned above, PBS has many other directives, but most users only use a few of them. Some of the more useful optional directives are summarized below.

5.1. Job Application Directive (Recommended)

The application directive allows you to identify the application being used by your job. This directive is used for HPCMP accountability and administrative purposes and helps the HPCMP accurately assess application usage and ensure adequate software licenses and appropriate software are purchased. While not required, using this directive is strongly encouraged as it provides valuable data to the HPCMP regarding application use.

To use this directive, add a line in the following form to your batch script:

#PBS -l application=application_name

Or, add it to your qsub command, as follows:

qsub -l application=application_name ...

A list of application names for use with this directive can be found in $SAMPLES_HOME/Application_Name/application_names on Narwhal.

5.2. Job Name Directive

The -N directive allows you to give your job a name that's easier to remember than a numeric job ID. The PBS environment variable $PBS_JOBNAME inherits this value and can be used instead of the job ID to create job-specific output directories. To use this directive, add the following to your batch script:

#PBS -N job_name

Or, add it to your qsub command, as follows:

qsub -N job_name ...
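For example, the job name can be reused inside the script to keep each run's output separate (a sketch; $PBS_JOBNAME is set by PBS at run time, and the name shown is hypothetical):

```shell
#PBS -N trial_run

# In the execution block, create and use a job-specific output directory
mkdir -p $WORKDIR/$PBS_JOBNAME
cd $WORKDIR/$PBS_JOBNAME
```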

5.3. Job Reporting Directives

Job reporting directives allow you to control what happens to standard output and standard error messages generated by your script. They also allow you to specify email options to be executed at the beginning and end of your job. The following table and sections describe the job reporting directives:

PBS Job Reporting Directives
Directive Description
#PBS -o filename Redirect standard output (stdout) to the named file. Appends to the file if it exists; otherwise creates the file.
#PBS -e filename Redirect standard error (stderr) to the named file. Appends to the file if it exists; otherwise creates the file.
#PBS -j eo Merge stderr and stdout into stderr.
#PBS -j oe Merge stderr and stdout into stdout.
#PBS -M email_address Set the email address(es) to be used for email alerts.
#PBS -m b Send email when the job begins.
#PBS -m e Send email when the job ends.
#PBS -m be Send email when the job begins and ends.
5.3.1. Redirecting stderr and stdout

By default, messages written to stdout and stderr are captured for you in files named x.ojob_id and x.ejob_id, respectively, where x is the name of the script, the name specified with the -N directive, or a name assigned by PBS, and job_id is the ID of the job. If you want to change this behavior, the -o and -e directives allow you to redirect stdout and stderr messages to different named files. To combine stdout and stderr into a single file, use the -o directive with the -j oe directive, or use the -e directive with the -j eo directive. For example:

#PBS -o filename.out
#PBS -j oe

5.3.2. Setting up Email Alerts

Many users want to be notified when their jobs begin and end. The -m directive makes this possible. If you use this directive, you also need to supply the -M directive with one or more email addresses to be used. For example:

#PBS -m be
#PBS -M user@email.address[,user2@email.address]

5.4. Job Environment Directives

Job environment directives allow you to control the environment in which your script will operate. This section describes some useful variables in setting up the script environment.

PBS Job Environment Directives
Directive Description
qsub -I Request an interactive job.
#PBS -V Export all environment variables from your login environment to your batch environment.
#PBS -v variable1, variable2 Export specific environment variables from your login environment to your batch environment.
#PBS -v variable=value, ... Export specific environment variables with specific values to your batch environment.
#PBS -l pmem=sizeMB | sizeGB Memory size per process.
5.4.1. Interactive Batch Shell

When you log into Narwhal, you will be running in an interactive shell on a login node. The login nodes provide login access for Narwhal and support such activities as compiling, editing, and general interactive use by all users. Please note the Navy DSRC Login Node Abuse policy.

The preferred method to run resource intensive interactive executions is to use an interactive batch session. An interactive batch session allows you to run interactively (in a command shell) on a compute node after waiting in the batch queue.

Note: Once an interactive session starts, it uses the entire requested block of CPU time and other resources unless you exit from it early, even if you don't use it. To avoid unnecessary charges to your project, don't forget to exit an interactive session once finished.

The -I directive allows you to request an interactive batch shell. Within that shell, you can perform normal Linux commands, including launching parallel jobs. To use -I, append it to your qsub request. For example: qsub your_pbs_options -X -I

The -X directive enables X-Windows access and may be omitted if your interactive job doesn't use a GUI.

Windows users: Please be aware the HPC Help Desk does not provide support for the use of X11 clients with the HPCMP Kerberos Kit for Windows.

Interactive batch sessions are scheduled just like normal batch jobs, so depending on how many other batch jobs are queued, it may take some time. Once your interactive batch shell starts, you will be logged into the first compute node of those assigned to your job. At this point, you can run or debug interactive applications, execute job scripts, post-process data, etc. You can launch parallel applications on your assigned compute nodes by using an MPI or other parallel launch command.
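A fully spelled-out interactive request might look like the following (Project_ID and the resource values are placeholders to adapt):

```shell
# One node, all cores, one hour, interactive (-I); add -X only if a GUI is needed
qsub -A Project_ID -q standard -l select=1:ncpus=128:mpiprocs=128 \
     -l walltime=01:00:00 -I
```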

The HPC Interactive Environment (HIE) provides an HIE queue specifically for interactive jobs. It offers longer job times and has nodes reserved only for HIE, so queue wait times are sometimes much shorter. However, HIE has limitations, such as only allowing the use of a single node at a time. See the HIE User Guide for more information before using the HIE queue.

5.4.2. Export Environment Variables

Batch jobs run with their own environment, separate from the login environment from which the batch job is launched. If your application is dependent on environment variables set in the login environment, you need to export these variables from the login environment to the batch environment.

The -V directive tells PBS to export all environment variables from your login environment to your batch environment. To use this directive, add the following line to your batch script:

#PBS -V

Or, add it to your qsub command, as follows:

qsub -V ...

For exporting specific environment variables from your login environment, use the -v directive. To use this directive, add a line in the following form to your batch script:

#PBS -v my_variable

Or, add it to your qsub command, as follows:

qsub -v my_variable ...

Using either of these methods, multiple comma-separated variables can be included. It is also possible to set values for variables exported in this way, as follows: qsub -v my_variable=my_value, ...

5.4.3. Memory Size

The pmem=size directive specifies the maximum amount of physical memory used by any single process in the job. The size is given in bytes unless a suffix such as MB or GB is supplied. For example, if the job runs four processes and each needs up to 2 GB of memory, the directive would read:

#PBS -l pmem=2GB

5.5. Job Dependency Directives


PBS Job Dependency Directives
Directive Description
after Execute this job after listed jobs have begun.
afterok Execute this job after listed jobs have terminated without error.
afternotok Execute this job after listed jobs have terminated with an error.
afterany Execute this job after listed jobs have terminated for any reason.
before Listed jobs may be run after this job begins execution.
beforeok Listed jobs may be run after this job terminates without error.
beforenotok Listed jobs may be run after this job terminates with an error.
beforeany Listed jobs may be run after this job terminates for any reason.

Job dependency directives allow you to specify dependencies your job may have on other jobs, giving you control over the order in which jobs run. These directives generally take the following form:

#PBS -W depend=dependency_expression

where dependency_expression is a comma-delimited list of one or more dependencies, and each dependency is of the form:

type:jobids

where type is one of the directives listed in the table above, and jobids is a colon-delimited list of one or more job IDs your job depends upon.

For example, to run a job after completion (success or failure) of job ID 1234:

#PBS -W depend=afterany:1234

To run a job after successful completion of job ID 1234:

#PBS -W depend=afterok:1234
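Dependencies can also be wired up at submission time, using the job ID that qsub prints. A sketch of a hypothetical two-step pipeline (the script names are placeholders):

```shell
# Submit the first job and capture its ID from qsub's output
JOBID=$(qsub simulate.pbs)

# The second job starts only if the first terminates without error
qsub -W depend=afterok:$JOBID postprocess.pbs
```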

For more information about job dependencies, see the qsub man page.

6. Job Arrays

Imagine you have several hundred jobs that are all identical except for two or three parameters whose values vary across a range of input values. Submitting all these jobs individually would not only be tedious but would also incur a lot of overhead, which would impose a significant strain on the scheduler, negatively impacting all scheduled jobs. This example is not an uncommon use case, and it is the reason why job arrays were invented.

Job arrays let you submit and manage collections of similar jobs quickly and easily within a single script, which can significantly relieve the strain on the queueing system. Resource directives are specified once at the top of a job array script and are applied to each array task. As a result, each task has the same initial options (e.g., size, wall time, etc.) but may have different input values.

If your use case includes 200 or more similar jobs that vary by only a few parameters, job arrays are highly recommended.

To implement a PBS job array in your job script, include the following directives:

#PBS -r y
#PBS -J n-m[:step]

where n is the starting index, m is the ending index, and the optional step is the increment. The -r y directive flags the job as rerunnable, which PBS requires for job arrays; the -J directive itself marks the script as a job array. PBS then queues this script in floor((m-n)/step)+1 instances, each of which receives its own index in the $PBS_ARRAY_INDEX environment variable. You can use the command echo $PBS_ARRAY_INDEX to output the unique index of a job instance. After submitting a job array, the PBS job ID appears with brackets, [ ], in the output of qstat. So, a job ID that would normally look like "384294" instead looks like "384294[]".

Let's look at an explicit example using the PBS directive:

#PBS -J 1-999:2

PBS runs 500 instances (floor((999-1)/2)+1 = 500) of your script, each with a unique value of $PBS_ARRAY_INDEX: 1, 3, 5, 7, ..., 999.
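The instance count can be checked with ordinary shell arithmetic (integer division in the shell truncates, which supplies the floor):

```shell
# Number of array instances for "#PBS -J n-m:step"
n=1
m=999
step=2
count=$(( (m - n) / step + 1 ))
echo "$count instances"
```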

7. Example Scripts

This section provides sample scripts you may copy and use. All scripts follow the anatomy presented in Section 3 and have been tested in their respective scheduler environment. When you use any of these examples, remember to substitute your own Project_ID, job name, output and error files, executable, and clean up. More advanced scripts can be found under the $SAMPLES_HOME directory on the system. Assorted flavors of Hello World are provided in Section 8. These simple programs can be used to test these scripts.

The following Baseline Configuration variables are used in the scripts below.

Baseline Configuration Batch Variables
Variable Description
$BC_CORES_PER_NODE The number of cores per node for the compute node on which a job is running.
$BC_MEM_PER_NODE The approximate maximum memory per node available to an end user program (in integer MB) for the compute node type to which a job is being submitted.
$BC_MPI_TASKS_ALLOC Intended to be referenced from inside a job script, contains the number of MPI tasks/ranks allocated for a particular job.
$BC_NODE_ALLOC Intended to be referenced from inside a job script, contains the number of nodes allocated for a particular job.

7.1. Simple Batch Script

The following is a very basic script to demonstrate requesting resources (including all required directives), setting up the environment, specifying the execution block (i.e., the commands to be executed), and cleaning up after your job completes. Save this as a regular text file using your editor of choice.

#!/bin/bash
#################################################################
# Description:  This is a basic bash shell script for a simple job.
#               The job can be submitted to the standard queue
#               with the following command:  "qsub basic.pbs"
# Use the "show_usage" command to get your PROJECT_ID(s).
#################################################################
# REQUIRED DIRECTIVES   -----------------------------------------
#################################################################
# Account to be charged
#PBS -A Project_ID

# Set max wall time to 10 minutes
#PBS -l walltime=00:10:00

# Run the job in the standard queue
#PBS -q standard

# Select 1 node with all 128 cores and 16 MPI processes
#PBS -l select=1:ncpus=128:mpiprocs=16

###################################################################
# RECOMMENDED DIRECTIVES   ---------------------------------------
###################################################################
#PBS -l application=other

###################################################################
# OPTIONAL DIRECTIVES   -------------------------------------------
###################################################################
# Name the job 
#PBS -N jobName

# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err

###################################################################
# EXECUTION BLOCK   -----------------------------------------------
###################################################################
# Change to the default working directory

cd $WORKDIR
echo "Working directory is $WORKDIR"

echo "-----------------------"
echo "-- Executable Output --"
echo "-----------------------"

# Run the job
# Note: From the '#PBS -l select...' statement above
#  BC_MPI_TASKS_ALLOC = 1*16 = (select)*(mpiprocs)
# mpiexec -n 16 ./executable
# OR
mpiexec -n $BC_MPI_TASKS_ALLOC ./executable

###################################################################
# CLEAN UP --------------------------------------------------------
###################################################################
# Remove temporary files and move data to non-scratch directory 
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.2. Job Information Batch Script

The following examples can be included in the Execution block of any job script. The first example shows Baseline Configuration environment variables available on all HPCMP systems. The second example shows scheduler-specific variables.

#################################################################
# Job information set by Baseline Configuration variables
#################################################################
echo ----------------------------------------------------------
echo "Type of node                    " $BC_NODE_TYPE
echo "CPU cores per node              " $BC_CORES_PER_NODE
echo "CPU cores per standard node     " $BC_STANDARD_NODE_CORES
echo "CPU cores per accelerator node  " $BC_ACCELERATOR_NODE_CORES
echo "CPU cores per big memory node   " $BC_BIGMEM_NODE_CORES
echo "Hostname                        " $BC_HOST
echo "Maximum memory per node         " $BC_MEM_PER_NODE
echo "Number of tasks allocated       " $BC_MPI_TASKS_ALLOC
echo "Number of nodes allocated       " $BC_NODE_ALLOC
echo "Working directory               " $WORKDIR
echo ----------------------------------------------------------

#################################################################
# Output some useful job information.  
##############################################################
echo "-------------------------------------------------------"
echo "User                          " $PBS_O_LOGNAME
echo "User home directory           " $PBS_O_HOME
echo "Job submission directory      " $PBS_O_WORKDIR
echo "Submit host                   " $PBS_O_HOST
echo "Job name                      " $PBS_JOBNAME
echo "Job identifier                " $PBS_JOBID
echo "Job Type                      " $PBS_ENVIRONMENT
echo "Working directory             " $WORKDIR
echo "Job execution directory       " $PBS_JOBDIR
echo "Job Originating queue         " $PBS_O_QUEUE
echo "Job execution queue           " $PBS_QUEUE
echo "----------------------------------------------------------"

7.3. OpenMP Script

To run a pure OpenMP job, specify the number of cores you want from the node (ncpus). Also specify the number of threads (ompthreads); otherwise, $OMP_NUM_THREADS defaults to the value of ncpus, possibly resulting in poor performance. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard


#### Use a single node
#PBS -l select=1:ncpus=128:mpiprocs=1


##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobname

# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err

##################################################################
# EXECUTION BLOCK   ----------------------------------------------
##################################################################
# Change to the default working directory
cd $WORKDIR
echo "Working directory is $WORKDIR"


export OMP_NUM_THREADS=$BC_CORES_PER_NODE


# Run the job from the default working directory

./openMP_executable


##################################################################
# CLEAN UP   -----------------------------------------------------
##################################################################

# Remove temporary files and move data to non-scratch directory 
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.4. Hybrid (MPI/OpenMP) Script

Hybrid MPI/OpenMP scripts are for executables that use MPI between nodes and OpenMP threads within each node. The following script is an example of hybrid MPI and OpenMP. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard


#PBS -l select=2:ncpus=128:mpiprocs=1


##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobname

# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err

##################################################################
# EXECUTION BLOCK   ----------------------------------------------
##################################################################
cd $WORKDIR
echo "working directory is ${WORKDIR}"


export OMP_WAIT_POLICY=PASSIVE
export OMP_NUM_THREADS=$BC_CORES_PER_NODE
#
# 2 nodes (256 cores), one MPI process per node
# 128 OpenMP threads per node (one per core)
# mpiexec -n 2 -d 128 hybrid_executable
# OR
mpiexec -n $BC_NODE_ALLOC -d $BC_CORES_PER_NODE hybrid_executable

#
##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory 
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.5. Accessing More Memory per Process

By default, an MPI job runs one process per core, with all processes sharing the available memory on the node. On Narwhal, each compute node has 128 cores and 238 GB of memory. Assuming one process per core, the memory per process is 238 GB/128, or roughly 1.9 GB.

If you need more memory per process, your job needs to run fewer MPI processes per node, i.e., number_of_processes_per_node < 128. For example, if you request 4 nodes and use only 16 of the 128 cores on each, the job runs a total of 4*16 = 64 MPI processes. Each of the 16 MPI processes per node then has access to approximately 238 GB/16 (about 14.9 GB) of memory.
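The per-process memory arithmetic can be sketched in the shell (integer GB, so the result rounds down):

```shell
# 238 GB per node shared by 16 MPI processes per node
MEM_PER_NODE_GB=238
PROCS_PER_NODE=16
PER_PROC_GB=$(( MEM_PER_NODE_GB / PROCS_PER_NODE ))
echo "about $PER_PROC_GB GB per MPI process"
```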

The following script demonstrates this example by requesting 4 nodes and setting 16 processes per node. The job runs for ten minutes in the standard queue. For more information, refer to the Samples section in the Narwhal User Guide. Note: Differences between the Simple Batch Script and this script are highlighted.

Another way to get more memory per process is to run on bigmem nodes, which is discussed in the next section. However, because there are few bigmem nodes on the system, if you need many cores, bigmem nodes may not be an option.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard

# Starts 64 MPI processes; only 16 processes on each node
# This will result in each process having a memory size of 238 GB/16
#PBS -l select=4:ncpus=128:mpiprocs=16

##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobname
# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err
##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
cd $WORKDIR
echo "working directory is ${WORKDIR}"

# Execute the application on 4 nodes using
# 16 processes on each node for a total of 64 MPI processes 
mpiexec -n 64 ./executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.6. GPU Script

Here is a short example of a script for submitting jobs to a GPU node. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# Script must be run in the nvidia environment:
#       module swap PrgEnv-cray PrgEnv-nvidia
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard


#PBS -l select=2:ncpus=128:ngpus=1


##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobName

# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err

##################################################################
# EXECUTION BLOCK   ----------------------------------------------
##################################################################
cd $WORKDIR

# Run the job from the default working directory

./GPU_executable

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.7. Data Transfer Script

The transfer queue is a special-purpose queue for transferring or archiving files. It has access to $HOME, $ARCHIVE_HOME, $WORKDIR, and $CENTER. Jobs running in the transfer queue are charged for a single core against your allocation. Differences between the Simple Batch Script and this script are highlighted.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00

#PBS -q transfer

#PBS -l select=1:ncpus=1

##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobName
#PBS -o filename.out
#PBS -e filename.err

##################################################################
# EXECUTION BLOCK ------------------------------------------------
##################################################################
# Change to the work directory

cd $WORKDIR

# Assume all files to be transferred from are in $WORKDIR/from_dir
export FROM_DIR=$WORKDIR/from_dir

# Assume all files are to be transferred to $ARCHIVE_HOME
export TO_DIR=$ARCHIVE_HOME

# Create a gzipped tar file to reduce data transfer time
tar -czf $FROM_DIR.tar.gz -C $FROM_DIR .

# If needed, uncomment to create a directory on the archive
# archive mkdir -C $TO_DIR

# List the archive directory contents to verify data transfer
archive ls -al $TO_DIR

echo "Transfer job ended"
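Listing a tarball's members is a cheap way to verify what was packed before relying on the transfer. A self-contained sketch, using a temporary directory in place of $WORKDIR/from_dir:

```shell
# Stand-in for $WORKDIR/from_dir so the sketch runs anywhere
tmp=$(mktemp -d)
mkdir "$tmp/from_dir" && echo test > "$tmp/from_dir/sample.txt"

# Pack the directory, then list the archive's members to verify
tar -czf "$tmp/from_dir.tar.gz" -C "$tmp/from_dir" .
tar -tzf "$tmp/from_dir.tar.gz"

rm -r "$tmp"
```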

7.8. Job Array Script

As was discussed in Section 6, job arrays leverage the scheduler's ability to create multiple jobs from one script. Situations where this is useful include:

  • Establishing a list of commands to run and have a job created from each command in the list.
  • Running many parameters against one set of data or analysis program.
  • Running the same program multiple times with different sets of data.
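A common pattern for the last two cases is to derive each task's input from the array index. A minimal sketch (the data_N.dat naming is hypothetical; PBS sets PBS_ARRAY_INDEX inside a real array job):

```shell
# PBS sets PBS_ARRAY_INDEX inside an array job; the default here is
# only so the sketch runs outside the scheduler
PBS_ARRAY_INDEX=${PBS_ARRAY_INDEX:-2}
INPUT=data_${PBS_ARRAY_INDEX}.dat   # hypothetical per-task input file
echo "task $PBS_ARRAY_INDEX reads $INPUT"
```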

Creating directories and output files unique to each task of the job array is essential. This is shown in the script below. Use qsub -r y job_script to submit the job to the PBS scheduler; the -r y flag marks the job as rerunnable. After submitting the job array script, use qstat -sw jobArrayID[] (e.g., qstat -sw 468028[]) to show the status of a queued job array job.

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:20:00
#PBS -q standard
#PBS -l select=1:ncpus=128:mpiprocs=128

# Set up a job array from 1 to 4 in steps of 1.
#PBS -r y
#PBS -J 1-4:1

##################################################################
# EXECUTION BLOCK   ----------------------------------------------
##################################################################
cd $WORKDIR
JA_ID=$(echo $PBS_JOBID | cut -d'[' -f1)
JA_DIR=$WORKDIR/Job_Array.o${JA_ID}

# Output Job ID and Job array index information
echo "PBS Job ID PBS_JOBID is $PBS_JOBID"
echo "PBS job array index PBS_ARRAY_INDEX value is $PBS_ARRAY_INDEX"
#
# Make a directory for each task in the array
mkdir $JA_DIR
#
# Change into to task specific directory to run each task
cd $JA_DIR
#
# Retrieve the job's binary
cp $WORKDIR/executable $JA_DIR/executable_$PBS_ARRAY_INDEX
#
# Run job and redirect output
export outfile=$JA_DIR/${JA_ID}_$PBS_ARRAY_INDEX
mpiexec -n 128 ./executable_$PBS_ARRAY_INDEX &> $outfile

##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory 
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

7.9. Large-Memory Node Script

The standard compute nodes on Narwhal contain 238 GB of RAM and 128 cores. That works out to 1.86 GB/core. This is fine for most jobs running on the system. However, some jobs require more memory per core. To accommodate these jobs, Narwhal has 26 large-memory nodes with 995 GB of memory. You can allocate a job on the large-memory nodes by submitting a large-memory job script. Differences between the Simple Batch Script and this script are highlighted.
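The same division applies when sizing a job for the large-memory nodes. A minimal shell sketch, assuming a hypothetical requirement of 30 GB per process:

```shell
# 995 GB is the bigmem node memory quoted above; 30 GB per process is
# a hypothetical application requirement
node_mem_gb=995
need_gb=30
echo "at most $((node_mem_gb / need_gb)) processes per node"
```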

#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES   ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00

#PBS -q standard

#PBS -l select=1:ncpus=128:mpiprocs=num_processes:bigmem=1

##################################################################
# OPTIONAL DIRECTIVES   ------------------------------------------
##################################################################
#PBS -N jobName

# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err

##################################################################
# EXECUTION BLOCK   ----------------------------------------------
##################################################################
cd $WORKDIR
echo "working directory is ${WORKDIR}"

mpiexec -n num_processes ./executable
#
##################################################################
# CLEAN UP -------------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)

# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.

8. Hello World Examples

This section provides several variations of the basic hello.c program. Refer to the Narwhal User Guide for information about compiling.

8.1. C Program - hello.c

/**************************************************************
* A simple program to demonstrate an MPI executable
***************************************************************/
#include <mpi.h>
#include <stdio.h>

int rank;
int numRanks;
char processorName[MPI_MAX_PROCESSOR_NAME];
int nameLen;

int main(int argc, char** argv) {

    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of ranks
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

    // Get the rank of this process
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get the name of the processor
    MPI_Get_processor_name(processorName, &nameLen);

    // Print messages from each processor
    printf("Hello from processor %s - ", processorName);
    printf("I am rank %d out of %d ranks\n", rank, numRanks);

    // Finalize the MPI environment
    MPI_Finalize();
} // end main

8.2. OpenMP - hello-OpenMP.c

/*************************************************************
* A simple program to demonstrate a pure OpenMP executable
***************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>           // needed for OpenMP
#include <unistd.h>        // only needed for definition of gethostname
#include <sys/param.h>     // only needed for definition of MAXHOSTNAMELEN

int main (int argc, char *argv[]) {
  int th_id, nthreads;
  char foo[] = "Hello";
  char bar[] = "World";
  char hostname[MAXHOSTNAMELEN];
  gethostname(hostname, MAXHOSTNAMELEN);
  #pragma omp parallel private(th_id)
  {
    th_id = omp_get_thread_num();
    printf("%s %s from thread %d on %s!\n", foo, bar, th_id, hostname);
    #pragma omp barrier
    if ( th_id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There were %d threads on %s!\n", nthreads, hostname);
    }
  }
  return EXIT_SUCCESS;
}
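At run time, the number of threads this program reports is controlled by the standard OpenMP environment variable OMP_NUM_THREADS (a portable OpenMP control, not a Narwhal-specific setting):

```shell
# Standard OpenMP control; 8 is an arbitrary example value
export OMP_NUM_THREADS=8
echo "OpenMP will use $OMP_NUM_THREADS threads"
```

With the program compiled (see the Narwhal User Guide), running it after this export should print one "Hello World" line per thread.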

8.3. Hybrid MPI/Open MP - hello-hybrid.c

/*****************************************************************
* A simple program to demonstrate a Hybrid MPI/OpenMP executable
*****************************************************************/
#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d of %d, from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
}
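For a hybrid job, process and thread placement are requested together. A sketch of the relevant directives, assuming 4 MPI ranks per node with 32 OpenMP threads per rank on 2 nodes (ompthreads is a standard PBS select resource; verify against the queue's policy before relying on it):

```shell
# 2 nodes x 4 ranks = 8 MPI processes, each spawning 32 threads
#PBS -l select=2:ncpus=128:mpiprocs=4:ompthreads=32
export OMP_NUM_THREADS=32
mpiexec -n 8 ./hello-hybrid
```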

8.4. Cuda - hello-cuda.cu

/*****************************************************************
* A simple program to demonstrate a CUDA/GPU executable
* May require a module swap: 
*		module swap PrgEnv-cray PrgEnv-nvidia
* Check User manual for compiling GPU code
******************************************************************/
#include <stdio.h>
#include <stdlib.h>

#include <cuda.h>

void cuda_device_init(void)
 {
    int ndev;
    cudaGetDeviceCount(&ndev);
    cudaDeviceSynchronize();
    if (ndev == 1)
       printf("There is %d GPU.\n",ndev);
    else
       printf("There are %d GPUs\n",ndev);

    for(int i=0;i<ndev;i++) {
       cudaDeviceProp pdev;
       cudaGetDeviceProperties(&pdev,i);
       cudaDeviceSynchronize();
       printf("Hello from GPU %d\n",i);
       printf("GPU type  : %s\n",pdev.name);
       printf("Memory Global: %lu Mb\n",\
                       (unsigned long)((pdev.totalGlobalMem+1024*1024)/1024/1024));
       printf("Memory Const : %lu Kb\n",(unsigned long)(pdev.totalConstMem/1024));
       printf("Memory Shared: %lu Kb\n",(unsigned long)(pdev.sharedMemPerBlock/1024));
       printf("Clock Rate  : %.3f GHz\n",pdev.clockRate/1000000.0);
       printf("Number of Processors  : %d\n",pdev.multiProcessorCount);
       printf("Number of Cores  : %d\n",8*pdev.multiProcessorCount);
       printf("Warp Size : %d\n",pdev.warpSize);
       printf("Max Thr/Blk  : %d\n",pdev.maxThreadsPerBlock);
       printf("Max Blk Size : %d %d %d\n",\
                       pdev.maxThreadsDim[0],pdev.maxThreadsDim[1],\
                       pdev.maxThreadsDim[2]);
       printf("Max Grid Size: %d %d %d\n",\
                       pdev.maxGridSize[0],pdev.maxGridSize[1],\
                       pdev.maxGridSize[2]);
    }
}

int main(int argc, char * argv[]) {

   cuda_device_init();
   return 0;
}

/**************************************************************
* Compile Script for hello-cuda on Narwhal
***************************************************************/
#!/bin/bash
#
. $MODULESHOME/init/bash
module swap PrgEnv-cray PrgEnv-nvidia
#
set -x
#
nvcc -o hello-cuda.exe hello-cuda.cu

9. Batch Scheduler Rosetta


User Commands

Job submission
  PBS:    qsub Script_File
  Slurm:  sbatch Script_File
  LSF:    bsub < Script_File

Job deletion
  PBS:    qdel Job_ID
  Slurm:  scancel Job_ID
  LSF:    bkill Job_ID

Job status (by job)
  PBS:    qstat Job_ID
  Slurm:  squeue Job_ID
  LSF:    bjobs Job_ID

Job status (by user)
  PBS:    qstat -u User_Name
  Slurm:  squeue -u User_Name
  LSF:    bjobs -u User_Name

Job hold
  PBS:    qhold Job_ID
  Slurm:  scontrol hold Job_ID
  LSF:    bstop Job_ID

Job release
  PBS:    qrls Job_ID
  Slurm:  scontrol release Job_ID
  LSF:    bresume Job_ID

Queue list
  PBS:    qstat -Q
  Slurm:  squeue
  LSF:    bqueues

Node list
  PBS:    pbsnodes -l
  Slurm:  sinfo -N or scontrol show nodes
  LSF:    bhosts

Cluster status
  PBS:    qstat -a
  Slurm:  sinfo
  LSF:    bqueues

GUI
  PBS:    xpbsmon
  Slurm:  sview
  LSF:    xlsf or xlsbatch

Environment Variables

Job ID
  PBS:    $PBS_JOBID
  Slurm:  $SLURM_JOBID
  LSF:    $LSB_JOBID

Submit directory
  PBS:    $PBS_O_WORKDIR
  Slurm:  $SLURM_SUBMIT_DIR
  LSF:    $LSB_SUBCWD

Submit host
  PBS:    $PBS_O_HOST
  Slurm:  $SLURM_SUBMIT_HOST
  LSF:    $LSB_SUB_HOST

Node list
  PBS:    $PBS_NODEFILE
  Slurm:  $SLURM_JOB_NODELIST
  LSF:    $LSB_HOSTS / $LSB_MCPU_HOST

Job array index
  PBS:    $PBS_ARRAYID
  Slurm:  $SLURM_ARRAY_TASK_ID
  LSF:    $LSB_JOBINDEX

Job Specification

Script directive
  PBS:    #PBS
  Slurm:  #SBATCH
  LSF:    #BSUB

Queue
  PBS:    -q Queue_Name
  Slurm:  ARL: -p Queue_Name; AFRL and Navy: -q Queue_Name
  LSF:    -q Queue_Name

Node count
  PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
          (N1 = node count, N2 = max cores per node, N3 = cores to use per node)
  Slurm:  -N min[-max]
  LSF:    -n CoreCount -R "span[ptile=CoresPerNode]"
          (NodeCount = CoreCount / CoresPerNode)

Core count
  PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
          (core count = N1 x N3)
  Slurm:  --ntasks=total_cores_in_run
  LSF:    -n Core_Count

Wall clock limit
  PBS:    -l walltime=hh:mm:ss
  Slurm:  -t min or -t days-hh:mm:ss
  LSF:    -W hh:mm

Standard output file
  PBS:    -o File_Name
  Slurm:  -o File_Name
  LSF:    -o File_Name

Standard error file
  PBS:    -e File_Name
  Slurm:  -e File_Name
  LSF:    -e File_Name

Combine stdout/err
  PBS:    -j oe (both to stdout) or -j eo (both to stderr)
  Slurm:  (use -o without -e)
  LSF:    (use -o without -e)

Copy environment
  PBS:    -V
  Slurm:  --export=ALL|NONE|Variable_List
  LSF:    (none listed)

Event notification
  PBS:    -m [a][b][e]
  Slurm:  --mail-type=[BEGIN],[END],[FAIL]
  LSF:    -B or -N

Email address
  PBS:    -M Email_Address
  Slurm:  --mail-user=Email_Address
  LSF:    -u Email_Address

Job name
  PBS:    -N Job_Name
  Slurm:  --job-name=Job_Name
  LSF:    -J Job_Name

Job restart
  PBS:    -r y|n
  Slurm:  --requeue or --no-requeue (note: configurable default)
  LSF:    -r

Working directory
  PBS:    no option; defaults to home directory
  Slurm:  --workdir=/Directory/Path
  LSF:    no option; defaults to submission directory

Resource sharing
  PBS:    -l place=scatter:excl
  Slurm:  --exclusive or --shared
  LSF:    -x

Account to charge
  PBS:    -A Project_ID
  Slurm:  --account=Project_ID
  LSF:    -P Project_ID

Tasks per node
  PBS:    -l select=N1:ncpus=N2:mpiprocs=N3
          (N1 = node count, N2 = max cores per node, N3 = cores to use per node)
  Slurm:  --tasks-per-node=count
  LSF:    (none listed)

Job dependency
  PBS:    -W depend=state:Job_ID[:Job_ID...][,state:Job_ID[:Job_ID...]]
  Slurm:  --depend=state:Job_ID
  LSF:    -w done|exit|finish

Job host preference
  PBS:    (none listed)
  Slurm:  --nodelist=nodes and/or --exclude=nodes
  LSF:    -m Node_List (e.g., "inf001" or inf[001-128]) or -m node_type
          (e.g., "inference", "training", or "visualization")

Job arrays
  PBS:    -J N-M[:step][%Max_Jobs]
  Slurm:  --array=N-M[:step]
  LSF:    -J "Array_Name[N-M[:step]][%Max_Jobs]" (note: brackets are literal)

Generic resources
  PBS:    -l other=Resource_Spec
  Slurm:  --gres=Resource_Spec
  LSF:    (none listed)

Licenses
  PBS:    -l app=number (e.g., -l abaqus=21; note: license resource allocation)
  Slurm:  -L app:number (e.g., -L abaqus:21)
  LSF:    -R "rusage[License_Spec]" (note: brackets are literal)

Begin time
  PBS:    -a [[[YYYY]MM]DD]hhmm[.ss] (note: no delimiters)
  Slurm:  --begin=YYYY-MM-DD[Thh:mm[:ss]]
  LSF:    -b [[YYYY:][MM:]DD:]hh:mm
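As a worked example of reading the Rosetta, here is one resource request (256 cores across 2 nodes for one hour) expressed in each scheduler's directive style. The Slurm and LSF forms are illustrative sketches built from the option names listed above:

```shell
# PBS (Narwhal):
#PBS -l select=2:ncpus=128:mpiprocs=128
#PBS -l walltime=01:00:00

# Slurm equivalent (sketch):
#SBATCH --ntasks=256
#SBATCH -t 01:00:00

# LSF equivalent (sketch):
#BSUB -n 256 -R "span[ptile=128]"
#BSUB -W 01:00
```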

10. Glossary

Batch-scheduled
: Users request compute nodes via commands to the batch scheduler software and wait in a queue until the requested nodes become available.

Batch Script
: A script that provides resource requirements and commands for the job.

Pinning
: Pinning threads (for shared-memory parallelism) or binding processes (for distributed-memory parallelism) is an advanced way to control how the system distributes threads and processes across the available cores.
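For OpenMP codes, a common starting point for pinning uses the standard OpenMP environment variables (portable OpenMP controls, not Narwhal-specific settings):

```shell
# Ask the OpenMP runtime to pin each thread to a core,
# placing consecutive threads on nearby cores
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```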