Narwhal PBS Guide
Table of Contents
- 1. Introduction
- 1.1. Document Scope
- 2. Resources and Queue Information
- 2.1. Resource Summary
- 2.2. Node Information
- 2.3. Queue Information
- 3. Anatomy of a Batch Script
- 3.1. Specify Your Shell
- 3.2. Required Scheduler Directives
- 3.3. The Execution Block
- 3.4. Requesting Specialized Nodes
- 3.5. Advanced Considerations
- 3.6. Advance Reservation Service Jobs
- 4. Submitting and Managing Your Job
- 4.1. Scheduler Dos and Don'ts
- 4.2. Job Management Commands
- 4.3. Job States
- 4.4. Baseline Configuration Common Commands
- 5. Optional Directives
- 5.1. Job Application Directive (Recommended)
- 5.2. Job Name Directive
- 5.3. Job Reporting Directives
- 5.4. Job Environment Directives
- 5.5. Job Dependency Directives
- 6. Job Arrays
- 7. Example Scripts
- 7.1. Simple Batch Script
- 7.2. Job Information Batch Script
- 7.3. OpenMP Script
- 7.4. Hybrid (MPI/OpenMP) Script
- 7.5. Accessing More Memory per Process
- 7.6. GPU Script
- 7.7. Data Transfer Script
- 7.8. Job Array Script
- 7.9. Large-Memory Node Script
- 8. Hello World Examples
- 8.1. C Program - hello.c
- 8.2. OpenMP - hello-OpenMP.c
- 8.3. Hybrid MPI/Open MP - hello-hybrid.c
- 8.4. Cuda - hello-cuda.cu
- 9. Batch Scheduler Rosetta
- 10. Glossary
1. Introduction
On large-scale computers, many users must share available resources. Because of this, you can't just log on to one of these systems, upload your programs, and start running them. Essentially, your programs must "get in line" and wait their turn, and there is more than one of these lines, or queues, from which to choose. Some queues have a higher priority than others (like the express checkout at the grocery store). The queues available to you are determined by the projects you are involved with.
To perform any task on the compute cluster, you must submit it as a "job" to a special piece of software called the scheduler or batch queueing system. At its most basic, a job is a single command run non-interactively, but any command (or series of commands) you want to run on the system is called a job.
Before you can submit your job to the scheduler, you must describe it, usually in the form of a batch script. The batch script specifies the computing resources needed, identifies an application to be run (along with its input data and environment variables), and describes how best to deliver the output data.
The process of using a scheduler to run the job is called batch job submission. When you submit a job, it is placed in a queue with jobs from other users. The scheduler then manages which jobs run, where, and when. Without the scheduler, users could overload the system, resulting in tremendous performance degradation for everyone. The queuing system runs your job as soon as it can do so while still honoring the following:
- Meeting your resource requests
- Not overloading the system
- Running higher priority jobs first
- Maximizing overall throughput
The process can be summarized as:
- Create a batch script.
- Submit a job.
- Monitor a job.
1.1. Document Scope
This document provides an overview and introduction to the use of the PBS batch scheduler on the HPE Cray EX (Narwhal) located at the Navy DSRC. The intent of this guide is to provide information to enable the average user to submit jobs on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:
- Use of the Linux operating system
- Use of an editor (e.g., vi or emacs)
- Remote use of computer systems via network
- A selected programming language and its related tools and libraries
We suggest you review the Narwhal User Guide before using this guide.
2. Resources and Queue Information
2.1. Resource Summary
When working on an HPC system you must specify the resources your job needs to run. This lets the scheduler find the right time and place to schedule your job. Strict adherence to resource requests allows PBS to find the best possible place for your jobs and ensures no user can use more resources than they've been given. You should always try to specify resource limits that are close to but greater than your requirements so your job can be scheduled more quickly. This is because PBS must wait until the requested resources are available before it can run your job. You cannot request more resources than are available on the system, and you cannot use more resources than you request. If you do, your job may be rejected, fail, or remain indefinitely in the queue.
Narwhal is a batch-scheduled HPC system with numerous nodes: users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available. All jobs that require large amounts of system resources must be submitted as a batch script, which is a script that provides resource requirements and commands for the job. As discussed in Section 3, scripts are used to submit a series of directives that define the resources required by your job. The most basic resources include time, nodes, and memory.
2.2. Node Information
Below is a summary of node types available on Narwhal. Refer to the Narwhal User Guide for in-depth information.
- Login nodes - Access points for submitting jobs on Narwhal. Login nodes are intended for basic tasks such as uploading data, managing files, compiling software, editing scripts, and checking on or managing your jobs. DO NOT run your computations on the login nodes.
- Compute nodes - Node types such as "Standard", "Large-Memory", and "GPU" are considered compute nodes. Compute nodes can include:
  - Standard nodes - The compute node type that is standard on Narwhal.
  - Large-Memory nodes - Large-memory nodes have more memory than standard nodes and are intended for jobs that require a large amount of memory.
  - Single-GPU Machine Learning Accelerator (Single-GPU MLA) nodes - MLA nodes are specialized GPU nodes intended for machine learning and other compute-intensive applications.
  - Dual-GPU Machine Learning Accelerator (Dual-GPU MLA) nodes - MLA nodes are specialized GPU nodes intended for machine learning and other compute-intensive applications.
  - Visualization nodes - Visualization nodes are GPU nodes intended for visualization applications.
A summary of the node configuration on Narwhal is presented in the following table.
Node Type | Login | Standard | Large-Memory | Visualization | Single-GPU MLA | Dual-GPU MLA |
---|---|---|---|---|---|---|
Total Nodes | 11 | 2,304 | 26 | 16 | 32 | 32 |
Processor | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome | AMD 7H12 Rome |
Processor Speed | 2.6 GHz | 2.6 GHz | 2.6 GHz | 2.6 GHz | 2.6 GHz | 2.6 GHz |
Sockets / Node | 2 | 2 | 2 | 2 | 2 | 2 |
Cores / Node | 128 | 128 | 128 | 128 | 128 | 128 |
Total CPU Cores | 1,408 | 294,912 | 3,328 | 2,048 | 4,096 | 4,096 |
Usable Memory / Node | 226 GB | 238 GB | 995 GB | 234 GB | 239 GB | 239 GB |
Accelerators / Node | None | None | None | 1 | 1 | 2 |
Accelerator | N/A | N/A | N/A | NVIDIA V100 PCIe 3 | NVIDIA V100 PCIe 3 | NVIDIA V100 PCIe 3 |
Memory / Accelerator | N/A | N/A | N/A | 32 GB | 32 GB | 32 GB |
Storage on Node | 880 GB SSD | None | 1.8 TB SSD | None | 880 GB SSD | 880 GB SSD |
Interconnect | HPE Slingshot | HPE Slingshot | HPE Slingshot | HPE Slingshot | HPE Slingshot | HPE Slingshot |
Operating System | SLES | SLES | SLES | SLES | SLES | SLES |
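The memory figures in the table become a per-process budget once you decide how many processes to run on each node. The following is a rough sizing sketch only: it divides the "Usable Memory / Node" value from the table by a chosen number of processes per node. The helper name mem_per_process is illustrative and is not part of PBS or the Narwhal software stack.

```shell
#!/bin/bash
# Rough sizing aid (illustrative helper, not a PBS command):
# usable memory per node divided by processes per node.
mem_per_process() {
    usable_gb=$1        # "Usable Memory / Node" from the table, in GB
    procs_per_node=$2   # processes you plan to run on each node
    awk -v m="$usable_gb" -v p="$procs_per_node" \
        'BEGIN { printf "%.2f GB\n", m / p }'
}

mem_per_process 238 128   # standard node, one process per core
mem_per_process 238 64    # standard node, half as many processes
mem_per_process 995 128   # large-memory node, one process per core
```

With the table values, this prints 1.86 GB, 3.72 GB, and 7.77 GB. Running fewer processes per node is one way to give each process more memory without changing node type.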
2.3. Queue Information
Queues are where your jobs run. Think of queues as a resource used to control how your job is placed on the available hardware. Queues address hardware considerations and define policies such as what type of jobs can run in the queues, how long your job can run, how much memory your job can use, etc. Every queue has its own limits, behavior, and default values.
On a first-come first-served basis, the scheduler checks whether the resources are available for the first job in the queue. If so, the job is executed without further delay. But if not, the scheduler goes through the rest of the queue to check whether another job can be executed without extending the waiting time of the first job in queue. If it finds such a job, the scheduler backfills the job. Backfill scheduling allows out-of-order jobs to use the reserved job slots if these jobs do not delay the start of another job. Therefore, smaller jobs (i.e., jobs needing only a few resources) usually encounter short queue times.
Your queue options are determined by your projects. Most users have access to the debug, standard, background, transfer, and HPC Interactive Environment (HIE) queues. Other queues exist, but access to these queues is restricted to projects that are granted special privileges due to urgency or importance, and they are not discussed here. To see the list of queues available on the system, use either the qstat -Q or show_queues command. Use the qstat -Qf queue command to get full details about a specific queue.
Standard Queue
As its name suggests, the standard queue is the
most common queue and should be used for normal day-to-day jobs.
Debug Queue
When determining why your job is failing, it is very
helpful to use the debug queue. It is restricted to user testing and debugging
jobs and has a maximum walltime of thirty minutes. Because of the resource
and time limits, jobs progress through the debug queue more quickly, so you
don't have to wait many hours to get results.
Background Queue
The background queue is a bit special. Although
it has the lowest priority, jobs in this queue are not charged against
your project allocation. You may choose to run in the background queue for
several reasons:
- You don't care how long it takes for your job to begin running.
- You are trying to conserve your allocation.
- You have used up your allocation.
Transfer Queue
The transfer queue exists to help users conserve allocation when transferring data to and from Narwhal from within batch scripts. It has a wall clock limit of 48 hours, and jobs run in this queue are not charged against your allocation. It supports all environment variables defined by BC policy FY05-04 (Environment Variables), including those referring to storage locations.
Users can submit batch scripts in this queue to move data between various storage areas, file systems, or other systems. The following storage areas are accessible from the transfer queue:
- $WORKDIR - Your temporary work directory on Narwhal
- $CENTER - Your directory on the Center Wide File System (CWFS)
- $ARCHIVE_HOME - Your directory on the mass storage archival system (MSAS) at the Navy DSRC
- $HOME - Your home directory
HPC Interactive Environment (HIE) Queue
The HIE is both a queue
configuration and a computing environment intended to deliver rapid response
and high availability to support the following services:
- Remote visualization
- Application development for GPU-accelerated applications
- Application development for other non-standard processors on a particular system
There is a very limited number of nodes available to the HIE queue, and they should be reserved for appropriate use cases. The use of the HIE queue for regular batch processing is considered abuse and is closely monitored. The HIE queue should not be used simply as a mechanism to give your regular batch jobs higher priority. Refer to the HIE User Guide for more information.
Priority Queues
The HPCMP has designated three restricted queues
that require special permission for job submission. If your project
is not authorized to submit jobs to these queues, your submission will fail.
These queues include:
- Urgent queue - For jobs belonging to DoD HPCMP Urgent Projects
- High queue - For jobs belonging to DoD HPCMP High Priority Projects
- Frontier queue - For jobs belonging to DoD HPCMP Frontier Projects
The following table describes the PBS queues available on Narwhal:
| Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Max Queued Per User | Max Running Per User | Description |
|---|---|---|---|---|---|---|
| Highest | urgent | 24 Hours | 16,384 | N/A | 100 | Jobs belonging to DoD HPCMP Urgent Projects |
| | frontier | 168 Hours | 65,536 | N/A | 100 | Jobs belonging to DoD HPCMP Frontier Projects |
| | high | 168 Hours | 32,768 | N/A | 100 | Jobs belonging to DoD HPCMP High Priority Projects |
| | debug | 30 Minutes | 8,192 | N/A | 4 | Time/resource-limited for user testing and debug purposes |
| | HIE | 24 Hours | 3,072 | N/A | 1 | Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide. |
| | viz | 24 Hours | 128 | N/A | 8 | Visualization jobs |
| | standard | 168 Hours | 32,768 | N/A | 100 | Standard jobs |
| | mla | 24 Hours | 128 | N/A | 8 | Machine Learning Accelerated jobs that require a GPU node; PBS assigns the next available smla (1-GPU) or dmla (2-GPU) node. |
| | smla | 24 Hours | 128 | N/A | 8 | Machine Learning Accelerated jobs that require an smla (Single-GPU MLA) node. |
| | dmla | 24 Hours | 128 | N/A | 8 | Machine Learning Accelerated jobs that require a dmla (Dual-GPU MLA) node. |
| | serial | 168 Hours | 1 | N/A | 26 | Single-core serial jobs. 1 core per hour charged to project allocation. |
| | bigmem | 96 Hours | 1,280 | N/A | 2 | Large-memory jobs |
| | transfer | 48 Hours | 1 | N/A | 10 | Data transfer for user jobs. Not charged against project allocation. See the Navy DSRC Archive Guide, section 5.2. |
| Lowest | background | 4 Hours | 1,024 | N/A | 10 | User jobs that are not charged against the project allocation |
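Before submitting, it can help to check a planned request against the limits in the queue table. The sketch below is one possible way to do that and is not a PBS feature: check_request is a hypothetical helper, and the limits are copied by hand from the table for only four queues, so they would need updating if the queue configuration changes.

```shell
#!/bin/bash
CORES_PER_NODE=128   # every node type on Narwhal has 128 cores

# check_request QUEUE NODES MINUTES
# Hypothetical helper (not a PBS command). Limits are copied from the
# queue table for a few queues only; walltime limits are in minutes.
check_request() {
    queue=$1; nodes=$2; minutes=$3
    case $queue in
        debug)      max_cores=8192;  max_minutes=30 ;;
        standard)   max_cores=32768; max_minutes=10080 ;;
        bigmem)     max_cores=1280;  max_minutes=5760 ;;
        background) max_cores=1024;  max_minutes=240 ;;
        *) echo "unknown queue: $queue"; return 2 ;;
    esac
    cores=$(( nodes * CORES_PER_NODE ))
    if [ "$cores" -gt "$max_cores" ]; then
        echo "$queue: $cores cores exceeds the limit of $max_cores"
        return 1
    fi
    if [ "$minutes" -gt "$max_minutes" ]; then
        echo "$queue: $minutes minutes exceeds the limit of $max_minutes"
        return 1
    fi
    echo "$queue: request fits ($cores cores, $minutes minutes)"
}

check_request debug 4 30
check_request background 16 60 || echo "reduce the node count or choose another queue"
```

The first call fits within the debug queue limits. The second asks for 16 nodes (2,048 cores) in the background queue, which exceeds its 1,024-core limit.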
3. Anatomy of a Batch Script
The PBS scheduler is currently running on Narwhal. It schedules jobs, manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch script. PBS can manage both single-processor and multiprocessor jobs. The appropriate module is automatically loaded for you when you log in. This section is a brief introduction to PBS. More advanced topics are discussed later in this document.
Batch Script Life Cycle
Let's start with what happens in the typical
life cycle of a batch script, where an application is run in a batch submission:
- The user submits a batch script, which is put into the queue.
- Once the resources are allocated, the scheduler executes the batch script on one node, and the script has access to the typical environment variables the scheduler defines.
- The executable command in the script is encountered and executed. If using a launch command, the launch command examines the scheduler environment variables to determine the node list in the allocation, as well as parameters, such as the number of total processes, and launches the required number of processes.
- Once the executing process(es) have terminated, the batch script moves to the next line of execution or terminates if there are no more lines.
Batch Script Anatomy
A batch script is a small text file created with a text editor such as vi or notepad. Although the specifics of batch scripts may differ slightly from system to system, a basic set of components is always required, and a few others are always good ideas.
The basic components of a simple batch script must appear in the following order:
- Specify Your Shell
- Scheduler Directives
  - Required Directives
  - Optional Directives
- The Execution Block
To simplify things, several template scripts are included in Section 7, where you can fill in required commands and resources.
Cautions About Special Characters
Some special characters are not handled well by schedulers. This is especially true of the following:
- ^M characters - Scripts created on an MS Windows system, which usually contain ^M characters, should be converted with dos2unix before use.
- Smart quotes - MS Word autocorrects normal straight single and double quotation marks into "smart quotes." Ensure your script only uses normal straight quotation marks.
- Em dash, en dash, and hyphens - MS Word often autocorrects regular hyphens into em dash or en dash characters. Ensure your script only uses normal hyphens.
- Tab characters - many editors insert tabs instead of spaces for various reasons. Ensure your script does not contain tabs.
3.1. Specify Your Shell
Your batch script is a shell script. So, it's good practice to specify which
shell your script is written in for execution. If you do not specify your shell
within the script, the scheduler uses your default login shell. To tell
the scheduler which shell to use, the first line of your script should be:
#!/bin/shell
where shell is either bash (Bourne-Again Shell), sh (Bourne Shell), ksh
(Korn shell), csh (C shell), tcsh (enhanced C shell), or zsh (Z shell).
3.2. Required Scheduler Directives
After specifying the script shell, the next section of the script sets the scheduler directives, which define your resource requests to the scheduler. These include how many nodes are needed, how many cores per node, what queue the job will run in, and how long these resources are required (walltime).
Directives are a special form of comment, beginning with #PBS. As you might suspect, the # character tells the shell to ignore the line, but the scheduler reads these lines and uses the directives to set various values. IMPORTANT!! All directives MUST come before the first line of executable code in your script, otherwise they are ignored.
The scheduler has numerous directives to assist you in setting up how your job will run on the system. Some directives are required. Others are optional. Required directives specify resources needed to run the application. If your script does not define these directives, your job will be rejected by the scheduler or use center-defined defaults. Caution: default values may not be in line with your job requirements and may vary by center. Optional directives are discussed in Section 5.
To schedule your job, the scheduler must know:
- The queue to run your job in.
- The maximum time needed for your job.
- The Project ID to charge for your job.
- The number of nodes you are requesting.
- The number of processes per node you are requesting.
- The number of cores per node.
3.2.1. Specifying the Queue
You must choose which queue you want your job to run in. Each queue has different
priorities and limits and may target different node types with different hardware
resources. To specify the queue, include the following directive:
#PBS -q queue_name
3.2.2. How Long to Run
Next, the scheduler needs the maximum time you expect your job to run. This is referred to as walltime, as in clock on the wall. The walltime helps the scheduler identify appropriate run windows for your job. For accounting purposes, your allocation is charged for how long your job actually runs, which is typically less than the requested walltime.
In estimating your walltime, there are three things to keep in mind.
- Your estimate is a limit. If your job hasn't completed within your estimate, it is terminated. So, you should always add a buffer to account for variability in run time because you don't want your job to be killed when it is 99.9% complete. And, if your job is terminated, your account is still charged for the time.
- Your estimate affects how long your job waits in the queue. In general, shorter jobs run before longer jobs. If you specify a time that is too long, your job will likely sit in the queue longer than it should.
- Each queue has a maximum time limit. You cannot request more time than the queue allows.
To specify your walltime, include the following directive:
#PBS -l walltime=HHH:MM:SS
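As a small illustration of adding a buffer to your estimate, the sketch below pads an estimated run time by a percentage and prints it in the HHH:MM:SS form. The helper walltime_with_buffer and the buffer percentages are assumptions made for this example; choose a buffer that reflects how much your own run times vary.

```shell
#!/bin/bash
# walltime_with_buffer MINUTES PERCENT
# Hypothetical helper: pads an estimated run time by PERCENT and prints it
# in the HHH:MM:SS form used by "#PBS -l walltime=".
walltime_with_buffer() {
    estimate_min=$1
    buffer_pct=$2
    # Integer arithmetic, rounded up to the next whole minute.
    total_min=$(( (estimate_min * (100 + buffer_pct) + 99) / 100 ))
    printf '%03d:%02d:00\n' $(( total_min / 60 )) $(( total_min % 60 ))
}

walltime_with_buffer 90 20     # 90-minute estimate with a 20% buffer
walltime_with_buffer 1440 10   # 24-hour estimate with a 10% buffer
```

These two calls print 001:48:00 and 026:24:00. Remember that the padded value must still fit within the maximum walltime of the queue you choose.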
3.2.3. Your Project ID
The scheduler needs to know which project ID to charge for your job. You can use the show_usage command to find the projects available to you and their associated project IDs. In the show_usage output, project IDs appear in the column labeled "Subproject."
Note: If you have access to multiple projects, remember the project you specify may limit your choice of queues.
To specify the project ID for your job, include the following directive:
#PBS -A Project_ID
3.2.4. Number of Nodes, Cores, and Processes
There are two types of computational resources: hardware (compute nodes and cores) and virtual (processes). A node is a computer system with a single operating system image, a unified memory space, and one or more cores. Every script must include directives for the node, core, and process selection. Nodes are allocated exclusively to your job and not shared with other users.
Important: Historically, PBS reserved CPUs via the ncpus directive. However, with the advent of multicore processors, the ncpus directive now actually selects the node type as identified by the number of cores. The ncpus directive must always be set to the number of physical cores on the targeted node type. For example, for standard nodes on Narwhal, ncpus=128. See the Node Configuration Table in Section 2.2 for core counts associated with all node types on Narwhal.
Before PBS can run your job, it needs to know how many nodes you want, how
many processes to run per node, and the total number of cores on each node.
In general, you would specify one process per core, but you might want fewer
processes depending on the programming model you are using. See
Example Scripts (below) for alternate use cases.
For simple cases, the number of nodes, cores per node, and processes per node
are specified using the directive:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node type, and N3 is the number of MPI
processes per node.
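To show how N1, N2, and N3 relate, the following sketch works out how many nodes are needed for a given total number of MPI processes and prints the matching select directive. The helper select_for_ranks is hypothetical, and the value 128 for ncpus assumes standard nodes as described in Section 2.2.

```shell
#!/bin/bash
CORES_PER_NODE=128   # ncpus for standard nodes on Narwhal (Section 2.2)

# select_for_ranks TOTAL_RANKS RANKS_PER_NODE
# Hypothetical helper: prints a select directive with enough nodes to hold
# TOTAL_RANKS MPI processes at RANKS_PER_NODE processes per node.
select_for_ranks() {
    total=$1
    per_node=$2
    nodes=$(( (total + per_node - 1) / per_node ))   # round up to whole nodes
    echo "#PBS -l select=${nodes}:ncpus=${CORES_PER_NODE}:mpiprocs=${per_node}"
}

select_for_ranks 512 128   # one process per core
select_for_ranks 100 64    # fewer processes per node
```

The first call prints a request for 4 nodes and the second for 2 nodes. In the second case the allocation has room for 128 processes even though only 100 are wanted, because nodes are allocated whole.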
3.2.5. SLB Directives
The Shared License Buffer (SLB) regulates shared license usage across all
HPC systems by granting and enforcing license reservations for certain commercial
software packages. If your job requires enterprise licenses controlled by
SLB, you must enter the software and requested number of licenses, using the
following directive:
#PBS -l software=number_of_licenses
For more information about SLB, please see the SLB User Guide.
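Putting the required directives from Sections 3.2.1 through 3.2.4 together, the top of a batch script might look like the sketch below. The queue, walltime, and node counts are illustrative values only, and Project_ID is a placeholder you must replace with your own project ID. The final lines are a stand-in for a real execution block.

```shell
#!/bin/bash
#PBS -q debug
#PBS -l walltime=000:10:00
#PBS -A Project_ID
#PBS -l select=2:ncpus=128:mpiprocs=128

# The execution block starts here. Any #PBS line below this point is ignored.
nodes=2
procs_per_node=128
echo "Requested $(( nodes * procs_per_node )) MPI processes in total"
```

Because the directives are comments to the shell, this file also runs as an ordinary shell script, which is a quick way to catch syntax errors before submitting it.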
3.3. The Execution Block
After the directives have been supplied, the execution block begins. The execution block is the section of your script containing the actual work to be done. This includes any modules to be loaded and commands to be executed. This could also include executing or sourcing other scripts.
3.3.1. Basic Execution Scheme
The following describes the most basic scheme for a batch script. PLEASE ADOPT THIS BASIC EXECUTION SCHEME IN YOUR OWN BATCH SCRIPTS.
Setup
- Set environment variables, load modules, create directories, transfer input files.
- Changing to the right directory - By default PBS runs your job in your home directory, which can cause problems. To avoid this, cd into your $WORKDIR directory to run it on the local high-speed disk.
Launching the executable
- Launch your executable using the launch command on Narwhal specific to your programming model.
Cleaning up
- Archive your results and remove temporary files and directories.
- Copy any necessary files to your home directory.
3.3.1.1. Setup
Using the batch script to set up your environment ensures your script runs in an automatic and consistent manner, but not all environment-setup tasks can be accomplished via scheduler directives, so you may have to set some environment variables yourself. Remember that commands to set up the environment must come after the scheduler directives. For MPI jobs, each MPI process is separate and inherits the environment set up by the batch script.
As part of the Baseline Configuration (BC) initiative, there is a common set of environment variables on all HPCMP allocated systems. These variables are predefined in your login, batch, and compute environments, making them automatically available at each center. We encourage you to use these variables in your scripts where possible. Doing so helps to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems within the HPCMP. Some BC environment variables are shown in the table below.
Variable | Description |
---|---|
$WORKDIR | Your work directory on the local temporary file system (i.e., local high-speed disk). $WORKDIR is visible to both the login and compute nodes and should be used for temporary storage of active data related to your batch jobs. |
$CENTER | Your directory on the Center-Wide File System (CWFS). |
$ARCHIVE_HOME | Your directory on the archival file system that serves a given compute platform. |
The complete list of BC environment variables is available in BC Policy FY05-04.
Setup considerations in customizing your batch job may include:
- Creating a directory in $WORKDIR for your job to run in.
NEWDIR=$WORKDIR/MyDir
mkdir -p $NEWDIR
- Changing to the directory from which the job will run.
cd $NEWDIR
- Copying required input files to the job directory.
cp From_directory/file $NEWDIR
- Ensuring required modules are loaded.
module load module_name
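The setup steps above can be combined into a short block at the start of the execution block. This is a minimal sketch under stated assumptions: the directory name MyDir is illustrative, INPUT is a hypothetical variable holding the path to an input file, and the fallback to /tmp exists only so the sketch can be tried outside the cluster, where $WORKDIR is not predefined.

```shell
#!/bin/bash
# Setup sketch. On Narwhal, $WORKDIR is predefined; the /tmp fallback is
# only here so the sketch also runs elsewhere.
WORKDIR=${WORKDIR:-/tmp}
JOBDIR="$WORKDIR/MyDir"          # directory name is illustrative

mkdir -p "$JOBDIR"               # create the job directory if needed
cd "$JOBDIR" || exit 1           # stop rather than run in the wrong place

# Copy an input file if one was named (INPUT is a hypothetical variable).
if [ -n "${INPUT:-}" ] && [ -f "$INPUT" ]; then
    cp "$INPUT" "$JOBDIR"/
fi

# Load required modules here; module_name is a placeholder.
# module load module_name

echo "Running in $(pwd)"
```

Quoting the directory variables protects against paths that contain spaces, and stopping when cd fails avoids writing output into an unintended directory.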
3.3.1.2. Launching an Executable
The command you'll use to launch a parallel executable within a batch script depends on the parallel library loaded at compile and execution time, the programming model, and the machine used to launch the application. It does not depend on the scheduler. Launch commands on Narwhal are discussed in detail in the Narwhal User Guide.
On Narwhal, the mpiexec command is used to launch a parallel
executable. The basic syntax for launching an MPI executable is:
mpiexec args executable pgmargs
where args are command-line arguments for mpiexec, executable is the name of an executable program, and pgmargs are command-line arguments for the executable.
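As an illustration of this syntax, the sketch below builds a launch line and falls back to a dry run when mpiexec or the program is not present. The names ./my_app and input.dat are placeholders. The -n option (total number of processes) is the form suggested by the MPI standard and is assumed here; consult the Narwhal User Guide for the options actually supported on Narwhal.

```shell
#!/bin/bash
# Launch sketch. ./my_app and input.dat are placeholders, and -n is assumed
# to set the total process count as suggested by the MPI standard.
nodes=2
procs_per_node=128
total_procs=$(( nodes * procs_per_node ))   # should match select x mpiprocs

launch_cmd="mpiexec -n $total_procs ./my_app input.dat"

if command -v mpiexec >/dev/null 2>&1 && [ -x ./my_app ]; then
    $launch_cmd                        # real launch inside a batch job
else
    echo "Would run: $launch_cmd"      # dry run when mpiexec or the program is missing
fi
```

Keeping the process count in one variable makes it easier to keep the launch line consistent with the select directive at the top of the script.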
3.3.1.3. Cleaning Up
You are responsible for cleaning up and monitoring your workspace. The clean-up process generally entails deleting unneeded files and transferring important data left in the job directory after the job is completed. It is important to remember that $WORKDIR is a "scratch" file system and is not backed up. Currently, $WORKDIR files older than 21 days are subject to being purged. If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you WILL NOT be notified prior to deletion. Similarly, files transferred to $CENTER are not backed up, and files older than 180 days are subject to being purged. To prevent automatic deletion by the purge scripts, important data should be archived. See the Navy DSRC Archive Guide for more information on archiving data.
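One way to keep ahead of the purge is to list files that are getting old before they reach the 21-day mark. The sketch below uses the standard find command. It assumes file age is judged by modification time, which may not match the actual purge criteria, and the helper name list_purge_candidates and the 14-day threshold are choices made for this example.

```shell
#!/bin/bash
# list_purge_candidates DIR DAYS
# Lists regular files under DIR last modified more than DAYS days ago
# (find's -mtime counts whole 24-hour periods).
# Assumption: age is judged by modification time; the purge may use
# different criteria.
list_purge_candidates() {
    find "$1" -type f -mtime +"$2" -print
}

# Files in $WORKDIR older than 14 days, about a week before the 21-day mark.
if [ -n "${WORKDIR:-}" ]; then
    list_purge_candidates "$WORKDIR" 14
fi
```

Files that appear in the list and still matter should be archived as described in the Navy DSRC Archive Guide.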
3.3.2. Advanced Execution Methods
A batch script is a text file containing directives and execution steps you "submit" to PBS. This script can be as simple as the basic execution scheme discussed above or include more complex customizations, such as compiling within the script or loading a file with a list of modules required for the executable. Below are additional considerations for the execution block of the batch script.
3.3.2.1. Environment Variables set by the Scheduler
In addition to environment variables inherited from your user environment (see Section 3.3.1.1), PBS sets other environment variables for batch jobs. The following table contains commonly used PBS environment variables.
Variable | Description |
---|---|
$PBS_JOBID | The job identifier assigned to a job or job array by the batch system |
$PBS_O_WORKDIR | The absolute path of the directory where the job was submitted |
$PBS_JOBDIR | The absolute path of the directory where the job runs |
$PBS_JOBNAME | The job name supplied by the user |
$PBS_QUEUE | The name of the queue in which the job executes |
$PBS_ENVIRONMENT | Indicates if the job is batch or interactive; the value is PBS_BATCH or PBS_INTERACTIVE |
$PBS_O_PATH | Copy of $PATH from your submission environment |
$PBS_O_HOST | Copy of $HOST from your submission environment |
$PBS_O_SHELL | Copy of $SHELL from your submission environment |
$PBS_NODEFILE | The name of a file containing a list of nodes assigned to the job |
$PBS_ARRAY_INDEX | The index number of a sub-job in a job array |
See the qsub man page for a complete list of environment variables set by PBS.
Baseline Configuration Policy (BC Policy FY05-04) defines an additional set of environment variables with related functionality available on all systems. These variables can also be found in the Narwhal User Guide.
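A common use of the PBS variables is to record basic job information at the start of the execution block. The following is a minimal sketch: the fallback values exist only so it can also be run outside a batch job, where the PBS variables are unset, and the assumption that repeated entries in the node file can be collapsed with sort -u to count distinct nodes should be checked against the file on your system.

```shell
#!/bin/bash
# Job-information sketch. The ${VAR:-default} fallbacks only exist so the
# sketch also runs outside a batch job, where PBS variables are unset.
job_id=${PBS_JOBID:-none}
job_name=${PBS_JOBNAME:-none}
queue=${PBS_QUEUE:-none}
submit_dir=${PBS_O_WORKDIR:-$PWD}

echo "Job ID:           $job_id"
echo "Job name:         $job_name"
echo "Queue:            $queue"
echo "Submitted from:   $submit_dir"

# Summarize the node file when running under PBS.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    echo "Node file lines:  $(wc -l < "$PBS_NODEFILE")"
    echo "Distinct nodes:   $(sort -u "$PBS_NODEFILE" | wc -l)"
fi
```

Writing these values to the job output makes it easier to match an output file to the job that produced it later.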
3.3.2.2. Loading Modules
Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so commands for applications can be found. For a full discussion on modules see the Narwhal User Guide and the Navy DSRC Modules Guide.
To ensure required modules are loaded at runtime, you can load them within
the batch script before the executable code by using the command:
module load module_name
3.3.2.3. Compiling on the Compute Nodes
You can compile on the compute nodes, either interactively or within a job script. On most systems this is the same as compiling on the login nodes, though in some cases there are differences between the login and compute nodes. See the Narwhal User Guide for more information.
3.3.2.4. Using the Transfer Queue
Before a job can run, the input data needs to be copied into a directory accessible by the job script. This can be done in a separate job script using the transfer queue. Because jobs in the transfer queue cost no allocation, the transfer queue is advantageous for large file transfers such as during data staging or cleanup to move data left in your $WORKDIR after your application completes.
When using the transfer queue, keep in mind:
- The transfer queue may have additional bandwidth for data transfers.
- Nodes in the transfer queue are shared with other users, so available compute and memory is likely lower.
- Your allocation is not charged when using the transfer queue.
See Example Scripts for an example for using the transfer queue.
3.4. Requesting Specialized Nodes
3.4.1. GPU Nodes
The graphics processing unit, or GPU, has become one of the most important types of computing technology. The GPU is made up of many synchronized cores working together for specialized tasks.
To request GPU-accelerated nodes, add the ngpus option to the
select directive, as follows:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=N4
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node type, N3 is the number of MPI
processes per node, and N4 is the number of GPUs required per node.
3.4.2. Visualization Nodes
Visualization nodes are GPU-accelerated nodes with specialized hardware or
software to support visualization tasks. Therefore, requesting a visualization
node is the same as requesting a single-GPU node. Simply add the
ngpus=1 option to the select directive, as follows, and submit
your job to the viz or HIE queue:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=1
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node type, N3 is the number of MPI
processes per node, and one GPU is requested.
3.4.3. Large-Memory Nodes
Large-Memory nodes are standard compute nodes with additional memory installed
to support applications that require larger amounts of memory. To request a
large-memory node, simply add the bigmem=1 option to the select directive, as follows:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3:bigmem=1
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node type, and N3 is the number of MPI
processes per node. Note that bigmem=1 is simply a flag to indicate that
large-memory nodes are being requested; it is not the number of nodes.
3.4.4. Single-GPU Machine Learning Accelerated (MLA) Nodes
Single-GPU MLA nodes are GPU-accelerated nodes with specialized hardware and software
to support machine-learning tasks. To request Single-GPU MLA nodes, simply add
the ngpus=1 option to the select directive, as follows, and submit
your job to a non-viz, non-HIE queue:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=1
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node, and N3 is the number of MPI processes
per node. Note that ngpus=1 is simply a flag to indicate that
Single-GPU MLA nodes are being requested; it is not the number of nodes.
3.4.5. Dual-GPU Machine Learning Accelerated (MLA) Nodes
Dual-GPU MLA nodes are GPU-accelerated nodes with specialized hardware and software
to support machine-learning tasks. To request Dual-GPU MLA nodes, simply add the
ngpus=2 option to the select directive, as follows:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3:ngpus=2
where N1 is the number of nodes you are requesting, N2 is the
number of cores on the targeted node, and N3 is the number of MPI processes
per node. Note that ngpus=2 is simply a flag to indicate Dual-GPU
MLA nodes are being requested; it is not the number of nodes.
3.5. Advanced Considerations
3.5.1. Heterogeneous Computing (Using Multiple Node Types) and Node Distribution
Heterogeneous computing refers to using more than one type of node, such as CPU, GPU, or large-memory nodes. By assigning different workloads to specialized nodes suited for diverse purposes, performance and energy efficiency can be vastly improved. Node distribution refers to assigning tasks to groups or chunks of nodes. This section discusses how to schedule different node types and organize groups of nodes (heterogeneous or homogeneous) so they can be assigned different tasks.
A chunk is a set of resources allocated as a unit to a job. All parts of a
chunk come from the same node. A chunk is often referred to interchangeably
as a node, but technically a chunk can be smaller (yet never larger) than a
node. The distribution of tasks across chunks can be refined using a form of
the PBS select statement as follows:
#PBS -l select=N1:ncpus=N2:mpiprocs=N3[+N4:ncpus=N5:mpiprocs=N6[+...]]
where N1 and N4 are the number of "chunks" you are requesting,
N2 and N5 are the number of cores on each chunk, and N3
and N6 are the number of MPI processes per chunk.
The following example selects one chunk of 128 cores and two chunks of 64 cores:
#PBS -l select=1:ncpus=128:mpiprocs=128+2:ncpus=64:mpiprocs=64
How the three chunks get placed onto nodes depends upon the placement setting,
which can be "pack" or "scatter." Pack fits the chunks on as few nodes as possible.
Scatter places each chunk on its own node, if possible. Because there are 128
cores per compute node, the following placement statement would "pack" all three
chunks on a total of two nodes because they will fit:
#PBS -l place=pack
The following placement statement would "scatter" one chunk per node for a
total of three nodes, with the last two nodes using half of the cores on each
node:
#PBS -l place=scatter
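To check what a multi-chunk select value adds up to, the sketch below totals the chunks, cores, and MPI processes in the example value used above. The function select_totals is a hypothetical helper written for this illustration; it only does string arithmetic and does not query PBS.

```shell
# Hypothetical helper: total the chunks, cores, and MPI processes
# described by a select value such as "1:ncpus=128+2:ncpus=64".
select_totals() (
    chunks=0; cores=0; ranks=0
    old_ifs=$IFS
    # Split the select value into chunk specifications on "+".
    IFS='+'
    set -- $1
    IFS=$old_ifs
    for spec in "$@"; do
        count=${spec%%:*}      # leading chunk count, e.g. "2"
        ncpus=0; mpiprocs=0
        # Split one chunk specification into its fields on ":".
        IFS=':'
        for field in $spec; do
            case $field in
                ncpus=*)    ncpus=${field#ncpus=} ;;
                mpiprocs=*) mpiprocs=${field#mpiprocs=} ;;
            esac
        done
        IFS=$old_ifs
        chunks=$((chunks + count))
        cores=$((cores + count * ncpus))
        ranks=$((ranks + count * mpiprocs))
    done
    echo "chunks=$chunks cores=$cores mpiprocs=$ranks"
)

select_totals "1:ncpus=128:mpiprocs=128+2:ncpus=64:mpiprocs=64"
# prints: chunks=3 cores=256 mpiprocs=256
```

The three chunks request 256 cores in total, which is why they can occupy as few as two 128-core nodes or as many as three.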
It is also possible to request heterogeneous resources as part of each chunk.
The following example requests two standard compute nodes, one large-memory
node, and one GPU node:
#PBS -l select=2:ncpus=128:mpiprocs=128+1:bigmem=1+1:ngpus=1
Select statements have numerous other options. See the qsub and pbs_resources man pages for more information.
3.5.2. Hyper-Threading
On Narwhal, hyper-threading is enabled by default. This allows users to
run two tasks or threads per core instead of just one. For example, users can
set mpiprocs to two times the cores per node (256) even though
there are only 128 physical cores on each node.
#PBS -l select=4:ncpus=128:mpiprocs=256
The number of nodes requested for the job is available in the environment variable, $BC_NODE_ALLOC.
To determine the number of cores requested for the job simply multiply $BC_NODE_ALLOC by two.
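The arithmetic behind the hyper-threaded select statement can be sketched as follows, assuming 128 physical cores per node and two tasks per core as described above. The variable names are illustrative only.

```shell
# Values taken from the example above.
nodes=4
cores_per_node=128

# With hyper-threading, each physical core can host two tasks.
mpiprocs=$(( 2 * cores_per_node ))

echo "#PBS -l select=$nodes:ncpus=$cores_per_node:mpiprocs=$mpiprocs"
echo "total MPI processes: $(( nodes * mpiprocs ))"
```

For the four-node example this gives 256 MPI processes per node and 1024 in total.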
3.6. Advance Reservation Service Jobs
The Advance Reservation Service (ARS) provides a web-based interface to batch schedulers on most allocated HPC resources in the HPCMP. This service allows allocated users to reserve resources for use at specific times and for specific durations. It works in tandem with selected schedulers to allow restricted access to those reserved resources.
For Advance Reservation Service (ARS) jobs, you must submit a reservation request. When your reservation is made, you receive a confirmation page and an email with the pertinent details of your reservation, including the ARS_ID. It is your responsibility to either use or cancel your reservation. Unless you cancel it, your allocation is charged for the full time on the reserved nodes whether you use them or not. For more information, such as how to cancel a reservation, see the ARS User Guide.
To use the reserved nodes, you must log onto the selected system and submit
a job specifying the ARS_ID as the queue, as follows:
#PBS -q ARS_ID
4. Submitting and Managing Your Job
Once your batch script is ready, submit it to the scheduler for execution,
and the scheduler will generate a job according to the parameters set in the
script. Submitting a batch script can be done with the qsub command:
qsub batch-script-name
Because batch scripts specify the resources for your job, you won't need
to specify any resources on the command line. However, you can override
or add any job parameter by providing the specific resource as a flag
directly on the qsub command line. Directives supplied in
this way override the same directives if they are already included in your
script. The syntax to supply directives on the command line is the same
as within a script except #PBS is not used. For
example:
qsub -l walltime=HHH:MM:SS batch-script-name
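The precedence rule can be modeled in a few lines of shell. This is only a sketch of the rule stated above, not of how qsub is implemented; effective_value is a hypothetical name and nothing is submitted.

```shell
# Model of the precedence rule: a value supplied on the qsub command
# line replaces the same directive in the batch script.
effective_value() {
    script_value=$1   # value from the "#PBS" line in the script
    cli_value=$2      # value from the command line, or empty
    echo "${cli_value:-$script_value}"
}

effective_value 00:10:00 02:00:00   # command-line value is used
effective_value 00:10:00 ""         # script value is used
```

The first call prints 02:00:00 and the second prints 00:10:00.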
4.1. Scheduler Dos and Don'ts
When submitting your job, it's important to keep in mind these general guidelines:
- Request only the resources you need.
- Be aware of limits. If you request more resources than the hardware can offer, the scheduler might not reject the job, and it may be stuck in the queue forever.
- Be aware of the available memory limit. In general, the available memory per core is (memory_per_node)/(cores_in_use_on_the_node).
- The scheduler might not support pinning, so you might want to do this manually.
- There may be per-user quotas on the system.
You should also keep in mind that Narwhal is a shared resource. Behavior that negatively impacts other users or stresses the system administrators is not desirable. Below are some suggestions to be followed for a happy HPC community.
- Submitting 1000 jobs to perform 1000 tasks is naïve and can overload the scheduler. If these tasks are serial, it also wastes your allocation hours across 1000 nodes. Job arrays are strongly encouraged; see Section 6.
- If you expect your job to run for several days, split it into smaller jobs. You'll get reduced queue time and increased stability (e.g., against node failure). You can either split your job manually and submit as separate jobs or submit your jobs sequentially within a single script as described in the Navy DSRC Archive Guide.
- Send your job to the right queue. It is important to understand in which queue the scheduler will run your job as most queues have core and walltime limits.
- Do not run compute-intensive tasks from a login node. Doing so slows the login nodes, causes login delays for other users, and may prompt administrators to terminate your tasks, often without notice.
4.2. Job Management Commands
Once you submit your job, there are commands available to check and manage your job submission. For example:
- Determining the status of your job
- Cancelling your job
- Putting your job on hold
- Releasing a job from hold
The table below contains commands for managing your jobs. Use man command or the command --help option to get more information about a command.
Command | Description |
---|---|
cqstat* | Display running and pending jobs, including estimated start times * |
pbsnodes | Display status of all PBS batch nodes. |
qdel job_id | Delete a job. |
qhold job_id | Place a job on hold. |
qrls job_id | Release a job from hold. |
qstat | Display the status of all jobs. |
qstat job_id | Check the status of a job. |
qstat -q | Display the status of all queues. |
qsub script_file | Submit a job. |
Warning: qstat -f can produce a significant amount of data since the output also contains the full list of job environment variables. Avoid using this with the -a flag or when performing simple monitoring of job states.
* Estimated start times with cqstat are only available for a small number of jobs and are subject to frequent change as new jobs are queued.
4.3. Job States
When checking the status of a job, the state of a job is listed. Jobs typically pass through several states during their execution. The main job cycle states are QUEUED/PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED. An explanation of each state follows.
State | Description |
---|---|
Q | The job is queued, eligible to run, or routed. |
R | The job is running. |
H | The job is held. |
E | The job is exiting after having run. |
W | The job is waiting for its execution time. Only if start time is declared. |
T | The job is being moved. |
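When monitoring jobs from a script, the one-letter codes in the table can be translated with a small lookup. The sketch below assumes the state letter has already been obtained (for example, read by eye from qstat output); describe_state is a hypothetical helper and does not call the scheduler.

```shell
# Hypothetical helper: translate a PBS job state letter into the
# description given in the table above.
describe_state() {
    case $1 in
        Q) echo "queued, eligible to run, or routed" ;;
        R) echo "running" ;;
        H) echo "held" ;;
        E) echo "exiting after having run" ;;
        W) echo "waiting for its execution time" ;;
        T) echo "being moved" ;;
        *) echo "unknown state: $1"; return 1 ;;
    esac
}

describe_state R
describe_state H
```

Any letter not in the table produces an "unknown state" message and a non-zero return code, so a calling script can detect it.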
4.4. Baseline Configuration Common Commands
The Baseline Configuration Team (BCT) has established the following set of common commands that are consistent across all systems. Most are custom and not inherent in the scheduler.
Command | Description |
---|---|
bcmodule | Executes like the standard module command but has numerous improvements and new features. |
check_license | Checks the status of HPCMP shared applications grouped into two distinct categories: Software License Buffer (SLB) applications and non-SLB applications. |
cqstat | Displays information about jobs in the batch queueing system. |
node_use | Displays memory-use and load-average information for all login nodes of the system on which it is executed. |
qflag | Collects information about the user and the user's jobs and sends a message about the reported problem without any need to leave the HPC system. |
qhist | Prints a full report on a running or completed batch job with an option to include chronological log file entries for the job from the batch queueing system. The command can also list all completed jobs for a given user over a specified number of days in the past. |
qpeek | Returns the standard output (stdout) and standard error (stderr) messages for any submitted batch job from the start of execution until the job run is complete. |
qview | Displays various reports about jobs in the batch queueing system. |
show_queues | Displays current batch queueing system information. |
show_storage | Produces two reports on quota and usage information. |
show_usage | Produces two reports on the allocation and usage status of each subproject under which a user may compute. |
5. Optional Directives
In addition to the required directives mentioned above, PBS has many other directives, but most users only use a few of them. Some of the more useful optional directives are summarized below.
5.1. Job Application Directive (Recommended)
The application directive allows you to identify the application being used by your job. This directive is used for HPCMP accountability and administrative purposes and helps the HPCMP accurately assess application usage and ensure adequate software licenses and appropriate software are purchased. While not required, using this directive is strongly encouraged as it provides valuable data to the HPCMP regarding application use.
To use this directive, add a line in the following form to your batch script:
#PBS -l application=application_name
Or, to your qsub command:
qsub -l application=application_name ...
A list of application names for use with this directive can be found in $SAMPLES_HOME/Application_Name/application_names on Narwhal.
5.2. Job Name Directive
The -N directive allows you to give your job a name that's
easier to remember than a numeric job ID. The PBS environment variable,
$PBS_JOBNAME, inherits this value and can be used instead of
the job ID to create job-specific output directories. To use this directive,
add the following to your batch script:
#PBS -N job_name
Or, to your qsub command:
qsub -N job_name ...
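A job-specific output directory based on $PBS_JOBNAME might be created in the execution block as sketched below. The fallback values are assumptions that only let the fragment run outside a PBS job; inside a job, PBS and the system set both variables.

```shell
# Outside a PBS job these variables are unset, so fall back to
# placeholder values (assumptions for illustration only).
PBS_JOBNAME=${PBS_JOBNAME:-jobName}
WORKDIR=${WORKDIR:-${TMPDIR:-/tmp}}

# Create a directory named after the job and run from inside it.
out_dir="$WORKDIR/$PBS_JOBNAME"
mkdir -p "$out_dir"
cd "$out_dir" || exit 1
echo "Output directory is $out_dir"
```

Because job names are not necessarily unique, two jobs with the same name would share this directory; appending $PBS_JOBID to the path is one way to avoid that.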
5.3. Job Reporting Directives
Job reporting directives allow you to control what happens to standard output and standard error messages generated by your script. They also allow you to specify email options to be executed at the beginning and end of your job. The following table and sections describe the job reporting directives:
Directive | Description |
---|---|
#PBS -o filename | Redirect standard output (stdout) to the named file. Appends to the file if it exists, otherwise creates the file. |
#PBS -e filename | Redirect standard error (stderr) to the named file. Appends to the file if it exists, otherwise creates the file. |
#PBS -j eo | Merge stderr and stdout into stderr. |
#PBS -j oe | Merge stderr and stdout into stdout. |
#PBS -M email_address | Set the email address(es) to be used for email alerts. |
#PBS -m b | Send email when the job begins. |
#PBS -m e | Send email when the job ends. |
#PBS -m be | Send email when the job begins and ends. |
5.3.1. Redirecting stderr and stdout
By default, messages written to stdout and stderr are captured for you in files
named x.ojob_id and x.ejob_id, respectively, where
x is either the name of the script or the name specified with the
-N directive or an assigned number, and job_id is the ID of the
job. If you want to change this behavior, the -o and
-e directives allow you to redirect stdout and stderr messages
to different named files. To combine stdout and stderr into a single file,
use the -o directive with the -j oe directive or use
the -e directive and the -j eo directive. For example,
#PBS -o filename.out
#PBS -j oe
5.3.2. Setting up Email Alerts
Many users want to be notified when their jobs begin and end. The
-m directive makes this possible. If you use this directive, you
also need to supply the -M directive with one or more email addresses
to be used. For example:
#PBS -m be
#PBS -M user@email.address[,user2@email.address]
5.4. Job Environment Directives
Job environment directives allow you to control the environment in which your script will operate. This section describes some useful directives for setting up the script environment.
Directive | Description |
---|---|
qsub -I | Request an interactive job. |
#PBS -V | Export all environment variables from your login environment to your batch environment. |
#PBS -v variable1, variable2 | Export specific environment variables from your login environment to your batch environment. |
#PBS -v variable=value, ... | Export specific environment variables with specific values to your batch environment. |
#PBS -l pmem=sizeMB or pmem=sizeGB | Memory size per process. |
5.4.1. Interactive Batch Shell
When you log into Narwhal, you will be running in an interactive shell on a login node. The login nodes provide login access for Narwhal and support such activities as compiling, editing, and general interactive use by all users. Please note the Navy DSRC Login Node Abuse policy.
The preferred method to run resource intensive interactive executions is to use an interactive batch session. An interactive batch session allows you to run interactively (in a command shell) on a compute node after waiting in the batch queue.
Note: Once an interactive session starts, it uses the entire requested block of CPU time and other resources unless you exit from it early, even if you don't use it. To avoid unnecessary charges to your project, don't forget to exit an interactive session once finished.
The -I directive allows you to request an interactive batch shell.
Within that shell, you can perform normal Linux commands, including launching
parallel jobs. To use -I, append it to your qsub
request. For example:
qsub your_pbs_options -X -I
The -X directive enables X-Windows access and may be omitted if your interactive job doesn't use a GUI.
Windows users: Please be aware the HPC Help Desk does not provide support for the use of X11 clients with the HPCMP Kerberos Kit for Windows.
Interactive batch sessions are scheduled just like normal batch jobs, so depending on how many other batch jobs are queued, it may take some time for your session to start. Once your interactive batch shell starts, you will be logged into the first compute node of those assigned to your job. At this point, you can run or debug interactive applications, execute job scripts, post-process data, etc. You can launch parallel applications on your assigned compute nodes by using an MPI or other parallel launch command.
The HPC Interactive Environment (HIE) provides an HIE queue specifically for interactive jobs. It offers longer job times and has nodes reserved only for HIE, so queue wait times are sometimes much shorter. However, HIE has limitations, such as only allowing the use of a single node at a time. See the HIE User Guide for more information before using the HIE queue.
5.4.2. Export Environment Variables
Batch jobs run with their own environment, separate from the login environment from which the batch job is launched. If your application is dependent on environment variables set in the login environment, you need to export these variables from the login environment to the batch environment.
The -V directive tells PBS to export all environment variables
from your login environment to your batch environment. To use this directive,
add the following line to your batch script:
#PBS -V
Or, add it to your qsub command, as follows:
qsub -V ...
For exporting specific environment variables from your login environment, use
the -v directive. To use this directive, add a line in the following
form to your batch script:
#PBS -v my_variable
Or, add it to your qsub command, as follows:
qsub -v my_variable ...
Using either of these methods, multiple comma-separated variables can be included.
It is also possible to set values for variables exported in this way, as follows:
qsub -v my_variable=my_value, ...
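When several variables are exported, the comma-separated list can be assembled in the shell rather than typed by hand. In this sketch, join_vars is a hypothetical helper and the variable names are placeholders; the qsub command is only printed, not executed.

```shell
# Hypothetical helper: join its arguments with commas, producing the
# list format expected after "-v".
join_vars() (
    IFS=,
    echo "$*"
)

# Mix of a pass-through variable and a variable with an explicit value.
var_list=$(join_vars MY_INPUT MY_MODE=fast)
echo "qsub -v $var_list batch-script-name"
# prints: qsub -v MY_INPUT,MY_MODE=fast batch-script-name
```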
5.4.3. Memory Size
The pmem=size directive specifies the maximum amount of physical memory
that any single process in the job may use. The size is given with a unit
suffix such as MB or GB. For example, if the job runs four processes and
each needs up to 2 GB of memory, the directive would read:
#PBS -l pmem=2GB
5.5. Job Dependency Directives
Directive | Description |
---|---|
after | Execute this job after listed jobs have begun. |
afterok | Execute this job after listed jobs have terminated without error. |
afternotok | Execute this job after listed jobs have terminated with an error. |
afterany | Execute this job after listed jobs have terminated for any reason. |
before | Listed jobs may be run after this job begins execution. |
beforeok | Listed jobs may be run after this job terminates without error. |
beforenotok | Listed jobs may be run after this job terminates with an error. |
beforeany | Listed jobs may be run after this job terminates for any reason. |
Job dependency directives allow you to specify dependencies your job may have
on other jobs. This allows you to control the order jobs run in. These directives
generally take the following form:
#PBS -W depend=dependency_expression
where dependency_expression is a comma-delimited list of one or more
dependencies, and each dependency is of the form:
type:jobids
where type is one of the dependency types listed in the table above, and jobids
is a colon-delimited list of one or more job IDs your job is dependent upon.
For example, to run a job after completion (success or failure) of job ID 1234:
#PBS -W depend=afterany:1234
To run a job after successful completion of job ID 1234:
#PBS -W depend=afterok:1234
For more information about job dependencies, see the qsub man page.
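The structure of a dependency expression (colon-delimited job IDs within a dependency, comma-delimited dependencies within the expression) can be sketched with two small functions. Both function names and the job IDs are hypothetical; the sketch only builds the directive text.

```shell
# Build one dependency of the form type:jobid[:jobid...].
dependency() (
    dep_type=$1
    shift
    IFS=:
    echo "$dep_type:$*"
)

# Join one or more dependencies with commas into a full directive.
depend_directive() (
    IFS=,
    echo "#PBS -W depend=$*"
)

depend_directive "$(dependency afterok 1234 1235)" "$(dependency afterany 1300)"
# prints: #PBS -W depend=afterok:1234:1235,afterany:1300
```

A job carrying this directive would wait for jobs 1234 and 1235 to finish without error and for job 1300 to finish for any reason.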
6. Job Arrays
Imagine you have several hundred jobs that are all identical except for two or three parameters whose values vary across a range of input values. Submitting all these jobs individually would not only be tedious but would also incur a lot of overhead, which would impose a significant strain on the scheduler, negatively impacting all scheduled jobs. This example is not an uncommon use case, and it is the reason why job arrays were invented.
Job arrays let you submit and manage collections of similar jobs quickly and easily within a single script, which can significantly relieve the strain on the queueing system. Resource directives are specified once at the top of a job array script and are applied to each array task. As a result, each task has the same initial options (e.g., size, wall time, etc.) but may have different input values.
If your use case includes 200 or more similar jobs that vary by only a few parameters, job arrays are highly recommended.
To implement a PBS job array in your job script, include the directives:
#PBS -r y
#PBS -J n-m[:step]
where n is the starting index, m is the ending index, and the
optional step is the increment. The -J directive declares the job as a job
array, and the -r y directive flags the job as rerunnable. PBS then
queues this script in FLOOR[(m-n)/step+1] instances,
each of which receives its index in the $PBS_ARRAY_INDEX environment
variable. You can use the command echo $PBS_ARRAY_INDEX to output
the unique index of a job instance. After submitting a job array job, the PBS
job ID appears with left and right brackets, [ ], in the output of
qstat. So, a job ID that would normally look like "384294" instead
looks like "384294[ ]".
Let's look at an explicit example using the PBS directive:
#PBS -J 1-999:2
PBS runs 500 instances (FLOOR((999-1)/2+1) = 500) of your script, each with a unique value of $PBS_ARRAY_INDEX taken from 1, 3, 5, 7, ..., 999.
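The instance count can be checked with shell integer arithmetic, as in the sketch below. The helper array_task_count and the input file pattern are hypothetical; the default index of 1 exists only so the fragment runs outside a PBS job.

```shell
# Number of array tasks for "#PBS -J n-m:step": FLOOR((m-n)/step + 1).
# Shell integer division truncates, which matches FLOOR when m >= n.
array_task_count() {
    echo $(( ($2 - $1) / $3 + 1 ))
}

array_task_count 1 999 2    # prints 500 (indices 1, 3, ..., 999)
array_task_count 1 10 4     # prints 3   (indices 1, 5, 9)

# Inside a running task, the index can select that task's input.
index=${PBS_ARRAY_INDEX:-1}
echo "This task would read input_${index}.dat"
```

Using the index to pick an input file or parameter set is what lets every task share one script while working on different data.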
7. Example Scripts
This section provides sample scripts you may copy and use. All scripts follow the anatomy presented in Section 3 and have been tested in their respective scheduler environment. When you use any of these examples, remember to substitute your own Project_ID, job name, output and error files, executable, and clean up. More advanced scripts can be found under the $SAMPLES_HOME directory on the system. Assorted flavors of Hello World are provided in Section 8. These simple programs can be used to test these scripts.
The following Baseline Configuration variables are used in the scripts below.
Variable | Description |
---|---|
$BC_CORES_PER_NODE | The number of cores per node for the compute node on which a job is running. |
$BC_MEM_PER_NODE | The approximate maximum memory per node available to an end user program (in integer MB) for the compute node type to which a job is being submitted. |
$BC_MPI_TASKS_ALLOC | Intended to be referenced from inside a job script, contains the number of MPI tasks/ranks allocated for a particular job. |
$BC_NODE_ALLOC | Intended to be referenced from inside a job script, contains the number of nodes allocated for a particular job. |
7.1. Simple Batch Script
The following is a very basic script to demonstrate requesting resources (including all required directives), setting up the environment, specifying the execution block (i.e., the commands to be executed), and cleaning up after your job completes. Save this as a regular text file using your editor of choice.
#!/bin/bash
##################################################################
# Description: This is a basic bash shell script for a simple job.
#              The job can be submitted to the standard queue
#              with the following command: "qsub basic.pbs"
#              Use the "show_usage" command to get your PROJECT_ID(s).
##################################################################
# REQUIRED DIRECTIVES ------------------------------------------
##################################################################
# Account to be charged
#PBS -A Project_ID
# Set max wall time to 10 minutes
#PBS -l walltime=00:10:00
# Run the job in the standard queue
#PBS -q standard
# Select 1 node, with all the cpus and 16 processes
#PBS -l select=1:ncpus=128:mpiprocs=16
##################################################################
# RECOMMENDED DIRECTIVES ---------------------------------------
##################################################################
#PBS -l application=other
##################################################################
# OPTIONAL DIRECTIVES ------------------------------------------
##################################################################
# Name the job
#PBS -N jobName
# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err
##################################################################
# EXECUTION BLOCK ----------------------------------------------
##################################################################
# Change to the default working directory
cd $WORKDIR
echo "Working directory is $WORKDIR"
echo "-----------------------"
echo "-- Executable Output --"
echo "-----------------------"
# Run the job
# Note: From the '#PBS -l select...' statement above
#       BC_MPI_TASKS_ALLOC = 1*16 = (select)*(mpiprocs)
# mpiexec -n 16 ./executable
# OR
mpiexec -n $BC_MPI_TASKS_ALLOC ./executable
##################################################################
# CLEAN UP -----------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)
# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.
7.2. Job Information Batch Script
The following examples can be included in the Execution block of any job script. The first example shows Baseline Configuration environment variables available on all HPCMP systems. The second example shows scheduler-specific variables.
#################################################################
# Job information set by Baseline Configuration variables
#################################################################
echo ----------------------------------------------------------
echo "Type of node                    " $BC_NODE_TYPE
echo "CPU cores per node              " $BC_CORES_PER_NODE
echo "CPU cores per standard node     " $BC_STANDARD_NODE_CORES
echo "CPU cores per accelerator node  " $BC_ACCELERATOR_NODE_CORES
echo "CPU cores per big memory node   " $BC_BIGMEM_NODE_CORES
echo "Hostname                        " $BC_HOST
echo "Maximum memory per node         " $BC_MEM_PER_NODE
echo "Number of tasks allocated       " $BC_MPI_TASKS_ALLOC
echo "Number of nodes allocated       " $BC_NODE_ALLOC
echo "Working directory               " $WORKDIR
echo ----------------------------------------------------------
#################################################################
# Output some useful job information.
#################################################################
echo "-------------------------------------------------------"
echo "User                     " $PBS_O_LOGNAME
echo "User home directory      " $PBS_O_HOME
echo "Job submission directory " $PBS_O_WORKDIR
echo "Submit host              " $PBS_O_HOST
echo "Job name                 " $PBS_JOBNAME
echo "Job identifier           " $PBS_JOBID
echo "Job Type                 " $PBS_ENVIRONMENT
echo "Working directory        " $WORKDIR
echo "Job execution directory  " $PBS_JOBDIR
echo "Job Originating queue    " $PBS_O_QUEUE
echo "Job execution queue      " $PBS_QUEUE
echo "-------------------------------------------------------"
7.3. OpenMP Script
To run a pure OpenMP job, specify the number of cores you want from the node (ncpus). Also specify the number of threads (ompthreads); otherwise, $OMP_NUM_THREADS defaults to the value of ncpus, possibly resulting in poor performance. Differences between the Simple Batch Script and this script are highlighted.
#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard
#### Use a single node
#PBS -l select=1:ncpus=128:mpiprocs=1
##################################################################
# OPTIONAL DIRECTIVES ------------------------------------------
##################################################################
#PBS -N jobname
# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err
##################################################################
# EXECUTION BLOCK ----------------------------------------------
##################################################################
# Change to the default working directory
cd $WORKDIR
echo "Working directory is $WORKDIR"
export OMP_NUM_THREADS=$BC_CORES_PER_NODE
# Run the job from the default working directory
./openMP_executable
##################################################################
# CLEAN UP -----------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)
# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.
7.4. Hybrid (MPI/OpenMP) Script
Hybrid MPI/OpenMP scripts are required for executables that use MPI between nodes and OpenMP within each node. The following script is an example of hybrid MPI and OpenMP. Differences between the Simple Batch Script and this script are highlighted.
#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard
#PBS -l select=2:ncpus=128:mpiprocs=1
##################################################################
# OPTIONAL DIRECTIVES ------------------------------------------
##################################################################
#PBS -N jobname
# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err
##################################################################
# EXECUTION BLOCK ----------------------------------------------
##################################################################
cd $WORKDIR
echo "working directory is ${WORKDIR}"
export OMP_WAIT_POLICY=PASSIVE
export OMP_NUM_THREADS=$BC_CORES_PER_NODE
#
# 2 nodes (256 cores), one MPI process per node
# 128 OpenMP threads per node (one per core)
# mpiexec -n 2 -d 128 ./hybrid_executable
# OR
mpiexec -n $BC_NODE_ALLOC -d $BC_CORES_PER_NODE ./hybrid_executable
#
##################################################################
# CLEAN UP -----------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)
# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.
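The script above uses one MPI process per node, so each process gets all 128 cores for its OpenMP threads. If more than one MPI process per node were used, a common choice (an assumption here, not something this guide prescribes) is to divide the cores evenly among the processes. The helper name threads_per_rank is hypothetical.

```shell
# Hypothetical helper: OpenMP threads per MPI process when the cores
# on a node are divided evenly among the MPI processes on that node.
threads_per_rank() {
    cores_per_node=$1
    ranks_per_node=$2
    echo $(( cores_per_node / ranks_per_node ))
}

threads_per_rank 128 1   # prints 128, the layout used in the script above
threads_per_rank 128 4   # prints 32
```

The result would be exported as OMP_NUM_THREADS, with mpiprocs in the select statement set to the matching number of MPI processes per node.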
7.5. Accessing More Memory per Process
By default, an MPI job runs one process per core, with all processes sharing
the available memory on the node. On Narwhal each compute node has 128 cores
and 238 GB of memory. Assuming one process per core, the memory per process
is:
memory per process = 238 GB/128
If you need more memory per process, then your job needs to run fewer MPI processes per node. This means number_of_processes_per_node < 128. For example, if you request 4 nodes and use only 16 of the 128 cores on each node, this results in a total of 4*16=64 MPI processes. Each of the 16 MPI processes per node will have access to approximately 238 GB/16 of memory.
The following script demonstrates this example by requesting 4 nodes and setting 16 processes per node. The job requests ten minutes of wall time in the standard queue. For more information, refer to the Samples section in the Narwhal User Guide. Note: Differences between the Simple Batch Script and this script are highlighted.
Another way to get more memory per process is to run on bigmem nodes, which is discussed in Section 7.9. However, because there are few bigmem nodes on the system, if you need many cores, bigmem nodes may not be an option.
#!/bin/bash
##################################################################
# REQUIRED DIRECTIVES ------------------------------------------
##################################################################
#PBS -A Project_ID
#PBS -l walltime=00:10:00
#PBS -q standard
# Starts 64 MPI processes; only 16 processes on each node
# This will result in each process having a memory size of 238 GB/16
#PBS -l select=4:ncpus=128:mpiprocs=16
##################################################################
# OPTIONAL DIRECTIVES ------------------------------------------
##################################################################
#PBS -N jobname
# Change stdout and stderr filenames
#PBS -o filename.out
#PBS -e filename.err
##################################################################
# EXECUTION BLOCK ----------------------------------------------
##################################################################
cd $WORKDIR
echo "working directory is ${WORKDIR}"
# Execute the application on 4 nodes using
# 16 processes on each node for a total of 64 MPI processes
mpiexec -n 64 ./executable
##################################################################
# CLEAN UP -----------------------------------------------------
##################################################################
# Remove temporary files and move data to non-scratch directory
# (Home or archive)
# See the "Archival In Compute Jobs" section (Section 4) of the
# Navy DSRC Archive Guide for a detailed example of performing
# archival operations within a job script.
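The memory arithmetic behind this script can be checked with a short calculation. The sketch assumes 1 GB = 1024 MB and treats the 238 GB per node as fully available to the job; mem_per_process_mb is a hypothetical helper.

```shell
# Hypothetical helper: approximate memory per MPI process in MB,
# given node memory in GB and the number of processes per node.
mem_per_process_mb() {
    node_mem_gb=$1
    procs_per_node=$2
    echo $(( node_mem_gb * 1024 / procs_per_node ))
}

mem_per_process_mb 238 128   # prints 1904  (one process per core)
mem_per_process_mb 238 16    # prints 15232 (16 processes per node)
```

Reducing the process count from 128 to 16 per node raises the memory available to each process from roughly 1.9 GB to roughly 14.9 GB.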
7.6. GPU Script
Here is a short example of a script for submitting jobs to a GPU node. Differences between the Simple Batch Script and this script are highlighted.
    #!/bin/bash
    ##################################################################
    # Script must be run in the nvidia environment:
    # module swap PrgEnv-cray PrgEnv-nvidia
    ##################################################################
    # REQUIRED DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -A Project_ID
    #PBS -l walltime=00:10:00
    #PBS -q standard
    #PBS -l select=2:ncpus=128:ngpus=1
    ##################################################################
    # OPTIONAL DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -N jobName
    # Change stdout and stderr filenames
    #PBS -o filename.out
    #PBS -e filename.err
    ##################################################################
    # EXECUTION BLOCK ----------------------------------------------
    ##################################################################
    cd $WORKDIR
    # Run the job from the default working directory
    ./GPU_executable
    ##################################################################
    # CLEAN UP -------------------------------------------------------
    ##################################################################
    # Remove temporary files and move data to non-scratch directory
    # (Home or archive)
    # See the "Archival In Compute Jobs" section (Section 4) of the
    # Navy DSRC Archive Guide for a detailed example of performing
    # archival operations within a job script.
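Before launching a GPU executable, a job can record which GPUs the node exposes, which makes the job's output file easier to interpret afterwards. The sketch below assumes the NVIDIA nvidia-smi utility is on the PATH of the GPU node, which this guide does not confirm; the function name gpu_summary is illustrative.

```shell
#!/bin/bash
# Print the GPUs visible on this node, if the NVIDIA tools are available.
# Assumption: nvidia-smi is installed on GPU nodes; gpu_summary is a made-up name.
gpu_summary() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L lists each GPU with its index and name
        nvidia-smi -L
    else
        echo "nvidia-smi not found on $(hostname)"
    fi
}

gpu_summary
```

Placing the call just before ./GPU_executable in the execution block writes the GPU list to the job's standard output file.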
7.7. Data Transfer Script
The transfer queue is a special-purpose queue for transferring or archiving files. It has access to $HOME, $ARCHIVE_HOME, $WORKDIR, and $CENTER. Jobs running in the transfer queue are charged for a single core against your allocation. Differences between the Simple Batch Script and this script are highlighted.
    #!/bin/bash
    ##################################################################
    # REQUIRED DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -A Project_ID
    #PBS -l walltime=00:10:00
    #PBS -q transfer
    #PBS -l select=1:ncpus=1
    ##################################################################
    # OPTIONAL DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -N jobName
    #PBS -o filename.out
    #PBS -e filename.err
    ##################################################################
    # EXECUTION BLOCK ------------------------------------------------
    ##################################################################
    # Change to work directory
    cd $WORKDIR
    # Assume all files to be transferred are in $WORKDIR/from_dir
    export FROM_DIR=$WORKDIR/from_dir
    # Assume all files are to be transferred to $ARCHIVE_HOME
    export TO_DIR=$ARCHIVE_HOME
    # Create a gzip-compressed tar file of $FROM_DIR to reduce data transfer time
    tar -czf $FROM_DIR.tar.gz -C $FROM_DIR .
    # If needed, uncomment to create a directory on the archive
    # archive mkdir -C $TO_DIR
    # Copy the tar file to the archive (put usage is assumed here; confirm
    # the exact syntax in the Navy DSRC Archive Guide)
    archive put -C $TO_DIR $FROM_DIR.tar.gz
    # List the archive directory contents to verify data transfer
    archive ls -al $TO_DIR
    echo "Transfer job ended"
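Because a transfer job usually runs unattended, it can be worth confirming that the tar file is readable before anything is removed from $WORKDIR. The following sketch packs a directory and then lists the resulting file to confirm it can be read back. The function name pack_and_verify is hypothetical, and the sketch deliberately leaves out the site-specific archive command.

```shell
#!/bin/bash
# Pack a directory into a gzip-compressed tar file and check that it reads back.
# pack_and_verify is an illustrative helper, not part of any Narwhal tooling.
pack_and_verify() {
    local src_dir=$1     # directory to pack
    local tarball=$2     # output file; keep it outside src_dir

    # -C changes into src_dir first, so the tar file holds relative paths
    tar -czf "$tarball" -C "$src_dir" . || return 1

    # Listing the tar file forces it to be read from start to finish
    local listing
    listing=$(tar -tzf "$tarball") || return 1

    local count
    count=$(printf '%s\n' "$listing" | wc -l | tr -d ' ')
    echo "Packed ${count} entries into ${tarball}"
}
```

In a transfer job, this could be called as pack_and_verify $WORKDIR/from_dir $WORKDIR/from_dir.tar.gz, with the archive step running only if the function returns successfully.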
7.8. Job Array Script
As discussed in Section 6, job arrays allow you to leverage a scheduler's ability to create multiple jobs from one script. Situations where this is useful include:
- Establishing a list of commands to run and having a job created from each command in the list.
- Running many parameters against one set of data or analysis program.
- Running the same program multiple times with different sets of data.
Creating directories and output files that are unique to each job array's task is essential when using job arrays. This is shown in the script below. Use qsub -r y job_script to submit the job to the PBS scheduler. The -r y flag marks the job as rerunnable. After submitting the job array script, use qstat -sw jobArrayID[] (e.g., qstat -sw 468028[]) to show the status of the queued job array.
    #!/bin/bash
    ##################################################################
    # REQUIRED DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -A Project_ID
    #PBS -l walltime=00:20:00
    #PBS -q standard
    #PBS -l select=1:ncpus=128:mpiprocs=128
    # Set up a job array from 1 to 4 in steps of 1.
    #PBS -r y
    #PBS -J 1-4:1
    ##################################################################
    # EXECUTION BLOCK ----------------------------------------------
    ##################################################################
    cd $WORKDIR
    # Strip the "[index]" suffix from the job ID (no spaces around "=")
    JA_ID=$(echo $PBS_JOBID | cut -d'[' -f1)
    JA_DIR=$WORKDIR/Job_Array.o${JA_ID}
    # Output Job ID and Job array index information
    echo "PBS Job ID PBS_JOBID is $PBS_JOBID"
    echo "PBS job array index PBS_ARRAY_INDEX value is $PBS_ARRAY_INDEX"
    #
    # Make a directory for the job array; -p avoids an error when
    # another task has already created it
    mkdir -p $JA_DIR
    #
    # Change into the job array directory to run each task
    cd $JA_DIR
    #
    # Retrieve the job's binary
    cp $WORKDIR/executable $JA_DIR/executable_$PBS_ARRAY_INDEX
    #
    # Run job and redirect output to a file unique to this task
    export outfile=$JA_DIR/${JA_ID}_${PBS_ARRAY_INDEX}
    mpiexec -n 128 ./executable_$PBS_ARRAY_INDEX &> $outfile
    ##################################################################
    # CLEAN UP -------------------------------------------------------
    ##################################################################
    # Remove temporary files and move data to non-scratch directory
    # (Home or archive)
    # See the "Archival In Compute Jobs" section (Section 4) of the
    # Navy DSRC Archive Guide for a detailed example of performing
    # archival operations within a job script.
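The first use case listed above, creating one task per entry in a list of commands, can be sketched as follows. Each task uses its array index to pick one line from a command file. The file name commands.txt and the helper name task_command are hypothetical; PBS_ARRAY_INDEX is set by PBS inside an array task, and the fallback value of 1 exists only so the sketch can be tried outside a job.

```shell
#!/bin/bash
# Print line number <index> of a command list file.
# task_command and commands.txt are illustrative names only.
task_command() {
    local list_file=$1
    local index=$2
    sed -n "${index}p" "$list_file"
}

# Inside an array task PBS sets PBS_ARRAY_INDEX; default to 1 elsewhere.
index=${PBS_ARRAY_INDEX:-1}

if [ -f commands.txt ]; then
    cmd=$(task_command commands.txt "$index")
    echo "Task ${index} would run: ${cmd}"
    # To execute the selected command instead of printing it:
    # bash -c "$cmd" > "task_${index}.out" 2>&1
fi
```

With #PBS -J 1-4:1, commands.txt would need at least four lines, one per task.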
7.9. Large-Memory Node Script
The standard compute nodes on Narwhal contain 238 GB of RAM and 128 cores. That works out to 1.86 GB/core. This is fine for most jobs running on the system. However, some jobs require more memory per core. To accommodate these jobs, Narwhal has 26 large-memory nodes with 995 GB of memory. You can allocate a job on the large-memory nodes by submitting a large-memory job script. Differences between the Simple Batch Script and this script are highlighted.
    #!/bin/bash
    ##################################################################
    # REQUIRED DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -A Project_ID
    #PBS -l walltime=00:10:00
    #PBS -q standard
    # Replace num_processes with the number of MPI processes to run
    #PBS -l select=1:ncpus=128:mpiprocs=num_processes:bigmem=1
    ##################################################################
    # OPTIONAL DIRECTIVES ------------------------------------------
    ##################################################################
    #PBS -N jobName
    # Change stdout and stderr filenames
    #PBS -o filename.out
    #PBS -e filename.err
    ##################################################################
    # EXECUTION BLOCK ----------------------------------------------
    ##################################################################
    cd $WORKDIR
    echo "working directory is ${WORKDIR}"
    mpiexec -n num_processes ./executable
    #
    ##################################################################
    # CLEAN UP -------------------------------------------------------
    ##################################################################
    # Remove temporary files and move data to non-scratch directory
    # (Home or archive)
    # See the "Archival In Compute Jobs" section (Section 4) of the
    # Navy DSRC Archive Guide for a detailed example of performing
    # archival operations within a job script.
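Whether a job fits on standard nodes or needs the large-memory nodes comes down to the memory each process needs multiplied by the number of processes per node. The sketch below makes that comparison using the 238 GB and 995 GB figures from this section. The function name pick_node_type is hypothetical, and whole-GB integer arithmetic is a simplifying assumption.

```shell
#!/bin/bash
# Suggest a node type from per-process memory (GB) and processes per node.
# Thresholds are the per-node memory sizes quoted in this guide.
pick_node_type() {
    local gb_per_proc=$1
    local procs_per_node=$2
    local needed=$(( gb_per_proc * procs_per_node ))

    if [ "$needed" -le 238 ]; then
        echo "standard"
    elif [ "$needed" -le 995 ]; then
        echo "bigmem"
    else
        echo "reduce processes per node"
    fi
}

pick_node_type 10 16     # 160 GB needed per node
pick_node_type 10 64     # 640 GB needed per node
pick_node_type 10 128    # 1280 GB needed per node
```

The three calls print standard, bigmem, and reduce processes per node, in that order.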
8. Hello World Examples
This section provides code for several variants of the basic hello.c program. Refer to the Narwhal User Guide for information about compiling.
8.1. C Program - hello.c
    /**************************************************************
     * A simple program to demonstrate an MPI executable
     ***************************************************************/
    #include <mpi.h>
    #include <stdio.h>

    int rank;
    int numRanks;
    char processorName[MPI_MAX_PROCESSOR_NAME];
    int nameLen;

    int main(int argc, char** argv) {
        // Initialize the MPI environment
        MPI_Init(NULL, NULL);
        // Get the number of ranks
        MPI_Comm_size(MPI_COMM_WORLD, &numRanks);
        // Get the rank of this process
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        // Get the name of the processor
        MPI_Get_processor_name(processorName, &nameLen);
        // Print messages from each processor
        printf("Hello from processor %s - ", processorName);
        printf("I am rank %d out of %d ranks\n", rank, numRanks);
        // Finalize the MPI environment
        MPI_Finalize();
        return 0;
    } // end main
8.2. OpenMP - hello-OpenMP.c
    /*************************************************************
     * A simple program to demonstrate a pure OpenMP executable
     ***************************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>        // needed for OpenMP
    #include <unistd.h>     // only needed for definition of gethostname
    #include <sys/param.h>  // only needed for definition of MAXHOSTNAMELEN

    int main (int argc, char *argv[]) {
        int th_id, nthreads;
        char foo[] = "Hello";
        char bar[] = "World";
        char hostname[MAXHOSTNAMELEN];
        gethostname(hostname, MAXHOSTNAMELEN);
        #pragma omp parallel private(th_id)
        {
            th_id = omp_get_thread_num();
            printf("%s %s from thread %d on %s!\n", foo, bar, th_id, hostname);
            // Wait until every thread has printed its greeting
            #pragma omp barrier
            if ( th_id == 0 ) {
                nthreads = omp_get_num_threads();
                printf("There were %d threads on %s!\n", nthreads, hostname);
            }
        }
        return EXIT_SUCCESS;
    }
8.3. Hybrid MPI/Open MP - hello-hybrid.c
    /*****************************************************************
     * A simple program to demonstrate a Hybrid MPI/OpenMP executable
     *****************************************************************/
    #include <stdio.h>
    #include <omp.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int numprocs, rank, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int iam = 0, np = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(processor_name, &namelen);
        #pragma omp parallel default(shared) private(iam, np)
        {
            np = omp_get_num_threads();
            iam = omp_get_thread_num();
            // A single printf keeps each thread's message on one line
            printf("Hello from thread %d of %d from process %d out of %d on %s\n",
                   iam, np, rank, numprocs, processor_name);
        }
        MPI_Finalize();
        return 0;
    }
8.4. Cuda - hello-cuda.cu
    /*****************************************************************
     * A simple program to demonstrate a CUDA/GPU executable
     * May require a module swap:
     *     module swap PrgEnv-cray PrgEnv-nvidia
     * Check User manual for compiling GPU code
     ******************************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>

    void cuda_device_init(void)
    {
        int ndev;
        cudaGetDeviceCount(&ndev);
        cudaDeviceSynchronize();
        if (ndev == 1)
            printf("There is %d GPU.\n", ndev);
        else
            printf("There are %d GPUs\n", ndev);
        for (int i = 0; i < ndev; i++) {
            cudaDeviceProp pdev;
            cudaGetDeviceProperties(&pdev, i);
            cudaDeviceSynchronize();
            printf("Hello from GPU %d\n", i);
            printf("GPU type : %s\n", pdev.name);
            // The memory sizes are size_t values, so they are printed with %zu
            printf("Memory Global: %zu MB\n",
                   (pdev.totalGlobalMem + 1024*1024)/1024/1024);
            printf("Memory Const : %zu KB\n", pdev.totalConstMem/1024);
            printf("Memory Shared: %zu KB\n", pdev.sharedMemPerBlock/1024);
            printf("Clock Rate : %.3f GHz\n", pdev.clockRate/1000000.0);
            printf("Number of Processors : %d\n", pdev.multiProcessorCount);
            // Assumes 8 cores per multiprocessor, which holds only for
            // early GPU architectures; newer GPUs have more
            printf("Number of Cores : %d\n", 8*pdev.multiProcessorCount);
            printf("Warp Size : %d\n", pdev.warpSize);
            printf("Max Thr/Blk : %d\n", pdev.maxThreadsPerBlock);
            printf("Max Blk Size : %d %d %d\n",
                   pdev.maxThreadsDim[0], pdev.maxThreadsDim[1],
                   pdev.maxThreadsDim[2]);
            printf("Max Grid Size: %d %d %d\n",
                   pdev.maxGridSize[0], pdev.maxGridSize[1],
                   pdev.maxGridSize[2]);
        }
    }

    int main(int argc, char * argv[])
    {
        cuda_device_init();
        return 0;
    }
Compile script for hello-cuda on Narwhal:
    #!/bin/bash
    #
    . $MODULESHOME/init/bash
    module swap PrgEnv-cray PrgEnv-nvidia
    #
    set -x
    #
    nvcc -o hello-cuda.exe hello-cuda.cu
9. Batch Scheduler Rosetta
User Commands | PBS | Slurm | LSF |
---|---|---|---|
Job Submission | qsub Script_File | sbatch Script_File | bsub < Script_File |
Job Deletion | qdel Job_ID | scancel Job_ID | bkill Job_ID |
Job status (by job) | qstat Job_ID | squeue -j Job_ID | bjobs Job_ID |
Job status (by user) | qstat -u User_Name | squeue -u User_Name | bjobs -u User_Name |
Job hold | qhold Job_ID | scontrol hold Job_ID | bstop Job_ID |
Job release | qrls Job_ID | scontrol release Job_ID | bresume Job_ID |
Queue list | qstat -Q | squeue | bqueues |
Node list | pbsnodes -l | sinfo -N OR scontrol show nodes | bhosts |
Cluster status | qstat -a | sinfo | bqueues |
GUI | xpbsmon | sview | xlsf OR xlsbatch |
Environment | PBS | Slurm | LSF |
Job ID | $PBS_JOBID | $SLURM_JOBID | $LSB_JOBID |
Submit Directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR | $LSB_SUBCWD |
Submit Host | $PBS_O_HOST | $SLURM_SUBMIT_HOST | $LSB_SUB_HOST |
Node List | $PBS_NODEFILE | $SLURM_JOB_NODELIST | $LSB_HOSTS OR $LSB_MCPU_HOSTS |
Job Array Index | $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID | $LSB_JOBINDEX |
Job Specification | PBS | Slurm | LSF |
Script Directive | #PBS | #SBATCH | #BSUB |
Queue | -q Queue_Name | ARL: -p Queue_Name; AFRL and Navy: -q Queue_Name | -q Queue_Name |
Node Count | -l select=N1:ncpus=N2:mpiprocs=N3 (N1 = node count, N2 = max cores per node, N3 = cores to use per node) | -N min[-max] | -n CoreCount -R "span[ptile=CoresPerNode]" (NodeCount = CoreCount / CoresPerNode) |
Core Count | -l select=N1:ncpus=N2:mpiprocs=N3 (N1 = node count, N2 = max cores per node, N3 = cores to use per node; Core Count = N1 x N3) | --ntasks=total_cores_in_run | -n Core_Count |
Wall Clock Limit | -l walltime=hh:mm:ss | -t min OR -t days-hh:mm:ss | -W hh:mm |
Standard Output File | -o File_Name | -o File_Name | -o File_Name |
Standard Error File | -e File_Name | -e File_Name | -e File_Name |
Combine stdout/err | -j oe (both to stdout) OR -j eo (both to stderr) | (use -o without -e) | (use -o without -e) |
Copy Environment | -V | --export=ALL\|NONE\|Variable_List | |
Event Notification | -m [a][b][e] | --mail-type=[BEGIN],[END],[FAIL] | -B or -N |
Email Address | -M Email_Address | --mail-user=Email_Address | -u Email_Address |
Job Name | -N Job_Name | --job-name=Job_Name | -J Job_Name |
Job Restart | -r y\|n | --requeue OR --no-requeue (NOTE: configurable default) | -r |
Working Directory | No option – defaults to home directory | --workdir=/Directory/Path | No option – defaults to submission directory |
Resource Sharing | -l place=scatter:excl | --exclusive OR --shared | -x |
Account to charge | -A Project_ID | --account=Project_ID | -P Project_ID |
Tasks per Node | -l select=N1:ncpus=N2:mpiprocs=N3 (N1 = node count, N2 = max cores per node, N3 = cores to use per node) | --tasks-per-node=count | |
Job Dependency | -W depend=state:Job_ID[:Job_ID...][,state:Job_ID[:Job_ID...]] | --depend=state:Job_ID | -w done\|exit\|finish |
Job host preference | | --nodelist=nodes AND/OR --exclude=nodes | -m Node_List (e.g., "inf001" or inf[001-128]) OR -m node_type (e.g., "inference", "training", or "visualization") |
Job Arrays | -J N-M[:step][%Max_Jobs] | --array=N-M[:step] | -J "Array_Name[N-M[:step]][%Max_Jobs]" (Note: the brackets around N-M[:step] are literal) |
Generic Resources | -l other=Resource_Spec | --gres=Resource_Spec | |
Licenses | -l app=number Example: -l abaqus=21 (Note: license resource allocation) | -L app:number Example: -L abaqus:21 | -R "rusage[License_Spec]" (Note: brackets are literal) |
Begin Time | -a [[[YYYY]MM]DD]hhmm[.ss] (Note: no delimiters) | --begin=YYYY-MM-DD[Thh:mm[:ss]] | -b [[YYYY:][MM:]DD:]hh:mm |
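As an illustration of how the table can be applied, the sketch below rewrites a handful of PBS directives into the Slurm equivalents listed above. It covers only the account, job name, walltime, and output file rows; the function name pbs_to_slurm is hypothetical, and it relies on Slurm's -t also accepting the hh:mm:ss form. A real conversion would also have to handle resource requests such as select, which do not map one-to-one.

```shell
#!/bin/bash
# Translate a few PBS directives to Slurm, reading a script on stdin.
# pbs_to_slurm is an illustrative name; lines it does not recognize pass through unchanged.
pbs_to_slurm() {
    sed -E \
        -e 's/^#PBS[[:space:]]+-A[[:space:]]+(.+)$/#SBATCH --account=\1/' \
        -e 's/^#PBS[[:space:]]+-N[[:space:]]+(.+)$/#SBATCH --job-name=\1/' \
        -e 's/^#PBS[[:space:]]+-l[[:space:]]+walltime=(.+)$/#SBATCH -t \1/' \
        -e 's/^#PBS[[:space:]]+-o[[:space:]]+(.+)$/#SBATCH -o \1/' \
        -e 's/^#PBS[[:space:]]+-e[[:space:]]+(.+)$/#SBATCH -e \1/'
}

printf '%s\n' \
    '#PBS -A Project_ID' \
    '#PBS -N jobName' \
    '#PBS -l walltime=00:10:00' \
    '#PBS -l select=1:ncpus=128' | pbs_to_slurm
```

The select line is printed unchanged, which serves as a reminder that it still needs manual attention.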
10. Glossary
- Batch-scheduled :
- Users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available.
- Batch Script :
- A script that provides resource requirements and commands for the job.
- Pinning :
- Pinning threads for shared-memory parallelism or binding processes for distributed-memory parallelism is an advanced way to control how your system distributes the threads or processes across the available cores.