Warhawk User Guide

1. Introduction

1.1. Document Scope and Assumptions

This document provides an overview and introduction to the use of the HPE Cray EX (Warhawk) located at the AFRL DSRC, along with a description of the specific computing environment on the system. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:

  • Use of the Linux operating system
  • Use of an editor (e.g., vi or emacs)
  • Remote use of computer systems via network
  • A selected programming language and its related tools and libraries

1.2. DSRC Policies

All policies are discussed in the Policies Section of the AFRL DSRC Introductory Site Guide. All users running at the AFRL DSRC are expected to know, understand, and follow the policies discussed. If you have any questions about the AFRL DSRC's policies, please contact the HPC Help Desk.

1.3. Obtaining an Account

To begin the account application process, visit the Obtaining an Account page and follow the instructions presented there. An HPC Help Desk video is available to guide you through the process.

1.4. Training

Training on a number of topics in this User Guide is available at the PET Knowledge Management Learning System. New account holders should strongly consider attending HPCMP New Account Orientation, which is provided via live webcast every month and available as an on-demand video.

1.5. Requesting Assistance

The HPC Help Desk is available to assist users with unclassified problems, issues, or questions. Technicians are on duty 8:00 a.m. to 8:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).

For more information about requesting assistance, see the HPC Help Desk dropdown.

2. System Configuration

2.1. System Summary

Warhawk is an HPE Cray EX. It has seven login nodes and four types of compute nodes for job execution. Warhawk uses Cray Slingshot in a Dragonfly configuration as its high-speed interconnect for MPI messages and I/O traffic. Warhawk uses Lustre to manage its parallel file system.

Node Configuration
Login Standard Large-Memory Visualization Machine-Learning Accelerated
Total Nodes 7 1,024 4 24 40
Processor AMD 7H12 Rome AMD 7H12 Rome AMD 7H12 Rome AMD 7H12 Rome AMD 7H12 Rome
Processor Speed 2.6 GHz 2.6 GHz 2.6 GHz 2.6 GHz 2.6 GHz
Sockets / Node 2 2 2 2 2
Cores / Node 128 128 128 128 128
Total CPU Cores 896 131,072 512 3,072 5,120
Usable Memory / Node 995 GB 503 GB 995 GB 503 GB 503 GB
Accelerators / Node None None None 1 2
Accelerator N/A N/A N/A NVIDIA V100 PCIe 3 NVIDIA V100 PCIe 3
Memory / Accelerator N/A N/A N/A 32 GB 32 GB
Storage on Node None None None None None
Interconnect Cray Slingshot Cray Slingshot Cray Slingshot Cray Slingshot Cray Slingshot
Operating System SLES SLES SLES SLES SLES

2.2. Login and Compute Nodes

Warhawk is intended as a batch-scheduled HPC system with numerous nodes. Its login nodes are for minor setup, housekeeping, and job preparation tasks and are not to be used for large computational work (e.g., memory-intensive, I/O-intensive, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by batch job submission. Node types such as "Standard", "Large-Memory", and "Machine-Learning Accelerated" are considered compute nodes. Warhawk uses both shared- and distributed-memory models. Memory is shared among all the cores on one node but is not shared among the nodes across the cluster.

Warhawk's login nodes use AMD 7H12 Rome processors with 995 GB of usable memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use more than 8 GB of memory at any one time.

Warhawk's standard compute nodes use AMD 7H12 Rome processors. Each node contains 503 GB of usable shared memory. Standard compute nodes are intended for typical compute jobs.

Warhawk's large-memory compute nodes use AMD 7H12 Rome processors. Each node contains 995 GB of usable shared memory. Large-memory compute nodes are intended for jobs requiring large amounts of memory.

Warhawk's visualization nodes consist of AMD 7H12 Rome processors paired with 1 NVIDIA V100 PCIe 3 GPU. Each node contains 503 GB of usable shared memory on the node, as well as 32 GB of shared memory internal to each GPU. Visualization compute nodes are intended for hardware-accelerated graphics.

Warhawk's machine-learning accelerated (MLA) nodes consist of AMD 7H12 Rome processors paired with 2 NVIDIA V100 PCIe 3 GPUs. Each node contains 503 GB of usable shared memory on the node, as well as 32 GB of shared memory internal to each GPU. MLA compute nodes are intended for intensive GPU applications such as machine learning and data analytics.

3. Accessing the System

3.1. Kerberos

For security purposes, you must have a current Kerberos ticket on your computer before attempting to connect to Warhawk. To obtain a ticket, you must either install a Kerberos client kit on your desktop or connect via the HPC Portal. Visit the Kerberos & Authentication page for information about installing Kerberos clients on your Windows, Linux, or Mac desktop. Instructions are also available on those pages for getting a ticket and logging into the HPC systems from each platform.

3.2. Logging In

The system host name for the Warhawk cluster is warhawk.afrl.hpc.mil, which redirects you to one of seven login nodes. Hostnames and IP addresses to these nodes are available upon request from the HPC Help Desk.

The preferred way to log into Warhawk is via ssh, as follows:

% ssh username@warhawk.afrl.hpc.mil

3.3. File Transfers

File transfers to DSRC systems (except for those to the local archive system) must be performed using the following HPCMP Kerberized tools: scp, mpscp, sftp, scampi, or tube. Windows users may use a graphical secure file transfer protocol (sftp) client such as FileZilla. See the HPC Help Desk Video on Using FileZilla. Before using any of these tools (except tube), you must use a Kerberos client to obtain a Kerberos ticket. Information about installing and using a Kerberos client can be found on the Kerberos & Authentication page.

The command below uses secure copy (scp) to copy a single local file into a destination directory on a Warhawk login node. The mpscp command is similar to the scp command, but it has a different underlying means of data transfer and may enable a greater transfer rate. The mpscp command has the same syntax as scp.

% scp local_file username@warhawk.afrl.hpc.mil:/target_dir

Both scp and mpscp can be used to send multiple files. This command transfers all files with the .txt extension to the same destination directory.

% scp *.txt username@warhawk.afrl.hpc.mil:/target_dir

The example below uses the secure file transfer protocol (sftp) to connect to Warhawk, then uses sftp's cd and put commands to change to the destination directory and copy a local file there. The sftp quit command ends the sftp session. Use the sftp help command to see a list of all sftp commands.

% sftp username@warhawk.afrl.hpc.mil
sftp> cd target_dir
sftp> put local_file
sftp> quit

4. User Environment

4.1. User Directories

The following user directories are provided for all users on Warhawk:

File Systems on Warhawk
Path Formatted Capacity File System Type Storage Type User Quota Minimum File Retention
/p/home ($HOME) 1.3 PB Lustre HDD 110 GB None
/p/work1 ($WORKDIR) 16.5 PB Lustre Hybrid SSD/HDD None 30 Days
/p/work2 4.9 PB Lustre Hybrid SSD/HDD None 30 Days
/p/cwfs ($CENTER) 3.3 PB GPFS HDD 100 TB 120 Days
/p/work1/projects ($PROJECTS_HOME) 336 TB Lustre HDD None None
4.1.1. Home Directory ($HOME)

When you log in, you are placed in your home directory, /p/home/username. It is accessible from the login and compute nodes and can be referenced by the environment variable $HOME.

Your home directory is intended for storage of frequently used files, scripts, and small utility programs. It has a 110-GB quota, and files stored there are not subject to automatic deletion based on age. It is backed up weekly to enable file restoration in the event of catastrophic system failure.

Important! The home file system is not tuned for parallel I/O and does not support application-level I/O. Jobs performing intensive file I/O in your home directory will perform poorly and cause problems for everyone on the system. Running jobs should use the work file system ($WORKDIR) for file I/O.

4.1.2. Work Directory ($WORKDIR)

The work file system is a large, high-performance Lustre-based file system tuned for parallel application-level I/O. It is accessible from the login and compute nodes and provides temporary file storage for queued and running jobs.

All users have a work directory, /p/work1/username, on this file system, which can be referenced by the environment variable $WORKDIR. This directory should be used for all application file I/O. NEVER allow your jobs to perform file I/O in $HOME.

$WORKDIR has no quota. It is not backed up or exported to any other system and is subject to an automated deletion cycle. If available disk space gets too low, files that have not been accessed in 30 days may be deleted. If this happens, or if a catastrophic disk failure occurs, lost files are irretrievable. To prevent the loss of important files, transfer them to a long-term storage area, such as your archival directory ($ARCHIVE_HOME, see Archive Usage), which has no quota, or, for smaller files, to your home directory ($HOME).

Maintaining the high performance and stability of the Lustre file system is important for the efficient and effective use of Warhawk by all users. For example, setting appropriate stripe counts can maximize your I/O performance and prevent you from filling up a single file system component, which can cause system instability. Additional examples can be found in $SAMPLES_HOME/Data_Management/OST_Stripes on Warhawk.

To avoid errors that can arise from two jobs using the same scratch directory, a common technique is to create a unique subdirectory for each batch job. See Sample Scripts for an example of a script that does this.
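A minimal sketch of this technique in a PBS batch script follows ($PBS_JOBID is set by PBS; the input path, output file, and executable name are placeholders). Alternatively, the $JOBDIR variable (see Batch-Only Environment Variables) already points to a job-specific directory in $WORKDIR.

#### Hypothetical batch-script excerpt: run each job in its own scratch subdirectory
JOBSCRATCH=$WORKDIR/${PBS_JOBID%%.*}    ## unique per job; strips the server suffix from the job ID
mkdir -p $JOBSCRATCH
cd $JOBSCRATCH
cp $HOME/my_project/input.dat .         ## stage input files (placeholder path)
mpiexec -n 256 ./MPI_executable         ## run from the job-specific directory
cp results.out $HOME/my_project/        ## save small results before the job ends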

4.1.3. Additional Work Directory (/p/work2)

This system includes an additional work directory, /p/work2. Contact the HPC Help Desk to request access to this file system.

4.1.4. Center Directory ($CENTER)

The Center-Wide File System (CWFS) is an NFS-mounted file system. It is accessible from the login nodes of all HPC systems at the center and from the HPC Portal. It provides centralized, shared storage that enables users to easily access data from multiple systems. The CWFS is not tuned for parallel I/O and does not support application-level I/O.

All users have a directory on the CWFS. The name of your directory may vary between systems and between centers, but the environment variable $CENTER always refers to this directory.

$CENTER has a quota of 100 TB. It is not backed up or exported to any other system and is subject to an automated deletion cycle. If available disk space gets too low, files that have not been accessed in 120 days may be deleted. If this happens, or if a catastrophic disk failure occurs, lost files are irretrievable. To prevent the loss of important files, transfer them to a long-term storage area, such as your archival directory ($ARCHIVE_HOME, see Archive Usage), which has no quota, or, for smaller files, to your home directory ($HOME).

4.1.5. Projects Directory ($PROJECTS_HOME)

The Projects directory, $PROJECTS_HOME, is a file system set aside for group-shared storage. It is intended for storage of semi-permanent files, similar to a home directory, but typically larger and shared by a group. It is not meant for high-speed application output ($WORKDIR, see Work Directory). A new project sub-directory can be created via an HPC Help Desk request and appears as follows: $PROJECTS_HOME/new_group_dir. The HPC Help Desk request must specify a UNIX group to be assigned to the project sub-directory. Users can create and manage UNIX groups in the Portal to the Information Environment, allowing the creator of the assigned group to manage the members of the group with access to the project sub-directory.

4.1.6. Specialized Temporary Directories

Each node includes several specialized directories.

The /tmp and /var/tmp directories are intended for temporary files created by the operating system. Do not use these directories for your own files, as filling up these file systems can cause problems on the node.

Warhawk also provides a "virtual" file system (i.e., "RAM disk") called /dev/shm which is local to each compute node. You may use this file system to store files in memory. It automatically increases in size as needed, up to half of the memory of the node. It is extremely fast, but it is also small and takes available node memory away from your application. An example use case is performing significant I/O with many small files when the memory is not otherwise needed by the application.
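A hedged sketch of that pattern follows (the tool and file names are placeholders; remember that /dev/shm is local to each node and its contents vanish when the job ends):

#### Stage many small files into the node-local RAM disk, then save results to $WORKDIR
mkdir -p /dev/shm/$USER
cp $WORKDIR/small_inputs/*.dat /dev/shm/$USER/     ## fast, node-local reads (placeholder path)
./my_tool --input-dir /dev/shm/$USER --output /dev/shm/$USER/out.dat    ## placeholder executable
cp /dev/shm/$USER/out.dat $WORKDIR/                ## copy results off the RAM disk
rm -rf /dev/shm/$USER                              ## free node memory when finished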

4.2. Shells

The following shells are available on Warhawk: csh, bash, ksh, tcsh, sh, and zsh.

To change your default shell, log into the Portal to the Information Environment and go to "User Information Environment" > "View/Modify personal account information". Scroll down to "Preferred Shell" and select your desired default shell. Then scroll to the bottom and click "Save Changes". Your requested change should take effect within 24 hours.

4.3. Environment Variables

A number of environment variables are provided by default on all HPCMP high performance computing (HPC) systems. We encourage you to use these variables in your scripts where possible. Doing so will help simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems.

4.3.1. Common Environment Variables

The following environment variables are automatically set in both your login and batch environments:

Common Environment Variables
Variable Description
$ARCHIVE_HOME Your directory on the archive system
$ARCHIVE_HOST The host name of the archive system
$BC_ACCELERATOR_NODE_CORES The number of CPU cores per node for a compute node which features CPUs and a hosted accelerator processor
$BC_BIGMEM_NODE_CORES The number of cores per node for a big memory (BIGMEM) compute node
$BC_CORES_PER_NODE The number of CPU cores per node for the node type on which the variable is queried
$BC_HOST The generic (not node specific) name of the system. Examples include centennial, mustang, onyx and gaffney
$BC_NODE_TYPE The type of node on which the variable is queried. Values of $BC_NODE_TYPE are: LOGIN, STANDARD, PHI, BIGMEM, BATCH, or ACCELERATOR
$BC_PHI_NODE_CORES The number of Phi cores per node, if the system has any Phi nodes. It will be set to 0 on systems without Phi nodes
$BC_STANDARD_NODE_CORES The number of CPU cores per node for a standard compute node
$CC The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded
$CENTER Your directory on the Center-Wide File System (CWFS)
$CSE_HOME The top-level directory for the Computational Science Environment (CSE) tools and applications
$CXX The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded
$DAAC_HOME The top level directory for the DAAC (Data Analysis and Assessment Center) supported tools
$F77 The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded
$F90 The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded
$HOME Your home directory on the system
$JAVA_HOME The directory containing the default installation of JAVA
$KRB5_HOME The directory containing the Kerberos utilities
$LOCALWORKDIR A high-speed work directory that is local and unique to an individual node, if the node provides such space
$PET_HOME The directory containing tools installed by PET staff, which are considered experimental or under evaluation. Certain older packages have been migrated to $CSE_HOME, as appropriate
$PROJECTS_ARCHIVE The directory on the archive system in which user-supported applications, code, and data may be kept
$PROJECTS_HOME The directory in which user-supported applications and codes may be installed
$SAMPLES_HOME A directory that contains the Sample Code Repository, a variety of sample codes and scripts provided by a center's staff
$WORKDIR Your work directory on the local temporary file system (i.e., local high-speed disk)
4.3.2. Batch-Only Environment Variables

In addition to the variables listed above, the following variables are automatically set only in your batch environment. That is, your batch scripts can see them when they run. These variables are supplied for your convenience and are intended for use inside your batch scripts.

Batch-Only Environment Variables
Variable Description
$BC_MEM_PER_NODE The approximate maximum memory (in integer MB) per node available to an end user program for the compute node type to which a job is being submitted
$BC_MPI_TASKS_ALLOC The number of MPI tasks allocated for a particular job
$BC_NODE_ALLOC The number of nodes allocated for a particular job
$JOBDIR Job-specific directory in $WORKDIR immune to scrubbing while job is active

Please refer to the Warhawk PBS Guide for a number of helpful environment variables provided during batch runs.
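As a hedged illustration, a batch script might use these variables to remain portable across node types and systems (the executable name is a placeholder):

#### Batch-script fragment using the batch-only variables
echo "Allocated $BC_NODE_ALLOC nodes and $BC_MPI_TASKS_ALLOC MPI tasks"
echo "Approximately $BC_MEM_PER_NODE MB of memory is available per node"
cd $JOBDIR                                     ## job-specific directory in $WORKDIR
mpiexec -n $BC_MPI_TASKS_ALLOC ./my_program    ## launch with the allocated task count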

4.4. Archive Usage

All our HPC systems have access to an online archival mass storage system that provides long-term storage for users' files on a petascale tape file system that resides on a robotic tape library system. A 100-TB disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.

Tape file systems have very slow access times. The tapes must be robotically pulled from the tape library, mounted in one of the limited number of tape drives, and wound into position for file archival or retrieval. For this reason, users should always tar up their small files in a large tarball when archiving a significant number of files. A good size range for tarballs is about 500 GB - 1 TB. At that size, the time required for file transfer and tape I/O is reasonable. Files larger than 10 TB will greatly increase the time required for both archival and retrieval. Files larger than 19 TB will not be archived.

The environment variable $ARCHIVE_HOME is automatically set for you and can be used to reference your archive directory when using archive commands.

4.4.1. Archive Command Synopsis

A synopsis of the archive utility is listed below. For information on additional capabilities, see the AFRL DSRC Archive Guide or read the online man page available on each system. The archive command is non-Kerberized and can be used in batch submission scripts if desired.

Copy one or more files from the archive system:

archive get [-C path] [-s] file1 [file2...]

List files and directory contents on the archive system:

archive ls [lsopts] [file/dir ...]

Create directories on the archive system:

archive mkdir [-C path] [-m mode] [-p] [-s] dir1 [dir2 ...]

Copy one or more files to the archive system:

archive put [-C path] [-D] [-s] file1 [file2 ...]
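For example, a typical workflow bundles a job's many small output files into one tarball before archiving it (directory and file names are illustrative):

cd $WORKDIR/run42
tar -cf run42_results.tar *.dat *.log     ## one large tarball instead of many small files
archive mkdir -p run42                    ## create a destination directory on the archive system
archive put -C run42 run42_results.tar    ## copy the tarball to the archive
archive ls run42                          ## verify the transfer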

5. Program Development

5.1. Modules

Software modules are a convenient way to set needed environment variables and include necessary directories in your path so commands for particular applications can be found. Warhawk also uses modules to initialize your environment with application software, system commands, libraries, and compiler suites.

A number of modules are loaded automatically as soon as you log in. To see the currently loaded modules, use the module list command. To see the entire list of available modules, use the module avail command. You can modify the configuration of your environment by loading and unloading modules. For complete information on how to do this and other information on using modules, see the AFRL DSRC Modules Guide.
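For example, a typical module session might look like the following sketch (the library module shown is illustrative; use module avail to confirm the exact module names and versions on Warhawk):

module list                           ## show currently loaded modules
module avail                          ## list all available modules
module load cray-fftw                 ## load a library module (example)
module swap PrgEnv-cray PrgEnv-gnu    ## switch to a different programming environment
module unload cray-fftw               ## remove a module when no longer needed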

5.2. Programming Models

Warhawk supports several parallel programming models. A programming model augments a programming language with parallel processing capability. Different programming models may use a different approach to express parallelism, such as message passing, threads, distributed memory, shared memory, etc.

Note, if an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 128 cores on Warhawk's standard nodes. See the Node Configuration table for core counts on other nodes.

Note, keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you need to parallelize your code so it can function across multiple nodes.

Key supported programming models are discussed in each subsection below.

5.2.1. Message Passing Interface (MPI)

Warhawk's default MPI stack supports the MPI 3.1 Standard. MPI is part of the software support for parallel programming across a network of computer systems through a technique known as message passing. MPI establishes a practical, portable, efficient, and flexible standard for high-performance message passing. See man intro_mpi for additional information.

When creating an MPI program, ensure that either a Programming Environment module (PrgEnv-cray, the default, or one of PrgEnv-intel, PrgEnv-gnu, PrgEnv-nvidia, or PrgEnv-aocc) or an HPE MPI Message Passing Toolkit (MPT) module (mpt or hmpt) is loaded. To check this, run the module list command. To load the desired module, run one of the following commands:

module load PrgEnv-type    ## where type is cray, intel, gnu, nvidia, or aocc
module load mpt            ## or: module load hmpt

If using the MPT library, also load the desired compiler module: intel, gcc, cce, nvidia, or aocc.

Also, ensure the source code contains one of the following for the MPI library:

INCLUDE "mpif.h"        ## for older Fortran
USE mpi                 ## for newer Fortran
#include <mpi.h>        ## for C/C++

To compile an MPI program with the Cray MPICH library, use one of the following:

ftn -o MPI_executable mpi_program.f       ## for Fortran
cc -o MPI_executable mpi_program.c        ## for C
CC -o MPI_executable mpi_program.cpp      ## for C++

To compile an MPI program with the HPE MPT library, use one of the following:

mpif90 -f90=compiler -o MPI_executable mpi_program.f   ## for Fortran
  # where compiler can be ifort, gfortran, crayftn, nvfortran, flang

mpicc -cc=compiler -o MPI_executable mpi_program.c     ## for C
  # where compiler can be icc, gcc, craycc, nvc, clang

mpicxx -cxx=compiler -o MPI_executable mpi_program.cpp ## for C++
  # where compiler can be icpc, g++, crayCC, nvc++, clang++

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers.

To run an MPI program within a batch script, load the same modules as used to compile the application.

In addition, for the HPE MPT suite only, use the following module commands:

module unload PrgEnv-cray
module load mpt
module swap cray-pals cray-pals

Use one of the following commands to execute, where mpi_procs is the number of MPI processes being started:

mpiexec -n mpi_procs ./MPI_executable [user_arguments]
aprun -n mpi_procs ./MPI_executable [user_arguments]

For example:

#### The following starts 256 MPI processes
#### (the placement of the processes on nodes is handled by the batch scheduler)
mpiexec -n 256 ./MPI_executable

Although the commands mpiexec(1) and aprun(1) have some similar options, they are not interchangeable. DO NOT USE mpiexec_mpt (or mpirun, which links to mpiexec_mpt) on Warhawk with HPE MPT.

For more information about mpiexec or aprun, type man mpiexec or man aprun.

For more information on which MPI Standard features are supported by the default MPI on the system, check the BC MPI Test Suite page.
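Putting these pieces together, a minimal PBS batch script for an MPI job might look like the following sketch (the project ID, queue, and executable are placeholders; see the Warhawk PBS Guide and Sample Scripts for authoritative examples):

#!/bin/bash
#PBS -A Project_ID                        ## your project ID (placeholder)
#PBS -q standard
#PBS -l select=2:ncpus=128:mpiprocs=128   ## two standard nodes, 128 MPI ranks each
#PBS -l walltime=01:00:00

cd $WORKDIR
module load PrgEnv-cray                   ## same environment used to compile
mpiexec -n 256 ./MPI_executable           ## 2 nodes x 128 ranks = 256 processes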

5.2.2. Open Multi-Processing (OpenMP)

OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications. It supports shared-memory multiprocessing programming in C, C++, and Fortran and consists of a set of compiler directives, library routines, and environment variables that influence compilation and run-time behavior.

When creating an OpenMP program, if using OpenMP functions (e.g., omp_get_wtime), ensure the source code includes one of the following lines:

INCLUDE "omp.h"        ## for older Fortran
USE omp_lib            ## for newer Fortran
#include <omp.h>       ## for C/C++

To compile an OpenMP program, ensure the desired compiler module is loaded. Use the following compiler commands and flags:

crayftn -homp -o OpenMP_executable openmp_program.f         ## for Cray Fortran
ifort -qopenmp -o OpenMP_executable openmp_program.f        ## for Intel Fortran
gfortran -fopenmp -o OpenMP_executable openmp_program.f     ## for GNU Fortran
flang -fopenmp -o OpenMP_executable openmp_program.f        ## for AOCC Fortran
nvfortran -mp -o OpenMP_executable openmp_program.f         ## for NVIDIA Fortran

craycc -fopenmp -o OpenMP_executable openmp_program.c       ## for Cray C
icc -qopenmp -o OpenMP_executable openmp_program.c          ## for Intel C
gcc -fopenmp -o OpenMP_executable openmp_program.c          ## for GNU C
clang -fopenmp -o OpenMP_executable openmp_program.c        ## for AOCC C
nvc -mp -o OpenMP_executable openmp_program.c               ## for NVIDIA C

crayCC -fopenmp -o OpenMP_executable openmp_program.cpp     ## for Cray C++
icpc -qopenmp -o OpenMP_executable openmp_program.cpp       ## for Intel C++
g++ -fopenmp -o OpenMP_executable openmp_program.cpp        ## for GNU C++
clang++ -fopenmp -o OpenMP_executable openmp_program.cpp    ## for AOCC C++
nvc++ -mp -o OpenMP_executable openmp_program.cpp           ## for NVIDIA C++

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers.

When running OpenMP applications, the $OMP_NUM_THREADS environment variable must be used to specify the number of threads. For example:

#### run 32 threads on one node
export OMP_NUM_THREADS=32
./OpenMP_executable [user_arguments]

In the example above, the application starts the OpenMP_executable on one node and spawns a total of 32 threads. Since Warhawk has 128 cores per compute node, if you wanted to run one thread per core, you would set $OMP_NUM_THREADS to 128 instead.
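A hedged batch-script sketch for a single-node OpenMP run follows (project ID, queue, and executable are placeholders):

#!/bin/bash
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=1:ncpus=128    ## OpenMP is limited to the cores of a single node
#PBS -l walltime=01:00:00

cd $WORKDIR
export OMP_NUM_THREADS=128    ## one thread per core on a standard node
./OpenMP_executable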

5.2.3. Hybrid MPI/OpenMP

An application built with the hybrid model of parallel programming can run using both OpenMP and Message Passing Interface (MPI). This allows the application to run on multiple nodes yet leverages OpenMP's advantages within each node. In hybrid applications, multiple OpenMP threads are spawned by MPI processes, but MPI calls should not be issued from OpenMP parallel regions or by an OpenMP thread.

When creating a hybrid MPI/OpenMP program, follow the instructions in both the MPI and OpenMP sections above for creating your program.

To compile a hybrid program with Cray MPICH, use the MPI compiler wrappers in conjunction with the OpenMP options, as follows:

ftn -homp -o hybrid_executable hybrid_program.f         ## for Cray Fortran
ftn -qopenmp -o hybrid_executable hybrid_program.f      ## for Intel Fortran
ftn -fopenmp -o hybrid_executable hybrid_program.f      ## for GNU Fortran
ftn -fopenmp -o hybrid_executable hybrid_program.f      ## for AOCC Fortran
ftn -mp -o hybrid_executable hybrid_program.f           ## for NVIDIA Fortran

cc -fopenmp -o hybrid_executable hybrid_program.c       ## for Cray C
cc -qopenmp -o hybrid_executable hybrid_program.c       ## for Intel C
cc -fopenmp -o hybrid_executable hybrid_program.c       ## for GNU C
cc -fopenmp -o hybrid_executable hybrid_program.c       ## for AOCC C
cc -mp -o hybrid_executable hybrid_program.c            ## for NVIDIA C

CC -fopenmp -o hybrid_executable hybrid_program.cpp     ## for Cray C++
CC -qopenmp -o hybrid_executable hybrid_program.cpp     ## for Intel C++
CC -fopenmp -o hybrid_executable hybrid_program.cpp     ## for GNU C++
CC -fopenmp -o hybrid_executable hybrid_program.cpp     ## for AOCC C++
CC -mp -o hybrid_executable hybrid_program.cpp          ## for NVIDIA C++

To compile a hybrid program with HPE MPT, use the MPI compiler wrappers and flags in conjunction with the OpenMP options, as follows:

mpif90 -f90=crayftn -homp -o hybrid_executable hybrid_program.f      ## for Cray Fortran
mpif90 -f90=ifort -qopenmp -o hybrid_executable hybrid_program.f     ## for Intel Fortran
mpif90 -f90=gfortran -fopenmp -o hybrid_executable hybrid_program.f  ## for GNU Fortran
mpif90 -f90=flang -fopenmp -o hybrid_executable hybrid_program.f     ## for AOCC Fortran
mpif90 -f90=nvfortran -mp -o hybrid_executable hybrid_program.f      ## for NVIDIA Fortran

mpicc -cc=craycc -fopenmp -o hybrid_executable  hybrid_program.c     ## for Cray C
mpicc -cc=icc -qopenmp -o hybrid_executable hybrid_program.c         ## for Intel C
mpicc -cc=gcc -fopenmp -o hybrid_executable hybrid_program.c         ## for GNU C
mpicc -cc=clang -fopenmp -o hybrid_executable hybrid_program.c       ## for AOCC C
mpicc -cc=nvc -mp -o hybrid_executable hybrid_program.c              ## for NVIDIA C

mpicxx -cxx=crayCC -fopenmp -o hybrid_executable hybrid_program.cpp  ## for Cray C++
mpicxx -cxx=icpc -qopenmp -o hybrid_executable hybrid_program.cpp    ## for Intel C++
mpicxx -cxx=g++ -fopenmp -o hybrid_executable hybrid_program.cpp     ## for GNU C++
mpicxx -cxx=clang++ -fopenmp -o hybrid_executable hybrid_program.cpp ## for AOCC C++
mpicxx -cxx=nvc++ -mp -o hybrid_executable hybrid_program.cpp        ## for NVIDIA C++

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers.

When running hybrid MPI/OpenMP programs, use the MPI launcher as in MPI programs along with the $OMP_NUM_THREADS environment variable to specify the number of threads per MPI process. In the following example, four MPI processes will spawn eight threads each for a total of 32 threads:

#### run 32 hybrid threads (4 MPI procs, 8 threads per proc)
export OMP_NUM_THREADS=8
mpiexec -n 4 ./hybrid_executable [user_arguments]

Ensure the number of threads per node does not exceed the number of cores on each node. See the mpiexec and aprun man pages and the Batch Scheduling section for more detail on how MPI processes and threads are allocated on the nodes.
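As a sketch, a hybrid job on two standard nodes might place 4 MPI processes per node with 32 threads each (the ompthreads select attribute is a common PBS Pro convention and is an assumption here; consult the Warhawk PBS Guide for the locally supported syntax):

#!/bin/bash
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=128:mpiprocs=4:ompthreads=32   ## 4 ranks/node x 32 threads/rank = 128 cores/node
#PBS -l walltime=01:00:00

cd $WORKDIR
export OMP_NUM_THREADS=32
mpiexec -n 8 ./hybrid_executable    ## 2 nodes x 4 ranks = 8 MPI processes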

5.2.4. SHMEM

Logically shared, distributed-memory access (SHMEM) routines provide high-performance, high-bandwidth communication for use in highly parallelized scalable programs. The SHMEM data-passing library routines are similar to the MPI library routines; they pass data between cooperating parallel processes. The SHMEM data-passing routines can be used in programs that perform computations in separate address spaces and explicitly pass data to and from different processes in the program. Cray's SHMEM implementation is OpenSHMEMX. After loading the cray-openshmemx module, see the intro_shmem man page.

When creating a SHMEM program, load the desired compiler module (currently Cray, Intel, and GNU are supported) and load the cray-openshmemx module. Ensure the source code includes one of the following lines:

INCLUDE "mpp/shmem.fh"       ## for Fortran
#include <mpp/shmem.h>       ## for C/C++

To compile a SHMEM program, load the cray-openshmemx module and use the Cray MPICH Compiler wrappers:

ftn -o SHMEM_executable shmem_program.f     ## for Fortran
cc -o SHMEM_executable shmem_program.c      ## for C
CC -o SHMEM_executable shmem_program.cpp    ## for C++

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers.

To run SHMEM applications, use the following command:

aprun -n num_procs ./SHMEM_executable [user_arguments]

The -n num_procs option indicates the number of processes to start, each process using one core.

5.2.5. Co-Array Fortran (CAF)

The Cray compiler supports Co-Array Fortran (CAF). This is a set of Partitioned Global Address Space (PGAS) extensions that lets you reference memory locations on any node without the need for message-passing protocols. This can greatly simplify writing and debugging parallel code.

To compile a CAF program, use the Cray compiler wrapper:

ftn -o CAF_executable -h caf caf_program.f     ## for Cray compiler

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers. Other compilers not listed here may support CAF but may not be integrated to run across multiple nodes.

To run CAF applications, use the following command:

aprun -n num_procs ./CAF_executable [user_arguments]

The -n num_procs option indicates the number of processes to start, each process using one core.

Many users of PGAS extensions also use MPI or SHMEM calls in their codes. In such cases, be sure to use the appropriate include statements in your source code, as described in the respective sections above.

5.2.6. Unified Parallel C (UPC)

The Cray compiler supports Unified Parallel C (UPC). This is a set of Partitioned Global Address Space (PGAS) extensions that lets you reference memory locations on any node, without the need for message-passing protocols. This can greatly simplify writing and debugging a parallel code.

To compile a UPC program, use the Cray compiler wrapper:

cc -h upc -o UPC_executable upc_program.c     ## for Cray compiler

For more information on compilers, compiler wrappers, and compiler options, see Available Compilers. Other compilers not listed here may support UPC but may not be integrated to run across multiple nodes.

To run UPC applications, use the following command:

aprun -n num_procs ./UPC_executable [user_arguments]

The -n num_procs option indicates the number of processes to start, each process using one core.

Many users of PGAS extensions also use MPI or SHMEM calls in their codes. In such cases, be sure to use the appropriate include statements in your source code, as described in the corresponding sections above.

5.3. Available Compilers

Warhawk has five compiler suites:

  • Cray
  • Intel
  • GNU
  • NVIDIA
  • AMD Optimizing C/C++ compilers (AOCC)

The Cray compiler suite module is loaded by default.

Compiling can be affected by which MPI stack is being used. Warhawk has two MPI stacks:

  • Cray MPICH
  • HPE MPT

For more information about MPI, or if you are using another programming model besides MPI, see Programming Models above.

All versions of MPI share a common base set of compilers that are available on both the login and compute nodes. Codes running on the login nodes must be serial. The following table lists serial compiler commands for each language.

Serial Compiler Commands
Compiler Cray Intel GNU NVIDIA AOCC
C craycc icc gcc nvc clang
C++ crayCC icpc g++ nvc++ clang++
Fortran 77 crayftn ifort gfortran nvfortran flang
Fortran 90 crayftn ifort gfortran nvfortran flang

Codes running on compute nodes may be serial or parallel. To compile parallel codes with Cray MPICH, use the PrgEnv-type modules (where type can be cray, intel, gnu, nvidia, aocc) and the following compiler wrappers:

Parallel MPICH Compiler Wrapper Commands
Compiler Cray Intel GNU NVIDIA AOCC
C cc cc cc cc cc
C++ CC CC CC CC CC
Fortran 77 ftn ftn ftn ftn ftn
Fortran 90 ftn ftn ftn ftn ftn

To compile parallel codes with HPE MPT, use the mpt or hmpt modules and the following compiler wrappers:

Parallel HPE MPT Compiler Wrapper Commands
Compiler Cray Intel GNU NVIDIA AOCC
C mpicc -cc=craycc mpicc -cc=icc mpicc -cc=gcc mpicc -cc=nvc mpicc -cc=clang
C++ mpicxx -cxx=crayCC mpicxx -cxx=icpc mpicxx -cxx=g++ mpicxx -cxx=nvc++ mpicxx -cxx=clang++
Fortran 77 mpif90 -f90=crayftn mpif90 -f90=ifort mpif90 -f90=gfortran mpif90 -f90=nvfortran mpif90 -f90=flang
Fortran 90 mpif90 -f90=crayftn mpif90 -f90=ifort mpif90 -f90=gfortran mpif90 -f90=nvfortran mpif90 -f90=flang

For more information about compiling with MPI, see Programming Models above.

5.3.1. Cray Compiler Environment

The HPE Cray Programming Environment has C, C++, and Fortran compilers designed to extract increased performance from the system, regardless of the underlying architecture. These compilers can be loaded with the PrgEnv-cray module for Cray MPICH or the craype module with HPE MPT. The following table lists some of the more common options you may use:

Common Cray Compiler Options
Option Purpose
-c Generate intermediate object file but do not attempt to link
-I directory Search in directory for include or module files
-L directory Search in directory for libraries
-o outfile Name executable "outfile" rather than the default "a.out"
-Olevel Set the optimization level. For more information on optimization, see the sections on Compiler Optimization and Code Profiling
-g Generate symbolic debug information
-fPIC Generate position-independent code for shared libraries
-f free Process Fortran codes using free form
-m0 Reports detailed information about code optimizations to stdout as compile proceeds
-fopenmp Recognize OpenMP directives (C/C++)
-homp Recognize OpenMP directives (Fortran)
-hdynamic Compile using shared objects
-K trap=fp Trap floating point, divide by zero, and overflow exceptions

Detailed information about these and other compiler options is available in the Cray compiler (craycc, crayCC, crayftn) man pages.

5.3.2. Intel Compiler Environment

The Intel compiler is a highly optimizing compiler typically producing very fast executables for Intel processors. This compiler can be loaded with the PrgEnv-intel module for Cray MPICH or the intel module with HPE MPT. The following table lists some of the more common options you may use:

Common Intel Compiler Options
Option Purpose
-c Generate intermediate object file but do not attempt to link
-I directory Search in directory for include or module files
-L directory Search in directory for libraries
-o outfile Name executable "outfile" rather than the default "a.out"
-Olevel Set the optimization level. For more information on optimization, see the sections on Compiler Optimization and Code Profiling
-g Generate symbolic debug information
-fPIC Generate position-independent code for shared libraries
-ip Single-file interprocedural optimization. See the sections on Compiler Optimization and Code Profiling
-ipo Multi-file interprocedural optimization. See the sections on Compiler Optimization and Code Profiling
-free Process Fortran codes using free form
-convert big_endian Big-endian files; the default is little-endian
-qopenmp Recognize OpenMP directives
-Bdynamic Compile using shared objects
-fpe-all=0 Trap floating point, divide by zero, and overflow exceptions

Detailed information about these and other compiler options is available in the Intel compiler (ifort, icc, and icpc) man pages.

5.3.3. GNU Compiler Collection (GCC)

The GCC Programming Environment is a popular open-source compiler typically found on all Linux systems and generally works in a compatible manner across these systems. It provides many options that are the same for all compilers in the suite. This compiler can be loaded with the PrgEnv-gnu module for Cray MPICH or the gcc module for HPE MPT. The following table lists some of the more common options you may use:

Common GCC Compiler Options
Option Purpose
-c Generate intermediate object file but do not attempt to link
-I directory Search in directory for include or module files
-L directory Search in directory for libraries
-o outfile Name executable "outfile" rather than the default "a.out"
-Olevel Set the optimization level. For more information on optimization, see the sections on Compiler Optimization and Code Profiling
-g Generate symbolic debug information
-fPIC Generate position-independent code for shared libraries
-fconvert=big-endian Read/write big-endian files; the default is little-endian
-Wextra -Wall Turns on increased error reporting

Detailed information about these and other compiler options is available in the GNU compiler (gfortran, gcc, and g++) man pages.

5.3.4. AOCC Compiler Environment

The AMD Optimizing C/C++ Compiler (AOCC) is a high-performance, production-quality code generation tool. The AOCC environment provides various options to users when building and optimizing C, C++, and Fortran applications. AOCC uses LLVM's Clang as the compiler and driver for C and C++ programs, and Flang as the compiler and driver for Fortran programs. This compiler can be loaded with the PrgEnv-aocc module for Cray MPICH or the aocc module for HPE MPT. The following table lists some of the more common options you may use:

Common AOCC Compiler Options
Option Purpose
-c Generate intermediate object file but do not attempt to link
-I directory Search in directory for include or module files
-L directory Search in directory for libraries
-o outfile Name executable "outfile" rather than the default "a.out"
-Olevel Set the optimization level. For more information on optimization, see the sections on Compiler Optimization and Code Profiling
-g Generate symbolic debug information
-ffree-form Compile free form Fortran

Detailed information about these and other compiler options is available in the AOCC compiler (clang, clang++, and flang) man pages.

5.3.5. NVIDIA Compiler Environment

The NVIDIA HPC Software Development Kit (SDK) is a comprehensive suite of compilers and libraries enabling users to program the entire HPC platform from the GPU to the CPU and through the interconnect. The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC directives and CUDA. This compiler can be loaded with the PrgEnv-nvidia module for Cray MPICH or the nvidia module for HPE MPT. The following table lists some of the more common options you may use:

Common NVIDIA Compiler Options
Option Purpose
-c Generate intermediate object file but do not attempt to link
-I directory Search in directory for include or module files
-L directory Search in directory for libraries
-o outfile Name executable "outfile" rather than the default "a.out"
-Olevel Set the optimization level. For more information on optimization, see the sections on Compiler Optimization and Code Profiling
-g Generate symbolic debug information
-fPIC Generate position-independent code for shared libraries
-acc Enable parallelization using OpenACC directives. By default, the compilers will parallelize and offload OpenACC regions to an NVIDIA GPU
-gpu Control the type of GPU for which code is generated, the version of CUDA to be targeted, and several other aspects of GPU code generation
-Mfree Compile free form Fortran
-Minfo=acc Prints diagnostic information to STDERR regarding whether the compiler was able to produce GPU code successfully

Detailed information about these and other compiler options is available in the NVIDIA compiler (nvc, nvc++, and nvfortran) man pages.

5.4. Libraries

Several scientific and math libraries are available on Warhawk. The libraries provided by the vendor or compiler are typically faster than the open-source equivalents available in the Computational Science Environment (CSE).

5.4.1. Cray Scientific and Math Libraries (CSML) LibSci

The Cray Scientific and Math Libraries (CSML, also known as LibSci) is a collection of numerical routines optimized for best performance on Cray systems. All programming environment modules load cray-libsci by default, except when noted.

Most users, on most codes, find better performance by using calls to Cray LibSci routines in their applications instead of calls to public domain or user-written versions.

Note, additionally, Cray EX systems make use of the Cray LibSci Accelerator routines for enhanced performance on GPU-equipped compute nodes. For more information, see the intro_libsci_acc man page.

The LibSci collection, available in C and Fortran, contains the following scientific libraries:

  • Basic Linear Algebra Subroutines (BLAS) - Levels 1, 2, and 3
  • C interface to the legacy BLAS (CBLAS)
  • Basic Linear Algebra Communication Subprograms (BLACS)
  • Linear Algebra Package (LAPACK)
  • Scalable LAPACK (ScaLAPACK) (distributed-memory parallel set of LAPACK routines)
  • Fast Fourier Transform (FFT)
  • Fastest Fourier Transform in the West Routines (FFTW versions 2 and 3)
  • Accelerated BLAS and LAPACK routines (LibSci_ACC)

Two libraries unique to Cray are also included:

  • Iterative Refinement Toolkit (IRT)
  • CrayBLAS (library of BLAS routines autotuned for Cray EX series)

The IRT routines may be used by setting the environment variable $IRT_USE_SOLVERS to 1 or by coding an explicit call to an IRT routine. Additional information is available by using the man intro_irt command.

To link to the Cray LibSci libraries, add the link entry -lsci_compiler[_mpi][_mp], where compiler can be cray, intel, gnu, nvidia, or aocc; the optional _mpi and _mp suffixes select the MPI-enabled and multithreaded versions, respectively.
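As an illustration only, an explicit link against the MPI-enabled, multithreaded GNU variant might look like the following (the source file name is a placeholder; with the PrgEnv modules loaded, the compiler wrappers normally link LibSci automatically):

ftn -fopenmp -o solver solver.f90 -lsci_gnu_mpi_mp    ## explicit LibSci link (GNU, MPI-enabled, multithreaded)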

5.4.2. Intel Math Kernel Library (MKL)

Warhawk provides the Intel Math Kernel Library (Intel MKL), a set of numerical routines tuned specifically for Intel platform processors and optimized for math, scientific, and engineering applications. The routines, which are available via both Fortran and C interfaces, include:

  • LAPACK plus BLAS (Levels 1, 2, and 3)
  • ScaLAPACK plus PBLAS (Levels 1, 2, and 3)
  • Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types
  • Discrete Fourier Transforms (DFTs)
  • Fast Math and Fast Vector Library
  • Vector Statistical Library Functions (VSL)
  • Vector Transcendental Math Functions (VML)

The MKL routines are part of the Intel Programming Environment as Intel's MKL is bundled with the Intel Compiler Suite.

Linking to the Intel Math Kernel Libraries can be complex and is beyond the scope of this introductory guide. Documentation explaining the full feature set along with instructions for linking can be found at the Intel Math Kernel Library documentation page.

Intel also makes a link advisor available to assist users with selecting proper linker and compiler options: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html.

5.4.3. Other Cray-supplied Libraries

The following Cray-optimized libraries are also available:

  • FFTW - Discrete Fourier Transform libraries
  • HDF5 - Hierarchical Data Format library (serial and parallel)
  • NETCDF - Network Common Data Format library (serial and parallel)

The modulefiles for these libraries are of the form cray-library_name. Use module avail cray- to find the appropriate modulefile. More information about linking these libraries is in the documentation, which is available after loading the associated modulefiles.
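For example, a hedged sketch of loading and using these libraries (module names and versions on Warhawk may differ; confirm with module avail cray-):

module avail cray-            ## list the Cray-provided library modules
module load cray-hdf5         ## serial HDF5; the parallel build has its own module
module load cray-netcdf
ftn -o io_test io_test.f90    ## the compiler wrappers pick up the include and library paths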

5.4.4. Additional Libraries

There is also an extensive set of math, I/O, and other libraries available in the $CSE_HOME directory on Warhawk. Information about these libraries can be found on the Baseline Configuration website at BC policy FY13-01 and the CSE Quick Reference Guide.

5.5. Debuggers

Warhawk has the following debugging tools: Cray Debugger Support Tools (CDST), Forge DDT, and GNU Project Debugger (gdb). These debugging tools range from simple command-line debuggers to separately licensed third-party GUI tools. They can perform a variety of tasks ranging from analyzing core files to setting breakpoints and debugging running parallel programs. As a rule, your code must be compiled using the -g command-line option.

For in-depth training using debuggers, visit the PET Knowledge Management Learning System and search for "debug" or use the following search link.

5.5.1. Forge DDT

DDT supports threads, MPI, OpenMP, C/C++, Fortran, Co-Array Fortran, UPC, and CUDA. Memory debugging and data visualization are supported for large-scale parallel applications. The Parallel Stack Viewer is a unique way to see the program state of all processes and threads at a glance.

DDT is a graphical debugger; therefore, you must be able to display it via a UNIX X-Windows interface. There are several ways to do this including SSH X11 Forwarding, HPC Portal, or SRD. Follow the steps below to use DDT via X11 Forwarding or Portal.

  1. Choose a remote display method: X11 Forwarding, HPC Portal, or SRD. X11 Forwarding is easier but typically very slow. HPC Portal requires no extra clients and is typically fast. SRD requires an extra client but is typically fast and may be a good option if doing a significant amount of X11 Forwarding.
    1. To use X11 Forwarding:
      1. Ensure an X server is running on your local system. Linux users will likely have this by default, but MS Windows users need to install a third-party X Windows solution. There are various options available.
      2. For Linux users, connect to Warhawk using ssh -Y. Windows users need to use PuTTY with X11 forwarding enabled (Connection->SSH->X11->Enable X11 forwarding).
    2. Or to use HPC Portal:
      1. Navigate to https://centers.hpc.mil/portal.
      2. Select HPC Portal at AFRL.
      3. Select XTerm | AFRL | Warhawk.
    3. Or, for information on using SRD, see the SRD User Guide.
  2. Compile your program with the -g option.
  3. Submit an interactive job, as in the following example: qsub -l select=1:ncpus=128:mpiprocs=128 -A Project_ID -l walltime=00:30:00 -q debug -X -I
  4. Load the Forge DDT module: module load forge
  5. Start program execution: ddt -n 4 ./my_mpi_program arg1 arg2 ... (Example for four MPI ranks)
  6. The DDT window will pop up. Verify the application name and number of MPI processes. Click "Run".

An example of using DDT can be found in $SAMPLES_HOME/Programming/DDT_Example on Warhawk. For more information on using DDT, see the Forge User's Manual. There is also a PET course available: Debugging and Optimizing Parallel Codes with Arm Forge (MAP and DDT).

5.5.2. GNU Project Debugger (gdb)

The gdb debugger is a source-level debugger that can be invoked either with a program for execution or with a running process ID. It is serial-only. To launch your program under gdb for debugging (optionally with a core file for post-mortem analysis), use the following command:

gdb a.out corefile

To attach gdb to a program that is already executing on a node, use the following command:

gdb a.out pid
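For example, a minimal serial debugging session might look like this (the program and input names are placeholders):

cc -g -O0 -o my_tool my_tool.c    ## compile with debug symbols and minimal optimization
gdb ./my_tool                     ## start the program under gdb
(gdb) run input.dat               ## run with the usual arguments
(gdb) backtrace                   ## after a crash, show where it occurred
(gdb) quit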

For more information, the GDB manual can be found at http://www.gnu.org/software/gdb.

5.5.3. Cray Debugger Support Tools (CDST)

Cray provides a collection of debugging packages that include the following:

Gdb4hpc
gdb4hpc is a GDB-based parallel debugger used to debug applications compiled with CCE, Intel, and GNU C, C++, and Fortran compilers. It allows users to either launch an application or attach to an already-running application. This debugger can be accessed by loading the gdb4hpc module. Detailed information about this debugger can be found in the gdb4hpc man page on Warhawk.

Valgrind4hpc
Valgrind4hpc is a Valgrind-based debugging tool used to detect memory leaks and errors in parallel applications. Valgrind4hpc aggregates any duplicate messages across ranks to help provide an understandable picture of program behavior. This tool can be accessed by loading the valgrind4hpc module. Detailed information can be found in the valgrind4hpc man page on Warhawk.

Stack Trace Analysis Tool (STAT)
STAT is a single merged stack backtrace tool to analyze application behavior at the function level. It helps track down the cause of crashes. This tool can be accessed by loading the cray-stat module. Detailed information can be found in the STAT man page on Warhawk.

Abnormal Termination Processing (ATP)
ATP is a scalable core file generation and analysis tool for determining what causes a code to crash. It can be accessed by loading the atp module. Detailed information can be found in the atp man page on Warhawk.

Cray Comparative Debugger (CCDB)
CCDB is Cray's next generation debugging tool. It features a GUI interface that extends the comparative debugging capabilities of gdb, enabling users to easily compare data structures between two executing applications. This tool can be accessed by loading the cray-ccdb module. Detailed information can be found in the ccdb man page on Warhawk.

5.6. Code Profiling

Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization, which increases the performance of a program by modifying certain aspects for increased efficiency.

We provide several profiling tools: Forge MAP, gprof, and codecov to assist in the profiling process. In addition, a basic overview of optimization methods with information about how they may improve the performance of your code can be found in the Techniques for Improving Performance guide.

For in-depth training on using profiling tools, visit the PET Knowledge Management Learning System and search for "profiling" or use the following search link.

5.6.1. Forge MAP

The MAP profiler is a scalable, low-overhead tool to display how your program is spending its time and potentially reveal the causes of slow performance. It profiles C, C++, Fortran, and Python with no relinking, instrumentation, or code changes (though you must compile with -g). It also works with MPI (potentially at large scales), OpenMP, threads, and I/O.

To use MAP, load the Forge module: module load forge

MAP can be used interactively or in offline mode. To use it interactively, follow the interactive job instructions 1-4 from the Forge DDT Section, but run the map command instead of ddt to start program execution as follows:

map -n 4 ./my_mpi_program arg1 arg2 ... (Example for four MPI ranks)

Using MAP in offline mode produces a profile (.map) output file that can be analyzed later, at your convenience, and without an actively running (potentially long, very large core-hour) job. This is more efficient for profiling at larger scales. To use MAP in offline mode, modify your batch script to include the forge module and run your application as follows:

map --profile -n 4 ./my_mpi_program arg1 arg2 ... (Example for four MPI ranks)

You may view the resulting .map file in the Forge GUI. This can be done in Forge on a Warhawk login node by following the interactive job instructions 1-4 from the Forge DDT Section, skipping step 3. Or you can download a free client from the Linaro Forge site, transfer the .map file, and open it on your local system. Note, the Forge client version must match the version of Forge on Warhawk.

For more information on using MAP, see the Forge User's Manual. There is also a PET course available: Debugging and Optimizing Parallel Codes with Arm Forge (MAP and DDT).

5.6.2. GNU Project Profiler (gprof)

The gprof profiler shows how your program is spending its time and which function calls are made. It works best for serial codes but can be used for small parallel codes, though it will not provide MPI or threaded information.

To profile code using gprof, use the -pg option during compilation. It will automatically generate profile information when executed. Use the gprof command to view the profile information. See man gprof on Warhawk or the gprof web site for more information.
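A minimal sequence might look like this (the program name is a placeholder):

cc -pg -o my_tool my_tool.c               ## build with profiling instrumentation
./my_tool                                 ## a normal run writes gmon.out in the current directory
gprof ./my_tool gmon.out > profile.txt    ## generate a human-readable profile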

5.6.3. Additional Profiling Tools

There is also a set of profiling tools available in CSE. Information about these tools may be found on the Baseline Configuration website at BC policy FY13-01 and the CSE Quick Reference Guide.

5.7. Compiler Optimization Options

The -Olevel option enables code optimization when compiling. The level you choose (0-4 depending upon the compiler) determines how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly but may also cause a loss of precision. There are additional options that may enable further optimizations. The following table contains the most commonly used options.

Compiler Optimization Options
Option Purpose Compiler Suite
-O0 No Optimization. (default in GNU) All
-O1 Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization. All
-O2 Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in Cray and Intel) All
-O3 Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial. All
-fipa-* The GNU compilers automatically enable IPA at various -O levels. To set these manually, see the options beginning with -fipa in the gcc man page GNU
-finline-functions Enables function inlining within a single file Intel
-ip Enables interprocedural optimization within single files at a time Intel
-ipo[n] Enables interprocedural optimization between files and produces up to n object files (default: n=0) Intel
-inline-level=n Number of levels of inlining (default: n=2) Intel
-opt-report[n] Generate an optimization report with n levels of detail Intel
-xHost Generate code with the highest vector instruction set available on the processor Intel
-fp-model model Used to tune the floating-point optimizations, typically to override -On. -O3 uses model=fast, which may be considered too imprecise for scientific codes, so -O3 is often used in conjunction with -fp-model precise, consistent, or strict Intel
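For example, a compile line combining several of the Intel options above (the source and output names are placeholders) might look like:

# Aggressive optimization with host-tuned vectorization and stricter floating-point semantics
ifort -O3 -xHost -fp-model precise -o my_solver my_solver.f90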

6. Batch Scheduling

6.1. Scheduler

The Portable Batch System (PBS) is currently running on Warhawk. It schedules jobs, manages resources and job queues, and can be accessed through the interactive batch environment or by submitting a batch request. PBS can manage both single-processor and multiprocessor jobs. The appropriate module is automatically loaded for you when you log in. This section is merely a brief introduction to PBS; please see the Warhawk PBS Guide for more details.

6.2. Queue Information

The following table describes the PBS queues available on Warhawk, listed in order of decreasing priority:

Queue Descriptions and Limits on Warhawk
Priority Queue Name Max Wall Clock Time Max Cores Per Job Max Queued Per User Max Running Per User Description
Highest urgent 168 Hours 69,888 N/A N/A Jobs belonging to DoD HPCMP Urgent Projects
debug 1 Hour 2,816 15 4 Time/resource-limited for user testing and debug purposes
high 168 Hours 69,888 N/A N/A Jobs belonging to DoD HPCMP High Priority Projects
frontier 168 Hours 69,888 N/A N/A Jobs belonging to DoD HPCMP Frontier Projects
standard 168 Hours 69,888 N/A N/A Standard jobs
HIE 24 Hours 256 N/A 2 Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide.
transfer 48 Hours 1 N/A 12 Data transfer for user jobs. Not charged against project allocation. See the AFRL DSRC Archive Guide, section 5.2.
Lowest background 24 Hours 2,816 N/A 35 User jobs that are not charged against the project allocation

6.3. Interactive Logins

When you log in to Warhawk, you will be running in an interactive shell on a login node. The login nodes provide login access for Warhawk and support such activities as compiling, editing, and general interactive use by all users. Please note the AFRL DSRC Login Node Abuse policy. The preferred method to run resource-intensive interactive executions is to use an interactive batch session (see Interactive Batch Sessions below).

6.4. Batch Request Submission

PBS batch jobs are submitted via the qsub command. The format of this command is:

qsub [ options ] batch_script_file

qsub options may be specified on the command line or embedded in the batch script file by lines beginning with #PBS. Some of these options are discussed in Batch Resource Directives below. The batch script file is not required for interactive batch sessions (see Interactive Batch Sessions).

For a more thorough discussion of PBS Batch Submission, see the Warhawk PBS Guide.

6.5. Batch Resource Directives

Batch resource directives allow you to specify how your batch jobs should be run and the resources your job requires. Although PBS has many directives, you only need to know a few to run most jobs.

PBS directives can be specified in your batch script or on the command line. The syntax for a batch file is as follows:

#PBS -directive1 [option1[=value1]]
#PBS -directive2 [option2[=value2]]
...

The syntax for the command line is as follows: qsub -directive1 [option1[=value1]] -directive2 [option2[=value2]] ...

Some options may require values. For example, to start a 32-process job, request one node of 128 cores and specify 32 processes per node, as follows: #PBS -l select=1:ncpus=128:mpiprocs=32

If no batch file is specified, then all required directives must be specified on the command line, as follows: qsub -l select=N1:ncpus=128:mpiprocs=N2:Nodetype -A Project_ID -q Queue_Name -l walltime=HHH:MM:SS ...

You must specify the desired maximum walltime (HHH:MM:SS), Project_ID, and Queue_Name. The number of nodes requested (N1) defaults to 1. The number of processes per node (N2) can range from 1 to 128 and defaults to 128.
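For example, a complete command-line submission of a two-hour, two-node job to the standard queue might look like the following (the project ID and script name are placeholders):

qsub -l select=2:ncpus=128:mpiprocs=128 -A Project_ID -q standard -l walltime=02:00:00 my_job_script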

Note, command-line use is required for interactive batch sessions (see Interactive Batch Sessions) since no batch file is specified.

The Nodetype parameter is optional. To specify the node type on which your job will run, select the associated directive option from the following table:

Node Type Directives
Node Type Directive Option
Standard (standard node is the default, no directive required)
Large-memory bigmem=1
Visualization (1 GPU) ngpus=1
MLA (2 GPUs) ngpus=2

For example, to request three large-memory nodes: #PBS -l select=3:ncpus=128:mpiprocs=128:bigmem=1

To request an MLA (2-GPU) node: #PBS -l select=1:ncpus=128:mpiprocs=128:ngpus=2

The following directives are required for all jobs:

Required PBS Directives
Directive Description
-A Project_ID Name of the project (defaults to $ACCOUNT if not specified)
-q Queue_Name Name of the queue
-l walltime=HHH:MM:SS Maximum wall time in hours, minutes, and seconds
-l select=#:ncpus=128 Select sets the number of requested nodes.
ncpus specifies the number of cores available on the node

The following directives are optional but are commonly used:

Optional Directives
Directive Description
-l select=#:ncpus=128:mpiprocs=#:Nodetype A variant of the required "select" directive above. The "mpiprocs" and "Nodetype" options are optional.
mpiprocs specifies the number of processes per node.
Nodetype specifies the type of node (see Node Type Directives above)
-N Job_Name Name of the job
-e File_Name Redirect standard error to the named file
-o File_Name Redirect standard output to the named file
-j oe Merge standard error and standard output into standard output
-l application=Application_Name Identify the application being used; a brief example follows this table. See $SAMPLES_HOME/Application_Name/application_names on Warhawk for the list of valid names
-I Request an interactive batch shell
-V Export all environment variables to the job
-v Variable_List Export specific environment variables to the job
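For example, to identify a run whose application is not on the approved list:

#PBS -l application=other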

A more complete listing of batch resource directives is available in the Warhawk PBS Guide.

6.6. Interactive Batch Sessions

An interactive batch session allows you to run interactively (in a command shell) on a compute node after waiting in the batch queue.

To use the interactive batch environment, you must first acquire an interactive batch shell. This is done by adding the -I option to your qsub command. For example: qsub your_pbs_options -I -X

The PBS options for your job are described in Batch Resource Directives above. The -X option enables X-Windows access, so it may be omitted if your interactive job does not use a GUI.
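For instance, a one-hour interactive session with X-Windows support on a single standard node in the debug queue could be requested as follows (the project ID is a placeholder):

qsub -l select=1:ncpus=128:mpiprocs=128 -A Project_ID -q debug -l walltime=01:00:00 -I -X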

Your interactive batch session will be scheduled just as normal batch jobs are scheduled, so depending on the other jobs in the queue, it may take some time to start. Once your interactive batch shell starts, you will be logged into the first compute node of those assigned to your job. At this point, you can run or debug interactive applications, execute job scripts, post-process data, etc. You can launch parallel applications on your assigned compute nodes by using an MPI or other parallel launch command.

The HPC Interactive Environment (HIE) provides an HIE queue that is specifically for interactive jobs. It offers longer job times and has nodes reserved only for HIE, so queue wait times are sometimes much shorter. However, HIE usually has limitations, such as only allowing the use of a single node at a time. See the HIE User Guide for more information.

6.7. Launch Commands

There are different commands for launching parallel executables, including MPI, from within a batch job depending on which MPI implementation or other parallel library your code uses. See the Programming Models section for more information on launching executables within a batch session.
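For example, the sample scripts in the next section launch MPI executables with mpiexec; a minimal launch of a 256-rank program (the executable name is a placeholder) looks like:

mpiexec -n 256 ./my_mpi_program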

6.8. Sample Scripts

The following example is a good starting template for a batch script to run a serial job for one hour:

#!/bin/bash
# The line above specifies the shell to use for parsing the script.
#
# Specify name of the job                   (Optional Directive)
#PBS -N serialjob
#
# Append std output to file serialjob.out   (Optional Directive)
#PBS -o serialjob.out
#
# Append std error to file serialjob.err    (Optional Directive)
#PBS -e serialjob.err
#
# Specify Project ID to be charged          (Required Directive)
#PBS -A Project_ID
#
# Request wall clock time of 1 hour         (Required Directive)
#PBS -l walltime=01:00:00
#
# Specify queue name                        (Required Directive)
#PBS -q standard
#
# Specify the number of cores               (Required Directive)
#PBS -l select=1:ncpus=128:mpiprocs=1
#
# Change to the specified directory, in this case, the user's work directory
cd $WORKDIR
#
# Execute the serial executable on 1 core
./serial_executable
# End of batch job

The first few lines give the job a name and tell PBS to save the standard output and error output to the given files. Skipping ahead, we estimate the run time to be about one hour, which we know is acceptable for the standard batch queue. We need only one core in total, so we request a single node with one process (mpiprocs=1). The resource allocation is still one full 128-core node for exclusive use by the job.

Important! Except for jobs in the transfer queue, which use shared nodes, jobs on standard nodes are charged for full 128-core nodes, even if you do not use all cores on the node.

The following example is a good starting template for a batch script to run a parallel (MPI) job for two hours:

#!/bin/bash
#
## Required PBS Directives --------------------------------------
#PBS -A Project_ID
#PBS -q standard
#PBS -l select=2:ncpus=128:mpiprocs=128
#PBS -l walltime=02:00:00
#
## Optional PBS Directives --------------------------------------
#PBS -N Test_Run_1
#PBS -j oe
#PBS -V
#
## Execution Block ----------------------------------------------
# Environment Setup
# Get sequence number of unique job identifier
JOBID=`echo $PBS_JOBID | cut -d '.' -f 1`
# create and cd to job-specific directory in your personal directory
# in the scratch file system ($WORKDIR/$JOBID)
mkdir $WORKDIR/$JOBID
cd $WORKDIR/$JOBID
#
# Launching
# copy executable from $HOME and execute it with a .out output file
cp $HOME/my_mpi_program .
mpiexec -n 256 ./my_mpi_program > my_mpi_program.out
#
# Don't forget to archive and clean up your results (see the AFRL DSRC Archive Guide for details)

We estimate the run time to be about two hours, which we know is acceptable for the standard batch queue. The optional PBS lines tell PBS to combine the standard output and error output, give the job a name, and import all environment variables. This job requests two 128-core nodes with 128 MPI processes per node, for a total of 256 processes. The default value for the number of processes per node is 128.

A common concern for MPI users is the need for more memory for each process. By default, one MPI process is started on each core of a node. This means on Warhawk standard nodes, the available memory on the node is split 128 ways. To allow an individual process to use more of the node's memory, you need to start fewer processes on each node. To do this, you must request more nodes from PBS but run on fewer cores on each. For example, the following select statement requests four nodes with 128 cores per node, but it only uses 16 of those cores for MPI processes on each node:

#!/bin/bash
#
#### Starts 64 MPI processes; only 16 on each node
#PBS -l select=4:ncpus=128:mpiprocs=16
#PBS -A Project_ID
#PBS -q standard
#PBS -l walltime=02:00:00
#
## execute on 4 nodes, total of 64 MPI processes across all - 16 on each node.
mpiexec -n 64 ./a.out
#
# Don't forget to archive and clean up your results (see the AFRL DSRC Archive Guide for details)

Further sample scripts can be found in the Warhawk PBS Guide and in the Sample Code Repository ($SAMPLES_HOME) on the system. There is also an extensive discussion in the AFRL DSRC Archive Guide of sample scripts to perform data staging in the transfer queue using chained batch scripts to archive and clean up your work directory results files.

6.9. PBS Commands

The following commands provide the basic functionality for using the PBS batch system:

Submit jobs for batch processing: qsub [qsub_options] my_job_script

Check the status of submitted jobs:

qstat JOBID             ##check one job
qstat -u my_user_name   ##check all of your jobs

Kill queued or running jobs: qdel JOBID

A more complete list of PBS commands is available in the Warhawk PBS Guide.

6.10. Determining Time Remaining in a Batch Job

Knowing the time remaining before the batch system will kill a job lets you write restart files or even prepare input for the next job submission. However, adding such a capability to existing source code requires knowing how to query the batch system and how to parse the resulting output to determine the amount of remaining time.

The DoD HPCMP allocated systems now have the library, WLM_TIME, as an easy way to provide the remaining time in the batch job to C, C++, and Fortran programs. The library can be added to your job using the following:

For C:

#include <wlm_time.h>
void wlm_time_left(long int *seconds_left)

For C++:

extern "C" {
#include <wlm_time.h>
}
void wlm_time_left(long int *seconds_left)

For Fortran:

SUBROUTINE WLM_TIME_LEFT(seconds_left)
INTEGER seconds_left

For simplicity, wall-clock-time remaining is returned as an integer value of seconds.

To simplify usage, a module file defines the process environment, and a pkg-config metadata file defines the necessary compiler linker options:

For C:

module load wlm_time
$(CC) ctest.c `pkg-config --cflags --libs wlm_time`

For C++:

module load wlm_time
$(CXX) Ctest.C `pkg-config --cflags --libs wlm_time`

For Fortran:

module load wlm_time
$(F90) test.f90 `pkg-config --cflags-only-I --libs wlm_time`

WLM_TIME currently works with PBS. The developers expect WLM_TIME to continue to provide a uniform interface encapsulating the underlying aspects of the workload management system.

6.11. Advance Reservations

A subset of Warhawk's nodes has been set aside for use as part of the Advance Reservation Service (ARS). The ARS allows users to reserve a user-designated number of nodes for a specified number of hours starting at a specific date/time. This service enables users to execute interactive or other time-critical jobs within the batch system environment. The ARS is accessible on the web at https://reservation.hpc.mil. Authentication is required. For more information, see the ARS User Guide.

7. Software Resources

7.1. Application Software

A complete list of the software versions installed on Warhawk can be found on the software page. The general rule is that the two latest versions of all COTS software packages are maintained on our systems. For convenience, modules are also available for most COTS software packages. The following are other available software-related services:

  • The Software License Buffer provides access to commercial software licenses on compute nodes. See the SLB User Guide.
  • Singularity is the approved software for running and building containers. Containers allow you to deploy or use applications with all their software dependencies packaged together. See the Introduction to Singularity.
  • The HPCMP Portal is a web interface for several graphics and web-based applications. It also includes virtual desktops for most HPC systems. See the HPC Portal Page.
  • The Secure Remote Desktop (SRD) is a client-based VNC virtual desktop application that supports graphical acceleration on GPU nodes for intensive visualization. See the SRD User Guide.
  • GitLab is a web-based source code management platform. See the GitLab User Guide.

7.2. Useful Utilities

The following utilities are available on Warhawk. For command-line syntax and examples of usage, please see each utility's online man page.

Baseline Configuration and Other Useful Commands and Tools
Name Description
archive Perform basic file-handling operations on the archive system
bcmodule An enhanced version of the standard module command
check_license Check the status of licenses for HPCMP shared applications
cqstat Display information about running and pending batch jobs
mpscp High-performance remote file copy
node_use Display the amount of free and used memory for login nodes
qflag Report a problem with a batch job to the HPC Help Desk
qpeek Display spooled stdout and stderr for an executing batch job
qview Display information about batch jobs and queues
scampi Transfer data between systems using multiple streams and sockets
show_queues Report current batch queue status, usage, and limits
show_storage Display disk/file usage and quota information
show_usage Display CPU allocation and usage by subproject
tube Copy files to a remote system using Kerberos host authentication

7.3. Sample Code Repository

The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area and is automatically defined in your login environment. Below is a listing of the examples provided in the Sample Code Repository on Warhawk.
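To browse the available examples from a login node, you can list the repository directly:

ls $SAMPLES_HOME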

Sample Code Repository on Warhawk
Application_Name
Use of the application name resource.
Sub-Directory - Description
application_names - README and list of valid strings for application names intended for use in every PBS script preamble. The HPCMP encourages applications not specifically named in the list to be denoted as "other".
Applications
Application-specific examples; interactive job submit scripts; use of the application name resource; software license use.
abaqus - Instructions for using the abaqus automatic batch script generator as well as a sample input deck and sample script for running abaqus jobs. The abaqus module must be loaded.
accelrys - Instructions for using the accelrys automatic batch script generator as well as a sample input deck and sample script for running accelrys jobs. The accelrys module must be loaded.
ale3d - Instructions, sample PBS job script, and sample input data file for executing the ALE3D application on Warhawk. The ale3d module needs to be loaded prior to use.
ansys - Instructions for using the ansys automatic batch script generator as well as a sample input deck and sample script for running ansys jobs. The ansys module must be loaded.
cart3d - A series of examples to learn how to use the CART3D application. Follow the instructions in README.txt. Sample PBS job scripts are provided as well as sample input data files. Membership in the Unix group wpcart3d is required.
cfd++ - Instructions for using the cfd++ automatic batch script generator as well as sample input data and a sample script for running cfd++ jobs. The cfd++ module must be loaded.
CFX - Instructions for using the CFX automatic batch script generator as well as a sample input deck and sample script for running CFX jobs. The cfx module must be loaded.
cobalt - Instructions for using the cobalt automatic batch script generator. Sample job script for the COBALT application. Tar files for two test cases. The cobalt module must be loaded.
cth - Instructions and PBS submission script to execute CTH jobs. The cth module should be loaded prior to use.
espresso - Instructions and PBS submission script to execute espresso jobs. The espresso module should be loaded prior to use.
fieldview - Instructions and PBS submission script to execute fieldview jobs. The fieldview module should be loaded prior to use.
fluent - Instructions for using the fluent automatic batch script generator. Sample job script for the fluent application. The fluent module must be loaded.
fun3d - README file with instructions on how to execute the FUN3D application. A PBS job script and example input files in a tar file are provided. Membership in the Unix group fun3d is required to use the tar file.
gamess - Instructions for using the gamess automatic batch script generator, input data file, and sample PBS job script for the gamess application. The gamess module should be loaded.
gasp - Brief instructions, two sample PBS job scripts, and two sample input data files to execute the GASP application. The GASP module must be loaded to run the application. Membership in the Unix group wpgasp is required to be able to use the input data tar files.
gaspex - PBS job script and sample data archive for using the gaspex application. The gaspex module should be loaded.
gaussian - Instructions for executing the gaussian PBS job script generation tool to run gaussian on Warhawk. Also includes a gaussian PBS job script and input data file.
lsdyna - Instructions for using the ls-dyna automatic batch script generator. Sample job script for the LS-DYNA application. The lsdyna module should be loaded.
matlab - Instructions to execute an interactive MATLAB job and a .m script to execute in it. A matlab module should be loaded.
ncar - Instructions on how to use the NCAR Graphics tool. The appropriate ncar module must be loaded beforehand.
openfoam - Sample PBS job scripts to execute the OPENFOAM application. Instructions on using OPENFOAM are in the scripts. In some cases, the openfoam module must be loaded.
sierra - Instructions on how to run the SIERRA application. Membership in the sierra group is required to use sierra.
sqlite3 - Instructions on how to use the SQLITE3 database tool.
starccm+ - Instructions to use "submit_starccm" to create a PBS job script for starccm+, plus input data files and PBS job scripts that have already been generated. One of the starccm modules should be loaded prior to use.
singularity - An OS-level virtualization ("container") technology. README presenting one or more container examples.
subversion - README presenting instructions and other information on how to use the SUBVERSION tool to organize software versions.
vasp - Sample input file and PBS job script for the VASP application.
Data_Management
Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.
MPSCP_to_ARCHIVE - Instructions and sample scripts on using the mpscp utility for transferring files to/from the file archive.
Lustre_FS_Stripes - Instructions and examples for striping large files on the Lustre file systems.
Postprocess_Example - Example showing how to submit a post-processing script at the end of a parallel computation job to do such things as tar data and move it from temporary storage to archive storage.
Transfer_Queue_Example - PBS batch script examples for data transfer using the transfer queue.
Transfer_Queue_with_Archive_Commands - Example and instructions on the recommended best practice to stage data from mass storage using the transfer queue prior to job execution, process using that data, then pass output data back to mass storage using the transfer queue again.
Parallel_Environment
MPI, OpenMP, and hybrid examples; single-core jobs; large memory jobs; running multiple applications within a single batch job.
Hello_World_Example - Examples of hello world codes using MPI, OpenMP, and a hybrid code using MPI with OpenMP threads.
Hybrid_Example - Sample Fortran and C codes and makefile for compiling hybrid MPI/OpenMP codes, and sample scripts for running hybrid MPI/OpenMP jobs.
Large_Memory_Jobs - Example PBS script and README describing the correct queue for jobs to execute on the large-memory nodes on Warhawk.
mpic - Instructions for running multiple serial processes on multiple cores.
Mix_Serial_with_Parallel - Demonstration of how to use $PBS_NODEFILE to set several host lists to run serial tasks and parallel tasks on different nodes in the same PBS job. Scripts are provided to demonstrate techniques using both SGI's MPT and Intel's IMPI.
MPI_Example - Sample code and makefile for compiling MPI code to run on Warhawk, and sample script for running MPI jobs.
Multiple_exec_one_communicator - Example using three binaries that shows how to compile and execute a heterogeneous application on Warhawk. Scripts are provided for using mpirun in SGI's MPT and mpirun in Intel's IMPI.
Multiple_Parallel - Demonstration of how to set up and run multiple MPI tasks on different nodes in the same PBS job. Samples are provided demonstrating use of mpirun in SGI's MPT and mpirun in Intel's IMPI.
OpenMP_Example - Sample C code and makefile for compiling OpenMP code to run on Warhawk, and sample scripts for running OpenMP jobs.
Serial_Processing_1 - C and Fortran serial program examples and sample scripts for running multiple instances of a serial program across nodes using ssh.
Serial_Processing_2 - Fortran serial program example and sample scripts for running multiple serial compute tasks simultaneously across the cores of a compute node.
Programming
Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.
BLACS_Example - Sample BLACS Fortran program, compile script, and PBS submission script. The BLACS are from Netlib's ScaLAPACK library in $PET_HOME.
Core_Files - Instructions and source code for viewing core files with TotalView and gdb. This sample uses the GNU compilers.
ddt - Instructions and sample programs for using the DDT debugger.
Endian_Conversion - Text file discussing how to use binary data generated on a non-X86_64 platform.
Intel_IMPI_Example - Demonstration of how to manipulate the modules and compile and execute code using Intel's IMPI.
Memory_Usage - Presents a routine callable from Fortran or C used to determine how much memory a process is using.
MKL_BLACS_Example - Sample BLACS Fortran program, compile script, and PBS submission script. The BLACS are from Intel's Math Kernel Library (MKL).
MKL-ScaLAPACK_Example - Sample ScaLAPACK Fortran program, compile script, and PBS job script. The ScaLAPACK solver, BLACS communication, and supporting LAPACK and BLAS routines are all from Intel's MKL.
MPI_Compilation - Discussion of how to compile MPI codes on Warhawk using the Intel and GNU compilers and linking in Intel's IMPI or SGI's MPT. Includes notes on support for each MPI by the available compilers.
Open_Files_Limit - Discussion and demonstration of the maximum number of simultaneously open files a single process may have.
ScaLAPACK_Example - Sample ScaLAPACK Fortran program, compile script, and PBS submission scripts. The linear solver routines are from Netlib's ScaLAPACK library in $PET_HOME. The LAPACK routines are from Netlib's LAPACK library in $PET_HOME, and the BLAS are from Netlib's LAPACK library in $COST_HOME.
SO_Compile - Sample shared object compilation information, including a demonstration of how to compile and assemble a dynamically loaded (i.e., shared) library.
Timers_Fortran - Serial timers using Fortran intrinsics for f77 and f90/95.
TotalView_Example - README, compilation script, sample code, and instructions on how to set up and execute an interactive TotalView debugging session on Warhawk.
User_Environment
Use of modules; customizing the login environment; use of common environment variables to facilitate portability of work between systems.
modules - Sample README, module description file, and module template for creation of software modules on Warhawk.
Module_Swap_Example - Batch script demonstrating use of several module commands to choose specific modules within a PBS job.
Workload_Management
Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring.
Batchscript_Example - Simple PBS batch script showing all required preamble statements and a few optional statements. A more advanced batch script shows more optional statements and a few ways to set up PBS jobs. Includes a description of the system hardware. Process placement is described in a subdirectory.
Core_Info_Example - Description and C language routine, suitable for Fortran and C, showing how to determine the node and core placement information for MPI, OpenMP, and hybrid MPI/OpenMP PBS jobs.
Hybrid_Example - Sample Fortran and C codes and makefile for compiling hybrid MPI/OpenMP codes, and sample scripts for running hybrid MPI/OpenMP jobs.
Interactive_Example - C and Fortran code samples and scripts for running interactive jobs on Warhawk. The sample code is an MPI "Hello World".
Job_Array_Example - Sample code to generate a binary and data, and a job script for using job arrays.
Job_Dependencies_Example - Example code, scripts, and instructions demonstrating how to set up a job dependency for jobs that depend on how one or more other jobs execute, or that perform some action one or more other jobs require before execution.
MPI_Example - Sample code and makefile for compiling MPI code to run on Warhawk, and sample script for running MPI jobs.
OpenMP_Example - Sample C code and makefile for compiling OpenMP code to run on Warhawk, and sample scripts for running OpenMP jobs.
PE_Pinning_Example - Examples demonstrating how to place and/or pin job processing elements, either MPI processes or OpenMP threads, to cores or groups of cores to facilitate more efficient processing and prevent separation of an execution thread from its data and instructions. All combinations of compiler (Intel or GNU) and MPI (SGI MPT or Intel IMPI) are discussed.
Transfer_Queue_Example - PBS batch script examples for data transfer using the transfer queue.

8. Links to Vendor Documentation

8.1. HPE Cray Links

HPE Home: https://www.hpe.com

8.2. SUSE Links

SUSE Home: https://suse.com

SUSE Linux Enterprise Server: https://suse.com/products/server

8.3. GNU Links

GNU Home: https://www.gnu.org

GNU Compiler: https://gcc.gnu.org/onlinedocs

8.4. Intel Links

Intel Home: https://www.intel.com

Intel Documentation: https://software.intel.com/en-us/intel-software-technical-documentation

Intel Compiler List: https://software.intel.com/en-us/intel-compilers

8.5. AOCC Links

AOCC Documentation: https://www.amd.com/en/developer/aocc.html

8.6. NVIDIA Links

NVIDIA Documentation: https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html

8.7. Debugger Links

Forge Documentation: https://www.linaroforge.com/documentation

9. Glossary

Batch Job: a single request for a set of compute nodes along with a set of tasks (usually in the form of a script) to perform on those nodes

Batch-scheduled: users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available

Compute Node: a node that performs computational tasks for the user. There may be multiple types of compute nodes for specialized purposes.

Distributed Memory Model: a programming methodology where memory is distributed across multiple nodes, giving processes on each node faster direct access to local memory, but requiring slower techniques such as message passing to access memory on other nodes

Interconnect: a specialized, very high-speed network that connects the nodes of an HPC system together. It is typically used for application inter-process communication (e.g., message passing) and I/O traffic.

Kerberos: authentication and encryption software required by the HPCMP to access HPC system login nodes and other resources. See Kerberos & Authentication.

Login Node: a node that serves as the user's entry point into an HPC system

Node: an individual server in a cluster or collection of servers of an HPC system

Parallel File System: a specialized, high-speed storage system for an HPC system capable of scaling up to higher speeds for larger HPC workloads

Shared Memory Model: a programming methodology where a set of processors (such as the cores within one node) have direct access to a shared pool of memory