SCOUT User Guide
Table of Contents
- 1. Introduction
- 1.1. Document Scope and Assumptions
- 1.2. Policies to Review
- 1.3. Obtaining an Account
- 1.4. Requesting Assistance
- 2. System Configuration
- 2.1. System Summary
- 2.2. Processors
- 2.3. Memory
- 2.4. Operating System
- 2.5. File Systems
- 2.6. Peak Performance
- 3. Accessing the System
- 3.1. Kerberos
- 3.2. Logging In
- 3.3. File Transfers
- 4. User Environment
- 4.1. User Directories
- 4.2. Shells
- 4.3. Environment Variables
- 4.4. Modules
- 4.5. Archive Usage
- 4.6. Login Files
- 5. Program Development
- 5.1. Programming Models
- 5.2. Available Compilers
- 5.3. Relevant Modules
- 5.4. Libraries
- 5.5. Debuggers
- 5.6. Code Profiling and Optimization
- 6. Batch Scheduling
- 6.1. Scheduler
- 6.2. Queue Information
- 6.3. Interactive Logins
- 6.4. Interactive Batch Sessions
- 6.5. Batch Request Submission
- 6.6. Launch Commands
- 6.7. Sample Script
- 6.8. LSF Commands
- 7. Software Resources
- 7.1. Application Software
- 7.2. Useful Utilities
- 7.3. Sample Code Repository
- 8. Links to Vendor Documentation
- 8.1. IBM Links
- 8.2. Red Hat Link
- 8.3. GNU Links
- 8.4. PGI Links
- 9. Glossary
1. Introduction
1.1. Document Scope and Assumptions
This document provides an overview and introduction to the use of the IBM Power9 system (SCOUT) located at the ARL DSRC, along with a description of the specific computing environment on SCOUT. The intent of this guide is to provide information that enables the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:
- Use of the Linux operating system
- Use of an editor (e.g., vi or emacs)
- Remote usage of computer systems via network
- A selected programming language and its related tools and libraries
1.2. Policies to Review
Users are expected to be aware of the following policies for working on SCOUT.
1.2.1. Login Node Abuse Policy
Memory or CPU intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small applications requiring a minimal amount of runtime and memory are allowed on the login nodes. Any job running on the login nodes that affects their overall interactive performance may be unilaterally terminated.
1.2.2. Workspace Purge Policy
The /work1 directory is subject to a 21-day purge policy. A system "scrubber" monitors scratch space utilization, and if available space becomes low, files not accessed within 21 days are subject to removal, although files may remain longer if the space permits. There are no exceptions to this policy.
Note: If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, you will not be notified prior to deletion. You are responsible for monitoring your workspace to prevent data loss.
1.3. Obtaining an Account
The process of getting an account on the HPC systems at any of the DSRCs begins with getting an account on the HPCMP Portal to the Information Environment, commonly called a "pIE User Account." If you do not yet have a pIE User Account, please visit the Obtaining An Account page and follow the instructions there. Once you have an active pIE User Account, visit our Services section for instructions on how to request accounts on the ARL DSRC HPC systems. If you need assistance with any part of this process, please contact the HPC Help Desk at accounts@helpdesk.hpc.mil.
1.4. Requesting Assistance
The HPC Help Desk is available to help users with unclassified problems, issues, or questions. Analysts are on duty 8:00 a.m. - 8:00 p.m. Eastern, Monday - Friday (excluding Federal holidays).
- Service Portal: https://helpdesk.hpc.mil/hpc
- E-mail: help@helpdesk.hpc.mil
- Phone: 1-877-222-2039 or (937) 255-0679
For more information about requesting assistance, see the HPC Help Desk dropdown.
For after-hours support and for support services not provided by the HPC Help Desk, you can contact the ARL DSRC in any of the following ways:
- E-mail: dsrchelp@arl.army.mil
- Phone: 1-800-ARL-1552 (1-800-275-1552) or (410) 278-1700
- Fax: (410) 278-5075
- U.S. Mail:
U.S. Army Research Laboratory
ATTN: FCDD-RLC-S
6791 Aberdeen Blvd.,
Aberdeen Proving Ground, MD 21005
2. System Configuration
2.1. System Summary
SCOUT is an IBM Power 9 system with 22 nodes for machine learning training workloads, each with two IBM Power 9 processors, 690 GB of system memory, 6 NVIDIA V100 GPUs with 32 GB of high-bandwidth memory each, and 12 TB of local solid-state storage. SCOUT also has 128 GPGPU-accelerated nodes for inferencing workloads, each with two IBM Power 9 processors, 4 NVIDIA T4 GPUs, 246 GB of system memory, and 2.1 TB of local solid-state storage. There are also 2 visualization nodes, each with two IBM Power 9 processors, 502 GB of system memory, 2 NVIDIA V100 GPUs, and 5.9 TB of local solid-state storage.
SCOUT is intended as a batch-scheduled HPC system. Its login nodes are not to be used for large computational work (memory, I/O, long executions). All executions that require large amounts of system resources must be sent to the training or inference nodes by batch job submission.
Node Configuration | Login | Training | Inference | Visualization |
---|---|---|---|---|
Total Nodes | 4 | 22 | 128 | 2 |
Processor | IBM POWER9 | IBM POWER9 | IBM POWER9 | IBM POWER9 |
Processor Speed | 2.55 GHz | 2.55 GHz | 2.55 GHz | 2.55 GHz |
Sockets / Node | 2 | 2 | 2 | 2 |
Cores / Node | 40 | 40 | 40 | 40 |
Total CPU Cores | 160 | 880 | 5,120 | 80 |
Usable Memory / Node | 502 GB | 690 GB | 246 GB | 502 GB |
Accelerators / Node | None | 6 | 4 | 2 |
Accelerator | N/A | NVIDIA V100 PCIe 3 | NVIDIA T4 PCIe 3 | NVIDIA V100 PCIe 3 |
Memory / Accelerator | N/A | 32 GB | 16 GB | 16 GB |
Storage on Node | 1.4 TB PCIe | 12 TB PCIe | 2.1 TB PCIe | 5.9 TB PCIe |
Interconnect | InfiniBand EDR | InfiniBand EDR | InfiniBand EDR | InfiniBand EDR |
Operating System | RHEL | RHEL | RHEL | RHEL |
Path | Formatted Capacity | File System Type | Storage Type | User Quota | Minimum File Retention |
---|---|---|---|---|---|
/p/home ($HOME) | 132 TB | GPFS | HDD | None | None |
/p/work1 ($WORKDIR) | 1.2 PB | GPFS | HDD | None | 21 Days |
/p/scratch1 ($LSCRATCH) | Varies by node type | XFS | Varies by node type | None | None |
/p/cwfs ($CENTER) | 3.3 PB | GPFS | HDD | None | 120 Days |
/p/app ($PROJECTS_HOME) | 90 TB | GPFS | HDD | None | None |
2.2. Processors
SCOUT uses 2.55-GHz IBM Power9 processors on its nodes. There are two processors per node, each with 20 cores, for a total of 40 cores per node. In addition, these processors have a level 3 (L3) cache of 120 MB.
2.3. Memory
SCOUT uses both shared and distributed memory models. Memory is shared among all the cores on a node but is not shared among the nodes across the cluster.
Each login node contains 512 GB of main memory. All memory and cores on the node are shared among all users who are logged in. Therefore, users should not use excessive amounts of memory at any one time.
Each of the 22 training compute nodes has six NVIDIA V100 GPUs with 32 GB of memory each and 512 GB of user-accessible shared memory. Each of the 128 inference nodes has four NVIDIA T4 GPUs with 16 GB of memory each and 256 GB of user-accessible shared memory. Each of the two visualization nodes has two NVIDIA V100 GPUs with 32 GB of memory each and 512 GB of user-accessible shared memory.
2.4. Operating System
The operating system on SCOUT is Red Hat Linux.
2.5. File Systems
SCOUT has the following file systems available for user storage:
2.5.1. /p/home
This file system is locally mounted from SCOUT's GPFS file system. It has a formatted capacity of 155 TB. All users have a home directory located on this file system which can be referenced by the environment variable $HOME.
2.5.2. /p/scratch1
SCOUT has local temporary storage space (/p/scratch1) available on each node as follows: Login nodes: 1.4 TB SSD, Training nodes: 12 TB NVMe, Inference nodes: 2.1 TB SSD, and Visualization nodes: 5.9 TB NVMe. All users can access this space, and it can be referenced by the environment variable $LSCRATCH. Warning: this space is short-term temporary storage and is cleared as needed by the system.
2.5.3. /p/work1
This directory comprises SCOUT's user scratch file area and is a locally mounted GPFS file system. /p/work1 has a formatted capacity of 1.045 PB. All users have a work directory located on /p/work1 which can be referenced by the environment variable $WORKDIR.
2.5.4. /p/app
All center-managed COTS packages are stored in /p/app. This file system is locally mounted from SCOUT's GPFS file system. It has a formatted capacity of 90 TB and can be referenced by the environment variable $CSI_HOME. In addition, users may request space in this area under /p/app/unsupported to store user-managed software packages they wish to make available to other owner-designated users. This area can be referenced by the environment variable $PROJECTS_HOME. To have space allocated in /p/app/unsupported, submit a request to the ARL DSRC Help Desk. Send e-mail to dsrchelp@arl.army.mil or call 1-800-ARL-1552 (1-800-275-1552) or (410) 278-1700.
2.5.5. /archive
This NFS-mounted file system is accessible from the login nodes on SCOUT. Files in this file system are subject to migration to tape, and access may be slower due to the overhead of retrieving files from tape. The disk portion of the file system has a formatted capacity of 16 TB, is backed by a petascale archival tape storage system, and is automatically backed up. Users should migrate all large input and output files to this area for long-term storage. Users should also migrate all important smaller files from their home directory in /p/home to this directory for long-term storage. All users have a directory located on this file system which can be referenced by the environment variable $ARCHIVE_HOME.
2.5.6. /tmp or /var/tmp
Never use /tmp or /var/tmp for temporary storage! These directories are not intended for temporary storage of user data, and abuse of these directories could adversely affect the entire system.
2.5.7. /p/cwfs
This path is directed to the Center-Wide File System (CWFS) which is meant for short-term storage (no longer than 120 days). All users have a directory defined in this file system. The environment variable for this is $CENTER. This is accessible from the unclassified HPC system login nodes. The CWFS has a formatted capacity of 3.3 PB and is managed by IBM's Spectrum Scale (formerly GPFS).
2.6. Peak Performance
SCOUT is rated at 1.2 peak PFLOPS.
3. Accessing the System
3.1. Kerberos
A Kerberos client kit must be installed on your desktop system to get a Kerberos ticket. Kerberos is a network authentication tool that provides secure communication by using secret cryptographic keys. Only users with a valid HPCMP Kerberos authentication can gain access to SCOUT. More information about installing Kerberos clients on your desktop can be found at the Kerberos & Authentication page.
3.2. Logging In
The system host name for SCOUT is scout.arl.hpc.mil, which will redirect the user to one of four login nodes. Hostnames and IP addresses to these nodes are available upon request from the HPC Help Desk.
The preferred way to login to SCOUT is via ssh, as follows:
% ssh username@scout.arl.hpc.mil
3.3. File Transfers
File transfers to ARL DSRC systems (except for those to the local archive server) must be performed using Kerberized versions of the following tools: scp, mpscp, sftp, or scampi.
The command below uses secure copy (scp) to copy a single local file into a destination directory on a SCOUT login node. The mpscp command is similar to the scp command but has a different underlying means of data transfer and may enable greater transfer rates. The mpscp command has the same syntax as scp.
% scp local_file username@scout.arl.hpc.mil:/target_dir
Both scp and mpscp can be used to send multiple files. This command transfers all files with the .txt extension to the same destination directory.
% scp *.txt username@scout.arl.hpc.mil:/target_dir
The example below uses the secure file transfer protocol (sftp) to connect to SCOUT, then uses the sftp cd and put commands to change to the destination directory and copy a local file there. The sftp quit command ends the sftp session. Use the sftp help command to see a list of all sftp commands.
% sftp username@scout.arl.hpc.mil
sftp> cd target_dir
sftp> put local_file
sftp> quit
Windows users may use a graphical file transfer protocol (ftp) client such as FileZilla.
4. User Environment
4.1. User Directories
4.1.1. Home Directory
When you log on to SCOUT, you will be placed in your home directory, /p/home/username. The environment variable $HOME is automatically set for you and refers to this directory. $HOME is visible to both the login and compute nodes and may be used to store small user files. However, it has limited capacity and is not backed up daily; therefore, it should not be used for long-term storage.
4.1.2. Work Directory
The path for your working directory on SCOUT's scratch file system is /p/work1/username. The environment variable $WORKDIR is automatically set for you and refers to this directory. $WORKDIR is visible to both the login and compute nodes and should be used for temporary storage of active data related to your batch jobs.
Note: Although the $WORKDIR environment variable is automatically set for you, the directory itself may not be created. If not, you can create your $WORKDIR directory as follows:
mkdir $WORKDIR
The scratch file system provides 1.2 PB of formatted disk space. This space is not backed up and is subject to a purge policy.
REMEMBER: This file system is considered volatile working space. You are responsible for archiving any data you wish to preserve. To prevent your data from being "scrubbed," you should copy files you want to keep into your /archive directory (see below) for long-term storage.
4.1.3. Archive Directory
In addition to $HOME and $WORKDIR, each user is also given a directory on the /archive file system. This file system is visible to the login nodes (not the compute nodes) and is the preferred location for long-term file storage. All users have an area defined in /archive for personal use, which can be accessed using the $ARCHIVE_HOME environment variable. We recommend you keep large computational files and more frequently accessed files in the $ARCHIVE_HOME directory. We also recommend any important files located in $HOME should be copied into $ARCHIVE_HOME as well.
Because the compute nodes are unable to see $ARCHIVE_HOME, you need to pre-stage your input files to your $WORKDIR from a login node before submitting jobs. After jobs complete, you need to transfer output files from $WORKDIR to $ARCHIVE_HOME from a login node. This may be done manually or through the transfer queue, which executes serial jobs on login nodes.
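As a hedged sketch of that workflow, the lines below outline a transfer-queue job that archives results after a compute job finishes; the job name, wall time, Project ID, directory, and file names are placeholders, and the Transfer_Example and Transfer_Queue_with_Archive_Commands entries in the Sample Code Repository show complete, supported versions.
#!/bin/csh
#BSUB -J archive_results
#BSUB -q transfer
#BSUB -W 0:30
#BSUB -P XXXXXXXXXXXXX
# Bundle the job's output (names are placeholders) and copy it to the archive area.
cd ${WORKDIR}/my_job_dir
tar -cf results.tar out.dat
cp results.tar ${ARCHIVE_HOME}/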
4.1.4. Center-Wide File System Directory
The Center-Wide File System (CWFS) provides file storage that is accessible from SCOUT's login nodes and from the HPC Portal. The CWFS allows for file transfers and other file and directory operations from SCOUT using standard Linux commands. Each user has their own directory in the CWFS. The name of your CWFS directory may vary between machines and between centers, but the environment variable $CENTER will always refer to this directory.
The example below shows how to copy a file from your work directory on SCOUT to the CWFS ($CENTER).
While logged into SCOUT, copy your file from your work directory to the CWFS.
% cp $WORKDIR/filename $CENTER
4.2. Shells
The following shells are available on SCOUT: csh, bash, ksh, tcsh, zsh, and sh. To change your default shell, go to the Portal to the Information Environment (pIE). First select OpenID Login, then select either CAC Login or YubiKey. After logging in, select the User Information Environment tab. Click "View/Modify personal account information" and scroll down to "Preferred Shell", where you can change your preferred shell via a drop-down menu. After selecting a preferred shell, be sure to click Save Changes, at the bottom of the information section. Within 24 hours your preferred shell will become your default shell on SCOUT and all other clusters where you have an account.
4.3. Environment Variables
A number of environment variables are provided by default on all HPCMP HPC systems. We encourage you to use these variables in your scripts where possible. Doing so helps to simplify your scripts and reduce portability issues if you ever need to run those scripts on other systems. The following environment variables are common to both the login and batch environments:
Variable | Description |
---|---|
$ARCHIVE_HOME | Your directory on the archive server. |
$ARCHIVE_HOST | The host name of the archive server. |
$BC_HOST | The generic (not node specific) name of the system. |
$CC | The currently selected C compiler. This variable is automatically updated when a new compiler environment is loaded. |
$CENTER | Your directory on the Center-Wide File System (CWFS). |
$CSI_HOME | The directory containing the following list of heavily used application packages: ABAQUS, Accelrys, ANSYS, CFD++, EnSight, Fluent, GASP, Gaussian, LS-DYNA, and MATLAB, formerly known as the Consolidated Software Initiative (CSI) list. Other application software may also be installed here by our staff. |
$CXX | The currently selected C++ compiler. This variable is automatically updated when a new compiler environment is loaded. |
$DAAC_HOME | The directory containing DAAC-supported visualization tools: ParaView, VisIt, and EnSight. |
$F77 | The currently selected Fortran 77 compiler. This variable is automatically updated when a new compiler environment is loaded. |
$F90 | The currently selected Fortran 90 compiler. This variable is automatically updated when a new compiler environment is loaded. |
$HOME | Your home directory on the system. |
$JAVA_HOME | The directory containing the default installation of JAVA. |
$KRB5_HOME | The directory containing the Kerberos utilities. |
$PET_HOME | The directory containing the tools formerly installed and maintained by the PET staff. This variable is deprecated and will be removed from the system in the future. Certain tools will be migrated to $COST_HOME, as appropriate. |
$PROJECTS_HOME | A common directory where group-owned and supported applications and codes may be maintained for use by members of a group. Any project may request a group directory under $PROJECTS_HOME. |
$SAMPLES_HOME | The Sample Code Repository. This is a collection of sample scripts and codes provided and maintained by our staff to help users learn to write their own scripts. There are ready-to-use scripts for a variety of applications. |
$WORKDIR | Your work directory on the local temporary file system (i.e., local high-speed disk). |
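As a simple, hedged illustration of how these variables keep scripts portable, the lines below compile and run a small code from the work directory using whichever C compiler is currently loaded; the source file name is a placeholder.
cd $WORKDIR
cp $HOME/my_code.c .
$CC -O2 my_code.c -o my_code.x
./my_code.x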
4.4. Modules
Software modules are a very convenient way to set needed environment variables and include necessary directories in your path so commands for particular applications can be found. We strongly encourage you to use modules. For more information on using modules, see the ARL DSRC Modules Guide.
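For example, a typical module session might look like the following; the version numbers are placeholders, and module avail shows the names actually installed on SCOUT.
module avail                  # list the modules available on SCOUT
module list                   # show the modules currently loaded
module load compiler/pgi/x.x  # load a compiler (pick a version from module avail)
module swap compiler/pgi/x.x compiler/gcc/#.#.#  # switch to a different compiler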
4.5. Archive Usage
Archive storage is provided through the $ARCHIVE_HOME NFS-mounted file system. All users are automatically provided a directory under this file system; however, it is only accessible from the login nodes. Since space in a user's login home area in /p/home is limited, all large data files requiring permanent storage should be placed in $ARCHIVE_HOME. Also, it is recommended that all important smaller files in /p/home for which a user requires long-term access be copied to $ARCHIVE_HOME as well. For more information on using the archive system, see the ARL DSRC Archive Guide.
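For example, from a login node you might copy an important small file from $HOME and a large result file from $WORKDIR into $ARCHIVE_HOME; the file names below are placeholders.
cp $HOME/notes.txt $ARCHIVE_HOME/
cp $WORKDIR/my_job_dir/out.dat $ARCHIVE_HOME/
ls -l $ARCHIVE_HOME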
4.6. Login Files
When an account is created on SCOUT, default .cshrc and/or .profile files are placed in your home directory. These files contain the default setup that configures modules, LSF, and other system defaults. We suggest you place any customizations, such as paths, aliases, or libraries you need to load, in a .cshrc.pers or .profile.pers file for your shell. These files should be sourced at the end of your .cshrc and/or .profile file as necessary. For example:
if (-f $HOME/.cshrc.pers) then
  source $HOME/.cshrc.pers
endif
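A minimal .cshrc.pers might look like the following sketch; the alias, path entry, and module are purely illustrative.
# $HOME/.cshrc.pers -- personal csh settings (illustrative)
alias lt 'ls -ltr'
set path = ($path $HOME/bin)
module load compiler/pgi/x.x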
5. Program Development
5.1. Programming Models
SCOUT supports two programming models: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP). A hybrid (MPI/OpenMP) programming model is also supported. MPI is an example of a message- or data-passing model. OpenMP only uses shared memory on a node by spawning threads. And the hybrid model combines both models.
5.1.1. Message Passing Interface (MPI)
SCOUT has two MPI-3.0 standard library suites: IBM Spectrum MPI and OpenMPI. The modules for these MPI libraries are mpi/spectrum/10.03 and mpi/openmpi/latest.
5.1.2. Open Multi-Processing (OpenMP)
SCOUT supports OpenMP through all its programming environments. OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications. It supports shared-memory multiprocessing programming in C, C++, and Fortran, and consists of a set of compiler directives, library routines, and environment variables that influence compilation and run-time behavior.
5.1.3. Hybrid Processing (MPI/OpenMP)
In hybrid processing, all intranode parallelization is accomplished using OpenMP, while all internode parallelization is accomplished using MPI. Typically, there is one MPI task assigned per node, with the number of OpenMP threads assigned to each node set at the number of cores available on the node.
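As a hedged sketch, a hybrid job spanning two training nodes could run one MPI task per node with 40 OpenMP threads each; the executable name is a placeholder, and the example assumes the batch request places one task per node (for example, via span[ptile=1]).
setenv OMP_NUM_THREADS 40     # one OpenMP thread per core on each node
mpiexec -n 2 ./hybrid_code.x  # one MPI task per node across the 2-node allocation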
5.2. Available Compilers
SCOUT has two compiler suites:
- PGI
- GNU
All versions of MPI share a common base set of compilers available on both the login and compute nodes.
Compiler | PGI | GNU | Serial/Parallel |
---|---|---|---|
C | pgcc | gcc | Serial/Parallel |
C++ | pgc++ | g++ | Serial/Parallel |
Fortran 77 | pgf77 | gfortran | Serial/Parallel |
Fortran 90 | pgf90 | gfortran | Serial/Parallel |
IBM Spectrum MPI codes are built using the above compiler commands with the addition of the -lmpi option on the link line. The following additional compiler wrapper scripts are used for building MPI codes:
Compiler | PGI | GNU | Serial/Parallel |
---|---|---|---|
MPI C | mpicc | mpicc | Parallel |
MPI C++ | mpicc | mpicc | Parallel |
MPI F77 | mpif77 | mpif77 | Parallel |
MPI F90 | mpif90 | mpif90 | Parallel |
To select one of these compilers for use, load its associated module. See Relevant Modules (below) for more details.
5.2.1. PGI C, C++, and Fortran Compiler
The latest versions of the PGI compiler suite are also available to provide compatibility and portability of codes from other systems.
Several optimizations and tuning options are available for code developed with all PGI compilers. The table below shows some compiler options that may help with optimization.
Option | Purpose |
---|---|
-O0 | disable optimization |
-g | create symbols for tracing and debugging |
-O1 | optimize for speed with no loop unrolling and no increase in code size |
-O2 or -default | default optimization, optimize for speed with inline intrinsic and loop unrolling |
-O3 | level -O2 optimization plus memory optimization (allows compiler to alter code) |
-Mipa | Enable and specify options for Interprocedural Analysis (IPA) |
The following tables contain examples of serial, MPI, and OpenMP compile commands for C, C++, and Fortran.
Programming Model | Compile Command |
---|---|
Serial | pgcc -O3 my_code.c -o my_code.x |
IBM Spectrum | pgcc -O3 my_code.c -o my_code.x -lmpi |
OpenMP | pgcc -O3 my_code.c -o my_code.x -mp |
Programming Model | Compile Command |
---|---|
Serial | pgc++ -O3 my_code.C -o my_code.x |
IBM Spectrum | pgc++ -O3 my_code.C -o my_code.x -lmpi |
OpenMP | pgc++ -O3 my_code.C -o my_code.x -mp |
Programming Model | Compile Command |
---|---|
Serial | pgf90 -O3 my_code.f90 -o my_code.x |
IBM Spectrum | pgf90 -O3 my_code.f90 -o my_code.x -lmpi |
OpenMP | pgf90 -O3 my_code.f90 -o my_code.x -mp |
5.2.2. GNU Compiler
The default GNU compilers are good for compiling utility programs but are probably not appropriate for computationally intensive applications. They are available without loading a separate module. The primary advantage of the GNU compilers is their compatibility across different architectures. They can be executed using the compiler commands listed in Available Compilers (above). For the GNU compilers, the -O flag is the basic optimization setting.
More GNU compiler information can be found in the GNU gcc 4.8.5 manual.
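For reference, typical GNU compile lines follow the same pattern as the PGI examples above; the file names are placeholders, and the MPI wrapper assumes an MPI module is loaded.
gcc -O my_code.c -o my_code.x           # serial
mpicc -O my_code.c -o my_code.x         # MPI
gcc -O -fopenmp my_code.c -o my_code.x  # OpenMP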
5.3. Relevant Modules
If you compile your own codes, you need to select which compiler and MPI version you want to use. For example:
module load compiler/pgi/x.x mpi/openmpi/x.x.x
These same module commands should be executed in your batch script before executing your program.
SCOUT provides individual modules for each compiler and MPI version. To see the list of currently available modules, use the module avail command. You can use any of the available MPI versions with each compiler by pairing them together when you load the modules.
The table below shows the naming convention used for various modules.
Module | Module Name |
---|---|
GCC Compilers | compiler/gcc/#.#.# |
PGI Compilers | compiler/pgi/#.# |
Go Compilers | compiler/go/#.# |
IBM Spectrum MPI | mpi/spectrum/#.# |
OpenMPI | mpi/openmpi/#.# |
For more information on using modules, see the ARL DSRC Modules Guide.
5.4. Libraries
5.4.1. Basic Linear Algebra Subprogram (BLAS)
The BLAS library is a set of high-quality routines for performing basic vector and matrix operations. There are three levels of BLAS operations:
- BLAS Level 1: vector-vector operations
- BLAS Level 2: matrix-vector operations
- BLAS Level 3: matrix-matrix operations
More information on the BLAS library can be found at http://www.netlib.org/blas.
5.4.2. Additional Math Libraries
There is also an extensive set of math libraries available in the /opt/ibmmath/essl/6.2 directory on SCOUT. Information about these libraries may be found on the Baseline Configuration Web site at BC policy FY13-01.
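As a hedged example, a Fortran code might be linked against the ESSL installation noted above roughly as follows; the lib64 subdirectory and -lessl library name are assumptions to verify against the actual installation on SCOUT.
pgf90 -O3 my_code.f90 -o my_code.x -L/opt/ibmmath/essl/6.2/lib64 -lessl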
5.5. Debuggers
5.5.1. GNU Project Debugger (gdb)
gdb works similarly to dbx and can be invoked with a program for execution, a core file, or a running process ID. To examine a program together with the core file it produced, use:
gdb a.out corefile
To debug a process currently executing on this node, use:
gdb a.out pid
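Once inside gdb, a few commonly used commands are shown below; the variable name is illustrative.
(gdb) break main    # set a breakpoint at the start of main
(gdb) run           # start the program
(gdb) backtrace     # show the call stack after a stop or crash
(gdb) print my_var  # inspect a variable
(gdb) quit          # exit gdb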
For more information, the GDB manual can be found at http://sourceware.org/gdb/current/onlinedocs/gdb.
5.6. Code Profiling and Optimization
Profiling is the process of analyzing the execution flow and characteristics of your program to identify sections of code that are likely candidates for optimization, which increases the performance of a program by modifying certain aspects for increased efficiency.
We provide the profiling tool, gprof, to assist you in the profiling process. A basic overview of optimization methods with information about how they may improve the performance of your code can be found in Performance Optimization Methods (below).
5.6.1. GNU Project Profiler (gprof)
gprof shows how your program is spending its time and which function calls are made. To profile code using gprof, use the -pg option during compilation. For more information, the gprof manual can be found at http://sourceware.org/binutils/docs/gprof/index.html.
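A typical gprof workflow looks like the following; the source file name is a placeholder.
gcc -O2 -pg my_code.c -o my_code.x         # compile and link with -pg
./my_code.x                                # run normally; writes gmon.out
gprof ./my_code.x gmon.out > profile.txt   # generate the profile report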
5.6.2. Program Development Reminders
If an application is not programmed for distributed memory, then only the cores on a single node can be used. This is limited to 40 cores on SCOUT.
Check the utilization of the nodes your application is running on to see if it is taking advantage of all the resources available to it. This can be done by finding the nodes assigned to your job by executing bjobs -l JOB_ID, logging into one of the nodes using the ssh command, and then executing the top command to see how many copies of your executable are being executed on the node.
Keep the system architecture in mind during code development. For instance, if your program requires more memory than is available on a single node, then you need to parallelize your code so it can function across multiple nodes.
5.6.3. Performance Optimization Methods
Optimization generally increases compilation time and executable size, and it may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations you can use vary depending on your code and the system on which you are running.
Note: Before considering optimization, you should always ensure your code runs correctly and produces valid output.
In general, there are five main categories of optimization:
- Global Optimization
- Loop Optimization
- Interprocedural Analysis and Optimization (IPA)
- Function Inlining
- Profile-Guided Optimizations
Global Optimization
A technique that looks at the program as a whole and may perform any of the following actions:
- Performed on code over all its basic blocks
- Performs control-flow and data-flow analysis for an entire program
- Detects all loops, including those formed by IF and GOTO statements, and performs general optimization.
- Constant propagation
- Copy propagation
- Dead store elimination
- Global register allocation
- Invariant code motion
- Induction variable elimination
Loop Optimization
A technique that focuses on loops (for, while, etc.) in your code and looks for ways to reduce loop iterations or parallelize the loop operations. The following types of actions may be performed:
- Vectorization - rewrites loops to improve memory access performance. Compilers on SCOUT can automatically convert loops to utilize the instructions and registers on processors if they meet certain criteria.
- Loop unrolling - (also known as "unwinding") replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization.
- Parallelization - divides loop operations over multiple processors where possible.
Interprocedural Analysis and Optimization (IPA)
A technique that allows the use of information across function call boundaries to perform optimizations that would otherwise be unavailable.
Function Inlining
A technique that seeks to reduce function call and return overhead.
- Used with functions that are called numerous times from relatively few locations.
- Allows a function call to be replaced by a copy of the body of that function.
- May create opportunities for other types of optimization.
- May not be beneficial. Improper use may increase code size and result in less efficient code.
Profile-Guided Optimizations
Profile-Guided Optimizations allow the compiler to make data-driven decisions during compilation on branch prediction, increased parallelism, block ordering, register allocation, function ordering, and more. The build process takes three steps and uses a representative data set to guide the optimizations.
For example:
Step 1: Instrumentation, Compilation, and Linking
gfortran -fprofile-generate -fprofile-dir=${HOME}/profdata -O2 -c a1.f a2.f a3.f
gfortran -fprofile-generate -o a1 a1.o a2.o a3.o
Step 2: Instrumentation Execution
./a1
Step 3: Feedback Compilation
gfortran -fprofile-use -fprofile-dir=${HOME}/profdata -O2 a1.f a2.f a3.f -o a1
6. Batch Scheduling
6.1. Scheduler
The Load Sharing Facility (LSF) is currently running on SCOUT. It schedules jobs and manages resources and job queues and can be accessed through the interactive batch environment or by submitting a batch request. LSF can manage both single-processor and multiprocessor jobs. The LSF module is automatically loaded by the Master module on SCOUT at login.
6.2. Queue Information
The following table describes the LSF queues available on SCOUT:
Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Max Queued Per User | Max Running Per User | Description |
---|---|---|---|---|---|---|
Highest | transfer | 48 Hours | N/A | N/A | N/A | Data transfer for user jobs. Not charged against project allocation. See the ARL DSRC Archive Guide, section 5.2. |
| urgent | 96 Hours | N/A | N/A | N/A | Jobs belonging to DoD HPCMP Urgent Projects |
| debug | 1 Hour | N/A | N/A | N/A | Time/resource-limited for user testing and debug purposes |
| high | 168 Hours | N/A | N/A | N/A | Jobs belonging to DoD HPCMP High Priority Projects |
| frontier | 168 Hours | N/A | N/A | N/A | Jobs belonging to DoD HPCMP Frontier Projects |
| HIE | 24 Hours | N/A | N/A | N/A | Rapid response for interactive work. For more information see the HPC Interactive Environment (HIE) User Guide. |
| interactive | 12 Hours | N/A | N/A | N/A | Interactive jobs |
| standard | 168 Hours | N/A | N/A | N/A | Standard jobs |
Lowest | background | 24 Hours | N/A | N/A | N/A | User jobs that are not charged against the project allocation |
6.3. Interactive Logins
When you log in to SCOUT, you will be running in an interactive shell on a login node. The login nodes provide login access for SCOUT and support such activities as compiling, editing, and general interactive use by all users. Please note the Login Node Abuse policy. The preferred method to run resource intensive executions is to use an interactive batch session.
6.4. Interactive Batch Sessions
An interactive session on a compute node is possible using a proper LSF command line syntax from a login node. Once LSF has scheduled your request on the compute pool, you are directly logged into a compute node, and this session can last as long as your requested wall time.
To submit an interactive batch job, use the following submission format:
bsub -Is -X -n 160 -m training -P Project_ID -q debug -gpu "num=2:mode=shared:j_exclusive=yes" -W 01:00 -x /bin/bash
Your batch shell request will be placed in the interactive queue and scheduled for execution. This may take a few minutes or much longer depending on the system load. Once your shell starts, you will be logged into the first compute node of the compute nodes assigned to your interactive batch job. At this point, you can run or debug applications interactively, execute job scripts, or start executions on the compute nodes you were assigned. The -X option enables X-Windows access, so it may be omitted if that functionality is not required for the interactive job.
6.5. Batch Request Submission
LSF batch jobs are submitted via the bsub command. The format
of this command is:
bsub < [ options ] batch_script_file
bsub options may be specified on the command line or embedded in the batch script file by lines beginning with #BSUB.
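For example, the two hedged sketches below request the same resources, first on the command line and then with embedded directives; the script name, core count, and wall time are placeholders.
bsub -P Project_ID -q standard -n 40 -W 1:00 < my_job_script
Or, equivalently, embedded at the top of my_job_script:
#BSUB -P Project_ID
#BSUB -q standard
#BSUB -n 40
#BSUB -W 1:00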
6.6. Launch Commands
There are different commands for launching MPI executables from within a batch job depending on which MPI implementation your script uses.
To launch an IBM Spectrum MPI executable, use the mpiexec command as follows:
mpiexec -n #_of_MPI_tasks ./mpijob.exe
To launch an OpenMPI executable, use the openmpi_wrapper command as follows:
openmpi_wrapper -n #_of_MPI_tasks ./mpijob.exe
For OpenMP executables, no launch command is needed.
6.7. Sample Script
The following script is a basic example. More thorough examples are available in the Sample Code Repository ($SAMPLES_HOME) on SCOUT.
Note: By default, GPU support is turned off. To turn it on, use the following:
mpirun -gpu
Using the -gpu option causes additional runtime checking of every buffer passed to MPI. The -gpu flag is only required for applications that pass pointers to GPU buffers to MPI API calls. Applications that use GPUs, but do not pass pointers that refer to memory managed by the GPU, are not required to pass the -gpu option.
#!/bin/csh
# Specify job name.
#BSUB -J myjob
# Specify queue name.
#BSUB -q standard
#BSUB -n 40
# Specify how MPI processes should be distributed across nodes.
#BSUB -R "span[ptile=20]"
# Specify maximum wall clock time (hours:minutes).
#BSUB -W 24:00
# Specify Project ID to use. ID may have the form ARLAP96090RAY.
#BSUB -P XXXXXXXXXXXXX
# Specify that environment variables should be passed to master MPI process.
#BSUB -V

set JOBID=`echo $LSF_JOBID | cut -f1 -d.`

# Create a temporary working directory within $WORKDIR for this job run.
set TMPD=${WORKDIR}/${JOBID}
mkdir -p $TMPD

# Change directory to submit directory
# and copy executable and input file to scratch space.
cd $LSF_O_WORKDIR
cp mpicode.x $TMPD
cp input.dat $TMPD
cd $TMPD

# The following lines provide an example of running a code built
# with the gcc compiler and IBM Spectrum MPI.
module load compiler/gcc/9.1.1 mpi/spectrum/10.03
mpiexec -n 40 ./mpicode.x > out.dat

cp out.dat $LSF_O_WORKDIR

exit

############ MPI+CUDA Hybrid Example ############

#BSUB -m inf[001-128]
#BSUB -n 4
#BSUB -gpu "num=4:mode=shared:j_exclusive=yes"
#BSUB -P ARLAP96090RAY
#BSUB -J mpi_cuda_job
#BSUB -o ./%J_hw.out
#BSUB -e ./%J_hw.err
#BSUB -x

module unload compiler
module unload mpi
module unload cuda
module load compiler/gcc/8.3.1
module load mpi/spectrum/10.03
module load cuda/10.2

nvidia-smi

mkdir $WORKDIR/$JOBID && cd $WORKDIR/$JOBID

mpirun -gpu -n 4 ./a.out >& output.dat
6.8. LSF Commands
The following commands provide the basic functionality for using the LSF batch system:
bsub: Used to submit jobs for batch processing.
bsub [options] my_job_script
bjobs: Used to check the status of submitted jobs.
bjobs $LSF_JOBID       ## check one job
bjobs -u my_user_name  ## check all of user's jobs
bjobs -u all -a        ## check all jobs of all users
bhosts: Used to display hosts and their static and dynamic resources.
bhosts -a
## returns host name, host status, job state statistics, and
## job slot limits for all hosts.
bqueues: Displays information about queues.
bqueues
QUEUE_NAME     PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
transfer        110 Open:Active       -    -    -    -     0     0     0     0
urgent          100 Open:Active       -    -    -    -     0     0     0     0
debug            99 Open:Active       -    -    -    -     0     0     0     0
high             90 Open:Active       -    -    -    -     0     0     0     0
frontier         80 Open:Active       -    -    -    -     0     0     0     0
HIE              60 Open:Active       -    -    -    -     0     0     0     0
interactive      60 Open:Active       -    -    -    -     0     0     0     0
staff            50 Open:Active       -    -    -    -     0     0     0     0
standard         50 Open:Active       -    -    -    -  8709  4080  4629     0
background        1 Open:Active    1200    -    -    -     0     0     0     0
bkill: Used to kill queued or running jobs.
bkill $LSF_JOBID
7. Software Resources
7.1. Application Software
All Commercial Off The Shelf (COTS) software packages can be found in the $CSI_HOME (/p/app) directory. A complete listing of software on SCOUT with installed versions can be found on our Software page. The general rule for all COTS software packages is that the two latest versions will be maintained on our systems. For convenience, modules are also available for most COTS software packages.
7.2. Useful Utilities
The following utilities are available on SCOUT. For command-line syntax and examples of usage, please see each utility's online man page.
Name | Description |
---|---|
archive | Perform basic file-handling operations on the archive system |
bcmodule | An enhanced version of the standard module command |
check_license | Check the status of licenses for HPCMP shared applications |
mpscp | High-performance remote file copy |
node_use | Display the amount of free and used memory for login nodes |
show_usage | Display CPU allocation and usage by subproject |
Name | Description |
---|---|
dos2unix | Converts text to Unix format |
7.3. Sample Code Repository
The Sample Code Repository is a directory that contains examples for COTS batch scripts, building and using serial and parallel programs, data management, and accessing and using serial and parallel math libraries. The $SAMPLES_HOME environment variable contains the path to this area and is automatically defined in your login environment. Below is a listing of the examples provided in the Sample Code Repository on SCOUT.
Applications: Application-specific examples; interactive job submit scripts; use of the application name resource; software license use.
Sub-Directory | Description |
---|---|
abaqus | Basic batch script and input deck for an Abaqus application. |
adf | Basic batch script and input deck for an ADF application. |
ale3d | Basic batch script and input deck for an ALE3D application. |
ansys | Basic batch script and input deck for an ANSYS application. |
castep | Basic batch script and input deck for a CASTEP application. |
cfd++ | Basic batch script and input deck for a CFD++ application. |
cfx | Basic batch script and input deck for an ANSYS CFX application. |
comsol | Basic batch script and input deck for a COMSOL application. |
cth | Basic batch script and input deck for a CTH application. |
dakota | Basic batch script and input deck for a DAKOTA application. |
dmol3 | Basic batch script and input deck for a DMOL3 application. |
fluent | Basic batch script and input deck for a FLUENT (now ACFD) application. |
forcite | Basic batch script and input deck for a FORCITE application. |
fun3d | Basic batch script for a FUN3D application. |
GAMESS | auto_submit script and input deck for a GAMESS application. |
gaussian | Input deck for a GAUSSIAN application and automatic submission script for submitting a Gaussian job. |
gromacs | Basic batch script and input deck for a GROMACS application. |
iscampi | Example script for utilizing iscampi tool. |
lammps | Basic batch script and input deck for a LAMMPS application. |
ls-dyna | Basic batch script and input deck for a LS-DYNA application. |
lsopt | Basic batch script and input deck for an LS-OPT application. |
mathematica | Basic batch script and input deck for a MATHEMATICA application. |
matlab | Basic batch script and sample m file for a MATLAB application. |
mesodyn | Basic batch script and input deck for a MesoDyn application. |
namd | Basic batch script and input deck for a NAMD application. |
OPENFOAM | Basic batch script and input deck for an OPENFOAM application. |
qe | Basic batch script and input deck for a QE application. |
picalc | Basic PBS example batch script. |
STARCCM+ | Basic batch script and input deck for a STAR-CCM+ application. |
vasp | Basic batch script and input deck for a VASP application. |
velodyne | Basic batch script and input deck for a VELODYNE application. |
xpatch | Basic batch script and input deck for an Xpatch application. |
Data_Management: Archiving and retrieving files; Lustre striping; file searching; $WORKDIR use.
MPSCP_Example | Directory containing a README file giving examples of how to use the mpscp command to transfer files between SCOUT and remote systems. |
OST_Stripes | Description of how to use OST striping to improve disk I/O. |
Postprocess_Example | Sample batch script showing how to submit a transfer queue job at the end of your computation job. |
Transfer_Example | Sample batch script showing how to stage data out after a job executes using the transfer queue. |
Transfer_Queue_with_Archive_Commands | Sample directory containing sample batch scripts demonstrating how to use the transfer queue to retrieve input data for a job, chain a job that uses that data to run a parallel computation, then chain that job to another that uses the transfer queue to put the data back in archive for long term storage. |
Documentation: User documentation.
User_Manual_SGI-MPI.pdf | Scout User's Manual |
FlexLm: Sample license server software commands.
lmutil | Sample lmutil command |
rlmutil | Sample rlmutil command |
Parallel_Environment: MPI, OpenMP, and hybrid examples; jobs using a large number of nodes; single-core jobs; large memory jobs; running multiple applications within a single batch job.
Hybrid | Simple MPI/OpenMP hybrid example and batch script. |
Large_Jobs | A sample PBS job script you can copy and adapt for executing large jobs, those requiring more than 11,000 cores or 305 nodes. |
Large_Memory_Jobs | A sample large-memory job script. |
MPI_PBS_Examples | Sample PBS job scripts for SGI MPT and IntelMPI codes built with the Intel and GNU compilers. |
Multiple_Jobs_per_Node | Sample PBS job scripts for running multiple jobs on the same node. |
OpenMP | A simple OpenMP example and batch script. |
Programming: Basic code compilation; debugging; use of library files; static vs. dynamic linking; Makefiles; Endian conversion.
COMPILE_INFO | Provides common options for compiling and configuring. |
Core_Files | Provides examples of three core file viewers. |
DDT_Example | Using DDT to debug a small example code in an interactive batch job. |
Endian_Conversion | Instructions on how to manage data created on a machine with different Endian format. |
GPU_Examples | Several examples demonstrating use of system tools, compilation techniques, and PBS scripts to generate and execute code using the GPGPU accelerators on SCOUT. |
Intel_MPI_Example | Simple example of how to run a job built with IntelMPI. |
ITAC_Example | Example for using Intel Trace Analyzer and Collector. |
Large_Memory_Example | Simple example of how to run a job using Large-Memory nodes. |
Memory_Usage | Sample build and script that shows how to determine the amount of memory being used by a process. |
MKL_BLACS_Example | Example of how to build and run codes built using the INTEL MKL BLACS libraries |
MKL_ScaLAPACK_Example | Example of how to build and run codes built using the INTEL MKL ScaLAPACK libraries. |
MPI_Compilation | Examples of how to build SGI MPT, IntelMPI and OpenMPI code. |
Open_Files_Limits | This example discusses the maximum number of simultaneously open files an MPI process may have, and how to adjust the appropriate settings in a PBS job. |
SO_Compile | Simple example of creating a shared object (SO) library, compiling against it, and running with it on the compute nodes. |
Timers_Fortran | Serial Timers using Fortran Intrinsics f77 and f90/95. |
VTune | Example of using Intel VTune. |
User_Environment: Use of modules; customizing the login environment.
Module_Swap_Example | Instructions for using module swap command. |
Workload_Management: Basic batch scripting; use of the transfer queue; job arrays; job dependencies; Secure Remote Desktop; job monitoring.
BatchScript_Example | Basic PBS batch script example. |
Core_Info_Example | Sample code for generating the MPI process/core or OpenMP thread/core associativity in compute jobs. |
Documentation | Microsoft Word version of the PBS User's Guide. |
Hybrid_Example | Simple MPI/OpenMP hybrid example and batch script. |
Interactive_Example | Instructions on how to submit an interactive PBS job. |
Job_Array_Example | Instructions and example job script for using job arrays. |
Job_Dependencies_Example | Example scripts on how to use PBS job dependencies |
8. Links to Vendor Documentation
8.1. IBM Links
IBM Home: http://www.ibm.com
IBM Power9: https://www.ibm.com/it-infrastructure/power
IBM IC922 Inference Nodes: https://www.ibm.com/products/power-system-ic922
IBM AC922 Training Nodes: https://www.ibm.com/products/power-systems-ac922
8.2. Red Hat Link
Red Hat Home: http://www.redhat.com
8.3. GNU Links
GNU Home: http://www.gnu.org
GNU Compiler: http://gcc.gnu.org
8.4. PGI Links
PGI Home: http://www.pgroup.com
PGI Compiler Documentation: http://www.pgroup.com/resources/docs.php
9. Glossary
- Login Node :
- a node that serves as the user's entry point into an HPC system
- Compute Node :
- a node that performs computational tasks for the user. There may be multiple types of compute nodes for specialized purposes.
- Parallel File System :
- A software component designed to store data across multiple networked servers and to facilitate high-performance access through simultaneous, coordinated input/output operations (IOPS) between clients and storage nodes.
- Batch-scheduled :
- users request compute nodes via commands to batch scheduler software and wait in a queue until the requested nodes become available
- Batch Job :
- a single request for a set of compute nodes along with a set of tasks (usually in the form of a script) to perform on those nodes
- Shared Memory Model :
- a programming methodology where a set of processors (such as the cores within one node) have direct access to a shared pool of memory
- Distributed Memory Model :
- a programming methodology where memory is distributed across multiple nodes giving processes on each node faster direct access to local memory, but requiring slower techniques such as message passing to access memory on other nodes
- Kerberos :
- authentication and encryption software required by the HPCMP to access HPC system login nodes and other resources. See Kerberos & Authentication