HPCMP Cloud
Quick Start Guide

1. HPCMP Cloud Introduction

The Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) is at a pivotal moment: strategic choices made today will define how the modernization effort aligns with DoD and Joint imperatives. The HPCMP is expanding its operations to include a cloud capability that combines high-speed network communications with computational science expertise, enabling DoD laboratories and test centers to conduct focused research, development, test, evaluation, and acquisition activities across several computational technology areas (CTAs). This effort will mature the cloud-based HPC capability for HPCMP users while enhancing the existing cybersecurity infrastructure, governance, and operating framework.

This document summarizes the information you need to get started quickly on the HPCMP Cloud.

2. How to get an HPCMP Cloud Account

New HPCMP Cloud users can request an account by submitting the HPCMP Cloud User Resource Request Form. The form contents are sent to the Resource Management (RM) team at require@hpc.mil.

3. How to Connect to the HPCMP Cloud

3.1. Download SSH Key

Please see the section for Signing onto the Login Node in the HPCMP Cloud User Guide.

3.2. Sign onto Login Node

Your username is the first part of the email address used to create the AWS account. For example, if your email was username@example.com, then your username is username. The IP address for the login node is 52.222.57.229. Download the .pem version of the key if you intend to use a UNIX terminal, or the .ppk version if you will be using PuTTY.

3.2.1. Using a UNIX Terminal

Copy the key file to the machine you will be connecting from, and restrict its permissions so that it is readable only by you:

chmod 400 /path_to_key_pair/my-key-pair.pem

Then connect with SSH:

ssh -i /path_to_key_pair/my-key-pair.pem username@52.222.57.229

3.2.2. Using PuTTY

Connect to the server using your private key - https://devops.ionos.com/tutorials/use-ssh-keys-with-putty-on-windows/#use-existing-public-and-private-keys

4. Home and Working Directories

Each user has file space in their respective $HOME directories. The $HOME environment variable is predefined for you and points to the appropriate locations in the file systems. You are strongly encouraged to use this variable in your scripts.

A default $WORKDIR variable is not set up automatically. Each cluster has a "scratch" file system that you can point a $WORKDIR variable to manually. The $WORKDIR file system is not backed up. You are responsible for managing files in your $WORKDIR directories: back up important files to the login node and delete unneeded files. Any $WORKDIR files that have not been saved will be permanently lost when the cluster is deleted.

You WILL NOT be notified prior to scratch workspace deletion. You are responsible for monitoring your workspace to prevent data loss.
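Since $WORKDIR is not predefined, you can set it yourself, for example in your shell startup file. The lines below are a minimal sketch; the /tmp/scratch default is only so the example runs anywhere, and the actual scratch mount point on your cluster is an assumption you should verify.

```shell
# Point $WORKDIR at the cluster's scratch file system and create a
# per-user directory there. SCRATCH_ROOT's default below is only so
# this sketch runs anywhere -- on a real cluster, substitute the
# actual scratch mount point (an assumption; check your cluster).
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/scratch}"
export WORKDIR="$SCRATCH_ROOT/$(id -un)"
mkdir -p "$WORKDIR"
echo "WORKDIR set to $WORKDIR"
```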

5. Transferring Files and Data to the HPCMP Cloud

File transfers to the HPCMP Cloud can be performed with tools such as scp, sftp, and mpscp. For example, the command below uses secure copy (scp) to copy a local file into a destination directory on an HPCMP login node.

% scp local_file user@52.222.57.229:/target_dir

6. Submitting Jobs

Slurm is the workload management system for HPCMP Cloud. To submit a job, use the following command:

sbatch [ options ] my_job_script

where my_job_script is the name of the file containing your batch script. For more information on using Slurm or on job scripts, see the Slurm User Guide.
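For illustration, the lines below create a minimal job script; the resource requests and the program name ./my_program are placeholders, and partition defaults are cluster-specific.

```shell
# Write a minimal Slurm batch script to a file named my_job_script.
# The resource values and ./my_program are placeholders for illustration.
cat > my_job_script <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_job          # name shown in squeue output
#SBATCH --nodes=2                  # number of nodes to allocate
#SBATCH --ntasks-per-node=4        # tasks (e.g., MPI ranks) per node
#SBATCH --time=01:00:00            # wall-clock limit, HH:MM:SS
#SBATCH --output=my_job.%j.out     # output file (%j = job ID)

srun ./my_program                  # launch the program across all tasks
EOF

# Submit it with:  sbatch my_job_script
bash -n my_job_script && echo "job script syntax OK"
```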

7. Batch Queues

There are no batch queues on HPCMP Cloud, since all jobs run on dedicated clusters.

8. Monitoring Your Job

You can monitor your jobs using the squeue command. Its output looks similar to the following (values are illustrative):

JOBID PARTITION     NAME  USER ST     TIME  NODES NODELIST(REASON)
 1234   compute   job_one user  R  1:05:32      2 compute-[01-02]
 1235   compute   job_two user  R    12:10      1 compute-03
 1236   compute job_three user PD     0:00      4 (Resources)

The ST field is the job state: in this example, two jobs are in a running state (R is an abbreviation for Running) while one job is in a pending state (PD is an abbreviation for Pending). The TIME field shows how long each job has run, in the format days-hours:minutes:seconds. The NODELIST(REASON) field indicates where the job is running, or the reason it is still pending.

The squeue command has many options to easily let you view the information of interest to you in whatever format you prefer. See the squeue man page for more information.

9. Saving Your Work

When your job is finished, copy any important data to your $HOME directory on the login node so that it is not automatically deleted along with the cluster.
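For example, you might copy a results directory out of scratch space before the cluster is deleted. In the sketch below, the $WORKDIR default and the results directory are hypothetical stand-ins, created here only so the example runs anywhere.

```shell
# Sketch: preserve job output from the non-backed-up scratch space
# before the cluster is deleted. $WORKDIR and results/ are stand-ins
# created here so the example is self-contained.
WORKDIR="${WORKDIR:-/tmp/demo-scratch}"
mkdir -p "$WORKDIR/results"
echo "final data" > "$WORKDIR/results/output.dat"   # pretend job output

# Copy everything worth keeping to the home file system.
mkdir -p "$HOME/saved_results"
cp -r "$WORKDIR/results" "$HOME/saved_results/"
echo "copied to $HOME/saved_results/results"
```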

10. Available Software

The login node is equipped with basic utilities, such as the AWS CLI and AWS ParallelCluster, to help you launch your computing cluster. All required software is made available on the head (master) node of the cluster by selecting the image that best matches your use case.
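To give a sense of what launching a cluster involves, below is a minimal sketch of a cluster definition, assuming AWS ParallelCluster version 3's YAML configuration format; the region, OS, instance types, subnet ID, and key name are placeholders, not HPCMP-provided values.

```yaml
# Hypothetical ParallelCluster 3 configuration sketch -- every value
# below (region, OS, instance types, subnet, key name) is a placeholder.
Region: us-gov-west-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-placeholder
  Ssh:
    KeyName: my-key-pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: nodes
          InstanceType: c5.4xlarge
          MinCount: 0
          MaxCount: 8
      Networking:
        SubnetIds:
          - subnet-placeholder
```

With such a file saved as config.yaml, a cluster could then be created with a command like: pcluster create-cluster --cluster-name my-cluster --cluster-configuration config.yaml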

11. Job Reservations

No advance reservation is needed.