HPCMP Cloud
User Guide
Table of Contents
- 1. Introduction
- 1.1. Document Scope and Assumptions
- 1.2. Policies to Review
- 1.3. Obtaining an Account
- 2. System Configuration
- 2.1. System Summary
- 2.2. Operating System
- 2.3. File Systems
- 3. Accessing the System
- 3.1. Signing onto the Login Node
- 3.2. PuTTY
- 3.3. Using a Linux Shell
- 3.4. Logging In
- 3.5. File Transfers
- 4. User Environment
- 4.1. User Directories
- 4.2. Shells
- 4.3. Modules
- 4.4. Archive Usage
- 5. Program Development
- 5.1. Programming Models
- 5.2. Available Compilers
- 5.3. Debuggers, Code Profiling, and Optimization
- 6. Job Scheduling
- 7. Setting Up MFA Access
- 8. Importing Your Own SSH Key Pair
- 8.1. From the AWS CLI (from a local Linux system)
- 8.2. From the AWS Console (from a local Windows system)
- 8.3. Retrieving the public key for your key pair on Linux
- 8.4. Importing keys into AWS
- 8.5. Replacing your login node key pair
- 9. Configuring AWS CLI
- 10. Launching a Cluster
- 10.1. Configure ParallelCluster
- 10.2. Launch Cluster Command
- 10.3. AMI List and Config Files
- 11. Running a Job
- 12. Deleting the Cluster
- 13. Batch Scheduling
- 13.1. Scheduler
- 13.2. Interactive Logins
- 14. Recommended Workflow
1. Introduction
1.1. Document Scope and Assumptions
This document provides an overview and introduction to the use of the Open Research Service, HPCMP Cloud, located in AWS GovCloud, along with a description of the specific computing environment on HPCMP Cloud. The intent of this guide is to provide information that will enable the average user to perform computational tasks on the system. To receive the most benefit from the information provided here, you should be proficient in the following areas:
- Use of the UNIX operating system
- Use of an editor (e.g., vi or emacs)
- Remote usage of computer systems via network or modem access
- A selected programming language and its related tools and libraries
1.2. Policies to Review
Users are expected to be aware of the following policies for working on the HPCMP Cloud environment.
1.2.1. Login Node Abuse Policy
The login nodes, cloud01-cloud02, provide login access for HPCMP Cloud and support such activities as the launching of a cluster and general interactive use by all users. Consequently, memory or CPU-intensive programs running on the login nodes can significantly affect all users of the system. Therefore, only small applications requiring less than 10 minutes of runtime and less than 8 GBytes of memory are allowed on the login nodes. Any job running on the login nodes that exceeds these limits may be unilaterally terminated.
1.2.2. Workspace Management Policy
A default $WORKDIR variable is not set up automatically. Each cluster will have a "scratch" file system that can be manually pointed to with the $WORKDIR variable. The $WORKDIR file system is not backed up. You are responsible for managing files in your $WORKDIR directories by backing up files to the login node and deleting unneeded files. Any $WORKDIR files that have not been saved will be deleted and permanently lost when the cluster is deleted.
Note: You WILL NOT be notified prior to scratch workspace deletion. You are responsible for monitoring your workspace to prevent data loss.
1.3. Obtaining an Account
Please navigate to the following link: https://centers.hpc.mil/systems/ors.html#account
2. System Configuration
2.1. System Summary
HPCMP Cloud is an AWS HPC system. The login nodes are populated with AMD EPYC 7000 processors. HPCMP Cloud allows users to create a compute cluster specific to their needs, optimizing compute, network, and storage as required. The table below provides three different sample configurations, which users are welcome to modify further.
All clusters will be configured with compute nodes that share memory only on the node; memory is not shared across nodes. Based on the I/O requirements for a given application, the cluster can be configured with the needed shared $WORKDIR.
HPCMP Cloud is intended to be used as a dedicated HPC system without the need for queuing jobs. Its login nodes are not to be used for computational work (e.g., memory-intensive, I/O-intensive, or long-running executions). All executions that require large amounts of system resources must be sent to the compute nodes by creating a cluster and submitting jobs from the master/head node.
| | Login Nodes | Compute Nodes: High Compute | Compute Nodes: Moderate Compute | Compute Nodes: GPU Accelerated |
|---|---|---|---|---|
| Instance Type | t3a.xlarge | c5n.18xlarge | c5n.4xlarge | p3.16xlarge |
| Total Cores / Nodes | 2 / 1 | 4,104 / 114 | 4,096 / 512 | 4,096 / 128 |
| Operating System | Amazon Linux 2 | Amazon Linux 2 | Amazon Linux 2 | Amazon Linux 2 |
| Cores/Node | 2 | 36 | 8 | 32 |
| Core Type | AMD EPYC 7000 | Intel (Advanced Vector Extension 512) | Intel (Advanced Vector Extension 512) | Intel Xeon E5-2686 v4 |
| Core Speed | 2.5 GHz | 3.0 - 3.5 GHz | 3.0 - 3.5 GHz | 2.5 GHz |
| Memory/Node | 16 GBytes | 192 GBytes | 42 GBytes | 488 GBytes |
| Accessible Memory/Node | 16 GBytes | 192 GBytes | 42 GBytes | 488 GBytes |
| Memory Model | Shared on node | Shared on node | Shared on node | Shared on node |
| Interconnect Type | Ethernet | Ethernet | Ethernet | Ethernet |
| Path | Capacity | Type |
|---|---|---|
| /p/home ($HOME) | 500 GBytes | EFS |
2.2. Operating System
The operating system on HPCMP Cloud's login and compute nodes is Amazon Linux 2. The compute nodes can provide access to dynamically shared objects and most of the typical Linux commands and basic functionality.
2.3. File Systems
HPCMP Cloud has the following file systems available:
2.3.1. /home/username/
This file system is an AWS EFS (Elastic File System), which is the equivalent of an NFS, locally mounted from HPCMP Cloud's login node. It has virtually unlimited storage capacity. All users have a home directory on this EFS which may be referenced by the environment variable $HOME.
Note: The home directory on your login node is not visible to the master node in the cluster. Files must be manually moved onto the master node from your login node.
2.3.2. Working Directory
This file system is locally mounted on the compute nodes. Each user must manually point to a working directory, which may be referenced, for example, by the environment variable $WORKDIR. This directory may be one of the storage types listed below.
2.3.3. Elastic Block Storage (EBS)
The ParallelCluster utility (see the Launching a Cluster section) automatically creates and mounts an EBS volume of 20 GiB under the "/shared" directory that is accessible to the entire cluster (master/head node as well as compute nodes).
The default name may be changed by modifying the appropriate parameter in the Login Node $HOME/.parallelcluster/config file under the EBS section. The following example shows a 50 GiB shared Amazon EBS volume mounted at /myshared:
shared_dir = myshared
volume_size = 50
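For context, these parameters typically live in a named [ebs] section of the ParallelCluster (v2) config, which the [cluster] section references through an ebs_settings key; the section name "custom" below is only an illustrative placeholder and your admin-supplied config file may differ:

[cluster default]
ebs_settings = custom

[ebs custom]
shared_dir = myshared
volume_size = 50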
2.3.4. Elastic File System (EFS)
Alternatively, if an undefined amount of scratch space will be needed by the job, then it is recommended to use an EFS shared file system. Amazon's EFS expands as data is being written to it and therefore the capacity is virtually unlimited. An EFS may be provisioned by modifying the appropriate parameter in the Login Node $HOME/.parallelcluster/config file under the EFS section. The following example mounts Amazon EFS at /efs:
shared_dir = efs
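Similarly, the EFS mount is typically defined in a named [efs] section that the [cluster] section references through an efs_settings key; the section name "customfs" below is only an illustrative placeholder:

[cluster default]
efs_settings = customfs

[efs customfs]
shared_dir = efs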
For further information on optimizing the shared scratch space for your cluster, refer to the AWS ParallelCluster User Guide.
3. Accessing the System
For instructions on setting up your login node the first time you log in, please refer to the Signing onto the Login Node section below.
3.1. Signing onto the Login Node
3.1.1. Download the SSH Key
Navigate to the following location (S3 buckets list within AWS) and find the bucket belonging to you. This bucket will be named in 'username-bucket' format. Locate your bucket and click on it. It will contain a private key named 'username.pem'. Only you have access to this bucket and can download this key.
https://console.amazonaws-us-gov.com/s3/buckets/dev-keypairhpcmp/?region=us-gov-east-1&tab=overview
Download the .pem file if you will log in from a UNIX terminal.
Note: This is a personal key-pair that only the user is meant to have access to.
The corresponding public key will be available in AWS and can be used during cluster creation as will be seen below.
Alternatively, a user may create other SSH key-pairs as needed (see the Importing Your Own SSH Key Pair section).
Your username will be the first part of your email address used to create the AWS account. For example, if your email used was username@example.com then the username will be username. The IP address for the login node is 52.222.57.229.
3.2. PuTTY
3.2.1. Using PuTTY
Download PuTTY. PuTTY is a popular SSH (Secure Shell) client for Windows. It is typically used for remote access to server computers over a network using the SSH protocol.
Connect to server using private key tutorial - https://devops.ionos.com/tutorials/use-ssh-keys-with-putty-on-windows/#use-existing-public-and-private-keys
3.2.2. Converting .pem to .ppk on Windows
- Click on Start menu > All Programs > PuTTY > PuTTYgen.
- In the following window select the option 'RSA' under 'Type of key to generate'.
- Next, click 'Load.' Because PuTTYgen defaults to its native format, only files with the .ppk extension are shown. Choose 'All Files' from the drop-down bar to display all key files, including the .pem file.
- Select the .pem file that you want to convert and click 'Open.' To confirm, click 'OK.'
- In the resultant window, click on 'Save private key' which will convert and save the key file in PuTTY compatible format.
- PuTTYgen will warn about saving the key without a passphrase. Click 'Yes'.
- Name your file and PuTTYgen will automatically add the .ppk file extension to it.
3.2.3. Adding the Key to Pageant
- Start Pageant from the PuTTY folder: Start Menu > All Programs > PuTTY > Pageant.
- Pageant starts minimized in the system tray by default. To begin adding your SSH keys, right-click the Pageant icon to bring up its context menu.
- Click Add Key from the menu, or View Keys to open the Pageant Key List window. Here you can view, add, and remove keys.
- Click the Add Key button. This opens the file explorer, where you can choose one or more keys at a time to load. Select files with the .ppk extension.
- Click the Open button to load the keys into Pageant. After a key is successfully added, it appears in the list.
- Once the key has been added to Pageant:
  - Enter the remote server Host Name or IP address under Session.
  - Navigate to Connection > SSH > Auth.
  - Click Browse... under Authentication parameters / Private key file for authentication.
  - Locate the private key and click Open.
  - Make sure to check "Allow agent forwarding".
  - Finally, click Open again to log into the remote server with key pair authentication.
3.2.4. Checking the Path
Once on the Login Node, run the following command:
echo $PATH
The path should include /home/"username"/.local/bin
3.3. Using a Linux Shell
Place the key where you will be using it from and change its permissions so that it is accessible only by you:
chmod 400 <key-file>
Before logging in through SSH, you can modify your .ssh/config file to configure your session and enable AgentForwarding. This allows you to SSH from your local machine to the Login Node, and once you create your cluster, seamlessly SSH onto your cluster's Master Node.
Alternatively, you may copy your private key over to your account on the Login Node and SSH from there.
Your config file may contain multiple entries. Add one for the Login Node:
Host devlog                              # you can pick a name of your choosing
    HostName 52.222.57.229
    User <your-username>
    ForwardAgent yes
    IdentityFile ~/.ssh/DevLoginNode.pem # point to your key file here
In order to enable Agent Forwarding, you must add your key to the ssh agent by running the following command each time you start your computer.
ssh-add <your-key>
Then to login:
ssh devlog
Or SSH explicitly using the following command:
ssh -i /path_to_key_pair/my-key-pair.pem username@52.222.57.229
3.4. Logging In
ssh -i /path_to_key_pair/my-key-pair.pem username@52.222.57.229
Your username will be the first part of your email address used to create the AWS account. For example, if your email used was username@example.com then the username will be username.
3.5. File Transfers
3.5.1. On Linux
File transfers to the Login Node may be performed using the following tools: scp, sftp, pscp, and psftp. The command below uses secure copy (scp) to copy a single local file into a destination directory on the HPCMP Cloud login node.
scp -i mykeypair.pem test.txt username@52.222.57.229:/home/username
Alternatively, the example below uses the secure file transfer protocol (sftp) to connect to HPCMP Login Node, then uses the sftp, cd, and put commands to change to the destination directory and copy a local file there. The sftp quit command ends the sftp session. Use the sftp help command to see a list of all sftp commands.
sftp -i mykeypair.pem user@52.222.57.229
sftp> cd target_dir
sftp> put local_file
sftp> quit
3.5.2. On Windows
Download the PSCP utility from PuTTY.org by clicking the file name link and saving it to your computer. To open a Command Prompt window, click Run from the Start menu and, if necessary, set up your path variable.
pscp -i mykeypair.ppk c:\documents\info.txt username@server.example.com:/home/username/info.txt
4. User Environment
The following user directories are provided for all users on HPCMP Cloud.
4.1. User Directories
4.1.1. Home Directory
When you log on to the login node, you will be placed in your home directory, /home/username. The environment variable $HOME is automatically set for you on the login node and refers to this directory. $HOME is visible only to the login node and may be used to store all user files.
Note: On the master node of each cluster, $HOME will point to the /home/ec2-user directory, which is not the same as or linked to the /home/username directory on the login node.
4.1.2. Work Directory
You may manually set the $WORKDIR variable to point to the "scratch" space or working directory of your choice on the compute cluster. For example, if you want to set the "/shared" directory as your work directory, do so as shown below:
WORKDIR=/shared
echo $WORKDIR
/shared
Note: All of your jobs execute from your $WORKDIR directory on the compute cluster, not $HOME of the login node. Copy any important files to $HOME on the login node from the $WORKDIR of the compute cluster before deleting the cluster.
4.2. Shells
The following shells are available on HPCMP Cloud: csh, bash, tcsh, and sh. You may use the chsh command to change your default shell as you please.
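For example, to change your default shell to bash (a sketch; this assumes chsh is permitted for your account on the login node):

chsh -s /bin/bash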
4.3. Modules
Software modules are a convenient way to set needed environment variables and include necessary directories in your path so that commands for particular applications can be found.
Note: Currently no modules are loaded by default. The user must set up their modules manually.
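If the environment-modules package is available in your environment, a typical session looks like the following sketch; the module name shown is only a placeholder and may not exist on your system:

module avail                 # list the modules available on the system
module load mpi/latest       # load a module (placeholder name)
module list                  # show the modules currently loaded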
4.4. Archive Usage
No archive process is in place. Users may store their files in $HOME on the Login Node.
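For example, results can be bundled on the cluster's master node and copied back to your login-node $HOME before the cluster is deleted (a sketch; the directory and file names are placeholders, and agent forwarding or a key on the master node is assumed):

tar -czf results.tar.gz /shared/results                   # bundle the results directory
scp results.tar.gz username@52.222.57.229:/home/username  # copy the archive to login-node $HOME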
5. Program Development
5.1. Programming Models
Intel MPI
5.2. Available Compilers
GNU Compiler Collection
The GNU Programming Environment provides a large number of options that are the same for all compilers in the suite. The following table lists some of the more common options that you may use:
5.2.1. GNU Compiler Options
| Option | Purpose |
|---|---|
| -c | Generate intermediate object file but do not attempt to link. |
| -I directory | Search in directory for include or module files. |
| -L directory | Search in directory for libraries. |
| -o outfile | Name executable "outfile" rather than the default "a.out". |
| -Olevel | Set the optimization level. For more information on optimization, see the section on Profiling and Optimization below. |
| -g | Generate symbolic debug information. |
| -fconvert=big-endian | Big-endian files; the default is little-endian. |
| -Wextra -Wall | Turns on increased error reporting. |
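As an illustration of combining these options, the hypothetical command below compiles a C source file with level-2 optimization, debug symbols, and increased warnings; the file and directory names are placeholders:

gcc -O2 -g -Wall -Wextra -I $HOME/include -L $HOME/lib -o myprog myprog.c -lm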
5.3. Debuggers, Code Profiling, and Optimization
No licensed utilities are available at this time. The GNU Debugger (gdb) is available.
5.3.1. Compiler Optimization Options
The "-Olevel" option enables code optimization when compiling. The level that you choose (0-4) will determine how aggressive the optimization will be. Increasing levels of optimization may increase performance significantly, but you should note that a loss of precision may also occur. There are also additional options that may enable further optimizations. The following table contains the most commonly used options.
| Option | Description |
|---|---|
| -O0 | No optimization. (default in GNU) |
| -O1 | Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimization. |
| -O2 | Level 1 plus traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer. Generally safe and beneficial. (default in PGI, Cray, & Intel) |
| -O3 | Levels 1 and 2 plus more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable. Generally beneficial. |
5.3.2. Performance Optimization Methods
Optimization generally increases compilation time and executable size, and may make debugging difficult. However, it usually produces code that runs significantly faster. The optimizations that you can use will vary depending on your code and the system on which you are running.
Before considering optimization, you should always ensure that your code runs correctly and produces valid output.
In general, there are four main categories of optimization:
- Global Optimization
- Loop Optimization
- Interprocedural Analysis and Optimization (IPA)
- Function Inlining
6. Job Scheduling
HPCMP Cloud is intended as an elastic, scalable and on-demand cloud infrastructure to run your HPC applications. A dedicated cluster is stood up by the user for the duration of their job(s) and then taken down once the job has completed.
The Login Node serves as the single point of entry into the environment, and this is where the user performs cluster creation and deletion with the ParallelCluster utility available on the login node (described in detail below). ParallelCluster creates a master or head node along with a fleet of compute nodes as specified by the user. The user submits and monitors jobs from the master node.
7. Setting Up MFA Access
Your permissions within the AWS environment are limited by default for enhanced security. To gain elevated privileges needed to launch a cluster and perform other related functions, you must obtain temporary credentials that authenticate you using your Multi-factor Authentication (MFA) token (shown below) for 12 hours (or the length of your session).

If your SSH session is terminated, you must run the request_session.sh shell script again to re-authenticate with MFA.
For a first-time user, the following configuration is needed from the Login Node:
vi request_session.sh #or text editor of choice
Edit the snumber parameter to equal the Serial Number on the back of your MFA token, save and quit.
#!/bin/bash
# Store your MFA Token's Serial Number
snumber=SPBT<your-serial-number>
...
Now run the script as follows:
. ./request_session.sh
Please enter your token code (obtained from your MFA token): <enter your 6 digit token code>
Authentication Successful! Your session will expire in 12 hours.
You are now authenticated with elevated permissions.
8. Importing Your Own SSH Key Pair
AWS EC2 key pairs are required to access EC2 instances via SSH.
8.1. From the AWS CLI (from a local Linux system)
Using the following command locally saves the private key with a name you specify, for example, MyKeyPair.pem:
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
Change the permissions on the key to prevent tampering
chmod 400 MyKeyPair.pem
8.2. From the AWS Console (from a local Windows system)
- Open the Amazon EC2 console at https://console.amazonaws-us-gov.com/ec2/home?region=us-gov-west-1#KeyPairs
- Choose Create key pair.
- For Name, enter a descriptive name for the key pair, for example, MyKeyPair. The name can include up to 255 ASCII characters. It cannot include leading or trailing spaces.
- For File format, choose the format in which to save the private key. To save the private key in a format that can be used with OpenSSH, choose pem. To save the private key in a format that can be used with PuTTY, choose ppk.
- Choose Create key pair.
- The private key should download.
8.3. Retrieving the public key for your key pair on Linux
On your local Linux or macOS computer, you can use the ssh-keygen command to retrieve the public key for your key pair. Specify the path where you downloaded your private key (the .pem file).
ssh-keygen -y -f /path_to_key_pair/my-key-pair.pem
The command returns the public key, as shown in the following example.
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQClKsfkNkuSevGj3eYhCe53pcjqP3maAhDFcvBS7O6V hz2ItxCih+PnDSUaw+WNQn/mZphTk/a/gU8jEzoOWbkM4yxyb/wB96xbiFveSFJuOp/d6RJhJOI0iBXr lsLnBItntckiJ7FbtxJMXLvvwJryDUilBMTjYtwB+QhYXUMOzce5Pjz5/i8SeJtjnV3iAoG/cQk+0FzZ qaeJAAHco+CY/5WrUBkrHmFJr6HcXkvJdWPkYQS3xqC0+FmUZofz221CBt5IMucxXPkX4rWi+z7wB3Rb BQoQzd8v7yeb7OzlPnWOyN0qFU0XA246RA8QFYiCNYwI3f05p6KLxEXAMPLE
If the command fails, run the following command to ensure that you've changed the permissions on your key pair file so that only you can view it.
chmod 400 my-key-pair.pem
These steps can also be performed on Windows using PuTTYgen.
8.4. Importing keys into AWS
If you created a key in the 'west' region, for example, you can import it into the 'east' region as follows:
- Navigate to the east region Key Pairs page: https://us-gov-east-1.console.amazonaws-us-gov.com/ec2/home?region=us-gov-east-1#KeyPairs.
- Click on Actions, then Import key pair.
- Name your key. It can be the same as for the west region. Then paste in the public key information that was obtained in the previous section, and click Import key pair. Your key is now available in the east region.
8.5. Replacing your login node key pair
- Create a new key pair using the Amazon EC2 console or a third-party tool.
- Retrieve the public key from your new key pair as shown above.
- Connect to your instance using your existing private key file.
- Using a text editor of your choice, open the .ssh/authorized_keys file on the instance. Paste the public key information from your new key pair underneath the existing public key information. Save the file.
- Disconnect from your instance, and test that you can connect to your instance using the new private key file.
- When you're replacing an existing key pair, connect to your instance and delete the public key information for the original key pair from the .ssh/authorized_keys file.
9. Configuring AWS CLI
The AWS CLI has already been installed on the login node. You can configure it using your credentials. The AWS Access Key ID and AWS Secret Access Key are your AWS credentials. They are associated with an AWS Identity and Access Management (IAM) user or role that determines what permissions you have.
Access keys consist of an access key ID and secret access key, which are used to sign programmatic requests that you make to AWS. If you don't have access keys, you can create them from the AWS Management Console.
The only time that you can view or download the secret access key is when you create the keys. You cannot recover them later. However, you can create new access keys at any time.
Note: Your account may already have an access key created and shown from the time of account creation. If your account lists keys whose secret access key you do not have, delete them and create a new key as shown below.
To create access keys for an IAM user:
- Sign in to the AWS Management Console and open the IAM console by clicking your username, then "My Security Credentials."
- In the Access keys section, choose Create access key.
- To view the new access key pair, choose Show. You will not have access to the secret access key again after this dialog box closes. Your credentials will look something like this:
  - Access key ID: AKIAIOSFODNN7EXAMPLE
  - Secret access key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
- To download the key pair, choose Download .csv file. Store the keys in a secure location. You will not have access to the secret access key again after this dialog box closes.
- Keep the keys confidential in order to protect your AWS account, and never email them. Do not share them outside your organization, even if an inquiry appears to come from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your secret key.
- After you download the .csv file, choose Close. When you create an access key, the key pair is active by default, and you can use the pair right away.
Set up your AWS credentials using the access key that you just created. Make sure you enter the correct region name.
aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [us-east-1]: us-gov-west-1
Default output format [None]:
10. Launching a Cluster
Below is a sample configuration. Your cloud admin (hpcaceadmin@jaspersolutions.com) will provide you with the appropriate login node $HOME/.parallelcluster/config file to use based on your use case.
Note: The following tags are required for launching the cluster: hpcmp:user-name, hpcmp:group-name, hpcmp:project-name. They must be included in the [cluster default] section of your config file, as follows:
tags = {"hpcmp:user-name": "CLDproj00XX", "hpcmp:group-name": "dev", "hpcmp:project-name": "test"}
10.1. Configure ParallelCluster
pcluster configure
From the list of valid AWS Region identifiers, choose the us-gov-west-1 Region in which you want your cluster to run.
Allowed values for the AWS Region ID:
1. us-gov-east-1
2. us-gov-west-1
AWS Region ID [us-gov-west-1]:
Pick the EC2 Key Pair that you wish to use. Look at the section "Importing Your Own SSH Key Pair" (above) to create your own key pair.
Allowed values for EC2 Key Pair Name:
1. MyKeyPair
EC2 Key Pair Name [MyKeyPair]:
Choose the scheduler to use with your cluster. SLURM works well with PBS scripts.
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [sge]:
Choose the operating system. CentOS is not supported in AWS GovCloud.
Allowed values for Operating System:
1. alinux
2. alinux2
3. ubuntu1604
4. ubuntu1804
Operating System [alinux]:
The minimum and maximum size of the cluster of compute nodes is entered. This is measured in number of instances.
Minimum cluster size (instances) [0]:
Maximum cluster size (instances) [10]:
The master and compute nodes instance types are entered.
Master instance type [t2.micro]:
Compute instance type [t2.micro]:
Do not create a new VPC. Please select the existing VPC-Test.
Automate VPC creation? (y/n) [n]: n
Allowed values for VPC ID:
1. vpc-90bba9f4 | VPC-Test      | 2 subnets inside
2. vpc-ad9e8cc9 | VPC-JasperDev | 1 subnets inside
3. vpc-d7090db3 | Default       | 0 subnets inside
VPC ID [vpc-0b4ad9c4678d3c7ad]: 1
After the VPC has been selected, decide whether to use existing subnets or create new ones. Please use the existing subnets in the Test VPC. Pick the public subnet for the master instance and the private subnet for the compute instances.
Automate Subnet creation? (y/n) [y]: n
Allowed values for Master Subnet ID:
1. subnet-8524d6e2 | Subnet-TestPrivate | Subnet size: 16384
2. subnet-f23fcd95 | Subnet-TestPublic  | Subnet size: 1024
Master Subnet ID [subnet-027125c6a81d73006]: 2
Allowed values for Compute Subnet ID:
1. subnet-8524d6e2 | Subnet-TestPrivate | Subnet size: 16384
2. subnet-f23fcd95 | Subnet-TestPublic  | Subnet size: 1024
Compute Subnet ID [subnet-093c3f1589a870ff0]: 1
Configuration is complete.
Configuration file written to /home/"user"/.parallelcluster/config
You can edit your configuration file or simply run 'pcluster create -c /home/"user"/.parallelcluster/config cluster-name' to create your cluster
10.2. Launch Cluster Command
When all settings contain valid values, you can launch the cluster by running the create command where 'mycluster' is the cluster name.
$ pcluster create mycluster
After the cluster reaches the "CREATE_COMPLETE" status, you can connect to it by using your normal SSH client settings.
$ pcluster ssh mycluster -i ~/.ssh/mysshkey.pem
OR
$ ssh -i MyKeyPair.pem ec2-user@<Master server public IP>
10.3. AMI List and Config Files
Each config file will create a cluster with the associated Amazon Machine Image (AMI), wherein the master node of the cluster will have 50 GiB storage by default, and a "/shared" EBS volume that can be set as the $WORKDIR.
These config files are intended to serve as templates that get the user started and allow further customization. The user is required to fill in the tags in the config file. The config files are stored in the dev-parallelcluster-config-files S3 bucket (see the Recommended Workflow section for an example aws s3 cp command that downloads one).
Below is a catalog of AMIs as well as the packages each AMI contains. Pick the appropriate AMI from the list; total cores = number of nodes × cores per node.
| Use Case | AMI ID | Region | Packages | Config File | Resources per Node |
|---|---|---|---|---|---|
| baseline | ami-74c4f215 | West | Base alinux2 created by AWS, used to build other AMIs | N/A | |
| baseline_al2_EFA_w | ami-05284c732a36a5e4a | West | gcc, gmake, cmake, HDF5, LAPACK, git, netcdf, Intel MPI and EFA | baseline_moderate_west, baseline_moderate_efs_west | 8 cores |
| avfleslie_al2_w | ami-0240f2a3ae87bb116 | West | baseline + avfleslie | avfleslie_west | 36 cores |
| FFTW_al2_EFA_w | ami-06b77e929540b0ceb | West | baseline + FFTW | FFTW_west | 36 cores |
| hycom_al2_EFA_w | ami-0dc68ef211321e106 | West | baseline + hycom | hycom_efs_west | 36 cores |
| tensorflow_cuda_al2_w | ami-0eefb36487bd23928 | West | cuda + anaconda + TensorFlow | tensorflow_efs_west | 8 cores, 1 GPU |
| baseline_al2_EFA_e | ami-03ce8abecc3ef1ad8 | East | gcc, gmake, cmake, HDF5, LAPACK, git, netcdf, Intel MPI and EFA | baseline_moderate_east, baseline_moderate_efs_east | 8 cores |
| emacs_al2_EFA_e | ami-0b8b9c58bd9d22181 | East | Baseline EFA with emacs | emacs_baseline_moderate_east | 8 cores |
| avfleslie_al2_e | ami-0e16badb17352c7c7 | East | baseline + avfleslie | avfleslie_east | 36 cores |
| FFTW_al2_EFA_e | ami-0a826ce9427a5a884 | East | baseline + FFTW | FFTW_east | 36 cores |
| hycom_al2_EFA_e | ami-0aeb3ec0a320e89f2 | East | baseline + hycom | hycom_efs_east | 36 cores |
| tensorflow_cuda_al2_e | ami-06a6d709837066ea8 | East | baseline + cuda + tensorflow | tensorflow_efs_east | 8 cores, 1 GPU |
10.3.1. AMI Tree
11. Running a Job
To verify the current state of your cluster, enter the following command (with the SLURM scheduler). It lists the cluster's partitions and the state of their nodes. For example,
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch        up   infinite      2  alloc adev[8-9]
batch        up   infinite      6   idle adev[10-15]
debug*       up      30:00      8   idle adev[0-7]
Set up the working directory variable to point to the shared EBS drive:
WORKDIR=/shared
Submit the job:
sbatch <script-name>
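A minimal Slurm batch script might look like the following sketch; the job name, node and task counts, and executable name are placeholders to adapt to your application:

#!/bin/bash
#SBATCH --job-name=myjob          # placeholder job name
#SBATCH --nodes=2                 # number of compute nodes to use
#SBATCH --ntasks-per-node=8       # MPI ranks per node
#SBATCH --output=myjob.%j.out     # file for combined stdout/stderr

cd /shared                        # the shared EBS working directory ($WORKDIR)
srun ./myapp                      # launch the (placeholder) MPI executable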
Monitor the job:
squeue
12. Deleting the Cluster
Once a job has completed, the cluster and all of its associated resources can be deleted with the following command issued from the Login Node.
pcluster delete mycluster
Note: All data on the cluster that hasn't been moved to the Login Node will be deleted along with the cluster.
13. Batch Scheduling
13.1. Scheduler
SLURM (Simple Linux Utility for Resource Management) is currently running on the HPCMP Cloud master node. It is a software package for submitting, scheduling, and monitoring jobs on large compute clusters. Slurm is automatically loaded for you when you log in.
13.2. Interactive Logins
When you log in to the compute cluster, you will be running in an interactive shell on the master node. You can start a Slurm interactive session, for example:
srun --pty bash -i
Srun's --pty option runs task zero in pseudo terminal mode. Bash's -i option tells it to run in interactive mode (with prompts).
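If you need specific resources for the interactive session, srun accepts the same resource options as sbatch; for example (node and task counts are placeholders):

srun --nodes=1 --ntasks=4 --pty bash -i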
14. Recommended Workflow
The process below gives an overview of the Login Node and the compute cluster, with a sample workflow that a user might employ to run a job.
Note: It is assumed that the user already has an account on AWS and has performed the initial first-time login steps necessary to set up their accounts on the Login Node.
Transfer your files into your home directory of your login node by replacing username with yours.
scp -i <your-key> test.txt username@52.222.57.229:/home/username
Sign onto the Login Node.
ssh -A -i <your-key> username@52.222.57.229
Obtain temporary elevated credentials.
. ./request_session.sh
Download a pre-configured config file to use with ParallelCluster from the designated S3 bucket.
aws s3 cp s3://dev-parallelcluster-config-files/East/Baseline/baseline_moderate_east $HOME/.parallelcluster/
Modify the config file to your specifications.
vi .parallelcluster/baseline_moderate_east

[aws]
aws_region_name = us-gov-east-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster default]
key_name = MyKeyPairEast
base_os = alinux2
scheduler = slurm
initial_queue_size = 1
max_queue_size = 10
master_instance_type = t3.micro
compute_instance_type = c5n.4xlarge
disable_hyperthreading = true
placement_group = DYNAMIC
custom_ami = ami-03ce8abecc3ef1ad8
master_root_volume_size = 50
tags = {"hpcmp:user-name": "CLDproj00 ", "hpcmp:group-name": " ", "hpcmp:project-name": " "}

[vpc default]
vpc_id = vpc-0c0abc0cb9540daaa
master_subnet_id = subnet-0b37638f9ca65ea1e
compute_subnet_id = subnet-00cf48ac5bec8c64b
Launch the cluster. Be sure to point to the config file that you edited.
pcluster create <cluster-name-of-choice> -c ~/.parallelcluster/baseline_moderate_east
Copy necessary files, code, and scripts to the master node.
scp -i <your-key> test.txt ec2-user@<master node ip address>:~
SSH to the master node.
ssh ec2-user@<master node ip address>
Set up the working directory. A 20 GiB EBS volume is automatically attached to the cluster. Designate this EBS volume as $WORKDIR.
WORKDIR=/shared
Compile your code.
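For example, an MPI code could be compiled with the Intel MPI compiler wrapper included in the baseline AMIs (a sketch; the source and executable names are placeholders):

mpicc -O2 -o myapp myapp.c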
Set up your job script.
Submit your job.
sbatch <script-name>
Monitor your job.
squeue
Move output files from the master node to the Login Node.
scp <output-files> username@52.222.57.229:/home/username
Delete the cluster.
pcluster delete <cluster-name-of-choice>