AFRL DSRC Introductory Site Guide
Table of Contents
- 1. Introduction
- 1.1. Purpose of this document
- 1.2. About the AFRL DSRC
- 1.3. Whom our services are for
- 1.4. How to get an account
- 1.5. Visiting the AFRL DSRC
- 2. Policies
- 2.1. Baseline Configuration (BC) policies
- 2.2. Login node abuse policy
- 2.3. File space management policy
- 2.4. Maximum session lifetime policy
- 2.5. Batch use policy
- 2.6. Special request policy
- 2.7. Account removal policy
- 2.8. Communications policy
- 2.9. System availability policy
- 2.10. Data import and export policy
- 2.11. Account sharing policy
- 3. Available resources
- 3.1. HPC systems
- 3.2. Data storage
- 3.3. Computing environment
- 3.4. HPC Portal
- 3.5. Secure Remote Desktop (SRD)
- 3.6. Network connectivity
- 4. How to access our systems
- 5. How to get help
- 5.1. User Productivity Enhancement and Training (PET)
- 5.2. User Advocacy Group (UAG)
- 5.3. Baseline Configuration Team (BCT)
- 5.4. Computational Research and Engineering Acquisition Tools and Environments (CREATE)
- 5.5. Data Analysis and Assessment Center (DAAC)
1. Introduction
1.1. Purpose of this document
This document introduces users to the U.S. Air Force Research Laboratory (AFRL) DoD Supercomputing Resource Center (DSRC). It provides an overview of available resources, links to relevant documentation, essential policies governing the use of our systems, and other information to help you make efficient and effective use of your allocated hours.
1.2. About the AFRL DSRC
The AFRL DSRC is one of five DSRCs managed by the DoD High Performance Computing Modernization Program (HPCMP). The DSRCs deliver a range of compute-intensive and data-intensive capabilities to the DoD science and technology, test and evaluation, and acquisition engineering communities. Each DSRC operates and maintains major High Performance Computing (HPC) systems and associated infrastructure, such as data storage, in both unclassified and classified environments. The HPCMP provides user support through a centralized help desk and data analysis/visualization group.
The AFRL DSRC is an HPC facility committed to providing the resources necessary for DoD scientists and engineers to complete their research, development, testing, and evaluation projects. Since our inception in 1996 as part of the HPCMP, we have supported the warfighter by combining powerful computational resources, secure interconnects, and application software with renowned services, expertise, and experience.
1.3. Whom our services are for
The HPCMP's services are available to Service and Agency researchers in the DoD Research, Development, Test, and Evaluation (RDT&E) and acquisition engineering communities, to DoD contractors supporting those communities, and to university staff working on a DoD research grant.
For more details, see the HPCMP presentation "Who may run on HPCMP Resources?"
1.4. How to get an account
Anyone meeting the above criteria may request an HPCMP account. An HPC Help Desk video is available to guide you through the process of getting an account. To begin the account application process, visit the Obtaining an Account page and follow the instructions presented there.
1.5. Visiting the AFRL DSRC
If you need to travel to the AFRL DSRC, there are security procedures that must be completed BEFORE planning your trip. Please see our Visit section and coordinate with your Service/Agency Approval Authority (S/AAA) to ensure all requirements are met.
2. Policies
2.1. Baseline Configuration (BC) policies
The Baseline Configuration Team sets policies that apply to all HPCMP HPC systems. The BC Policy Compliance Matrix provides an index of all BC policies and compliance status of systems at each DSRC.
2.2. Login node abuse policy
Interactive usage of the AFRL DSRC HPC systems is restricted to 15 minutes of process time per core. Any interactive job or process that exceeds this limit is automatically terminated by system monitoring software. Interactive usage should be limited to tasks such as program development (including debugging and performance improvement), job preparation, job submission, and the pre- and post-processing of data.
AFRL DSRC HPC systems have been tuned for optimal performance in a batch system environment. Excessive interactive usage causes overloading of these systems and leads to considerable degradation of system performance.
2.3. File space management policy
Close management of workspace is a priority of the AFRL DSRC. Each user is provided the following workspace:
$WORKDIR: a "scratch" file system available on each HPC system.
$CENTER: the center-wide file system accessible to all center production machines.
Neither of these file systems is backed up. You are responsible for managing files in your $WORKDIR and $CENTER directories by backing up files to the archive system and deleting unneeded files. Currently, $WORKDIR files older than 21 days and $CENTER files older than 120 days are subject to being purged.
If it is determined as part of the normal purge cycle that files in your $WORKDIR directory must be deleted, we will notify you via email 6 days prior to deletion.
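As an illustration, standard shell commands can help you identify files approaching these purge ages. The sketch below assumes purge eligibility tracks file modification time; it is not an official purge-prediction tool.
```
# List files in $WORKDIR not modified within the last 21 days (purge candidates)
find $WORKDIR -type f -mtime +21 -ls

# List files in $CENTER not modified within the last 120 days
find $CENTER -type f -mtime +120 -ls
```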
2.4. Maximum session lifetime policy
To provide users with a more secure high performance computing environment, the AFRL DSRC has implemented a limit on the lifetime of all terminal/window sessions. Any idle terminal or window session connections to the AFRL DSRC are terminated after 10 hours. Regardless of activity, any terminal or window session connections to the AFRL DSRC are terminated after 20 hours. A 15-minute warning message is sent to each such session prior to its termination.
2.5. Batch use policy
A single user may use up to 1/2 of the advertised number of cores on a system across one or more jobs. Once this limit is reached, the user's additional jobs do not start or accrue eligible time until some of the running jobs finish. The standard limit on job run time is 168 hours. Upon special request, jobs with run times up to 336 hours are permitted, but such jobs run at risk.
In cases where a system has specialty nodes (large-memory nodes, GPGPU nodes, etc.), the scheduler can reserve those nodes for jobs that require them.
Due to limitations in resource checking, the scheduler allocates cores and memory at the node level. If the requested number of cores or amount of memory is not an even multiple of a node's cores or memory, the request is rounded up to the next whole number of nodes, and the user is charged for all of the allocated nodes. For example, on a system with 128-core nodes, a request for 300 cores is allocated, and charged for, three nodes (384 cores).
Although every attempt is made to keep entire systems available, interrupts can occur, and they occur more frequently on systems with larger numbers of nodes. Where available, users should use mechanisms to save the state of their jobs (most AFRL DSRC-supported applications can create restart files so runs do not have to start over from the beginning) to protect against system interrupts. Users running long jobs without saving job state run at risk with respect to system interrupts. Use of system-level checkpointing is not recommended.
All HPC systems have identical queue names: urgent, debug, HIE, high, frontier, standard, transfer, and background; however, each queue has different properties, as specified in the table below. Each queue is assigned a priority factor within the batch system, and the relative priorities of the queues are shown in the table. In addition, jobs requesting more cores receive a further increase in overall priority proportional to the number of cores, and jobs in queues other than background accrue additional priority based on time in queue. Job scheduling reserves job slots based on these priority factors and improves system utilization by backfilling smaller jobs while higher-priority jobs wait for resources to become available.
| Priority | Queue Name | Max Wall Clock Time | Max Cores Per Job | Description |
|---|---|---|---|---|
| Highest | urgent | 168 Hours | 1/2 of the system | Jobs belonging to DoD HPCMP Urgent Projects |
| | debug | 1 Hour | | Time/resource-limited for user testing and debug purposes |
| | high | 168 Hours | 1/2 of the system | Jobs belonging to DoD HPCMP High Priority Projects |
| | frontier | 168 Hours | 1/2 of the system | Jobs belonging to DoD HPCMP Frontier Projects |
| | standard | 168 Hours | 1/2 of the system | Standard jobs |
| | HIE | 24 Hours | 2 nodes | Rapid response for interactive work |
| | transfer | 48 Hours | 1 core | Data transfer for user jobs |
| Lowest | background | 120 Hours | 1 node | Jobs that are not charged against the project allocation |
In conjunction with the HPCMP Baseline Configuration policy for common queue names across the allocated centers, the AFRL DSRC honors batch jobs that specify the urgent, high (high priority), or frontier queue by name. Because AFRL assigns the queue based on the project number, a job whose queue name is omitted is still placed in the correct queue. Note: if the specified queue does not match the project number, the job runs in the queue determined by the project number, not in the queue selected by name.
Any project with an allocation may submit jobs to the debug, HIE, transfer, or background queue. Projects that have exhausted their allocations are only able to submit jobs to the background queue. A background job cannot start if there is a foreground job waiting in any queue.
If any job attempts to use more resources than were specified when the job was submitted for batch processing, the scheduling system automatically kills the job.
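For illustration only, a minimal job script for a PBS-style scheduler might look like the sketch below; the project ID, core count, and launcher are placeholders, and directives vary by system, so consult the Scheduler Guides and system user guides for the syntax your system actually uses.
```
#!/bin/bash
## Hypothetical PBS-style job script -- directives and values are placeholders.
#PBS -A MYPROJECT01          # project (allocation) ID
#PBS -q standard             # queue name (see table above)
#PBS -l select=2:ncpus=128   # whole nodes are allocated and charged
#PBS -l walltime=24:00:00    # must not exceed the 168-hour standard limit
#PBS -j oe                   # merge stdout and stderr

cd $WORKDIR                  # run from scratch space, not $HOME
mpiexec ./my_app input.dat   # launcher name is system-dependent
```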
2.6. Special request policy
All special requests for allocated HPC resources, including increased priority within queues, increased queue limits on maximum cores and wall time, and dedicated use, should be directed to the HPC Help Desk. Approval requires documentation of the requirement and its justification, verification by the AFRL DSRC support staff, and sign-off by the designated authority shown in the following table. The AFRL DSRC Director may grant special requests for HPC resources outside this model in exceptional circumstances.
Resource Request | Approval Authority |
---|---|
Up to 10% of an HPC system/complex for 1 week or less | AFRL DSRC Director or Designee |
Up to 20% of an HPC system/complex for 1 week or less | S/AAA |
Up to 30% of an HPC system/complex for 2 weeks or less | Army/Navy/AF Service Principal on HPC Advisory Panel |
Up to 100% of an HPC system/complex for greater than 2 weeks | HPCMP Program Director or Designee |
2.7. Account removal policy
AFRL is fully compliant with Baseline Configuration (BC) policy FY13-02 (Data Removal at account closure).
2.8. Communications policy
AFRL is fully compliant with Baseline Configuration (BC) policy FY06-11 (Announcing and Logging Changes). It is the user's responsibility to ensure his/her contact information is current in the Portal to the Information Environment (pIE). If your information is not current, please contact your S/AAA.
2.9. System availability policy
A system is declared down and made unavailable to users whenever a chronic or catastrophic hardware or software malfunction, or an abnormal computing environment condition, exists that could:
- Result in corruption of user data.
- Result in unpredictable and/or inaccurate runtime results.
- Result in a violation of the integrity of the DSRC user environment.
- Result in damage to the High Performance Computer System(s).
The integrity of the user environment is considered compromised any time a user must modify his/her normal operation while logged into the DSRC. Examples of malfunctions are:
- User home ($HOME) directory not available.
- User Workspace ($WORKDIR, $JOBDIR) areas not available.
- Archive system unavailable (in this case, queues are suspended, but logins remain enabled).
When a system is declared down, based on a system administrator's and/or computer operator's judgment, users are prevented from using the affected system(s) and all existing batch jobs are prevented from running. Batch jobs held during a "down state" are run only after the system environment returns to a normal state.
Whenever there is a problem on one of the HPC systems that could be remedied by removing part of the system from production (an activity called draining), the extent of the impact must first be determined so the necessary levels of management and the user community can be briefed.
If the architecture of the HPC system allows a node to be removed from production with minimal impact, the system administrators can decide to remove the node, notifying the operators for information. This typically pertains to cluster architectures. In some cases, large SMP systems allow individual CPUs to be taken down; the system administrator can make this determination and notify operations for information.
If the architecture of the HPC system allows significant portions of the system to be removed from production and still allow user production on a large part of the system to continue, then the system administrator along with government and contractor management can make the decision to remove that part of the system. The system should show that domain or SMP node as out of the normal queue for scheduling jobs so the user community can determine the current status. The system administrator will advise operations and the HPC Help Desk of this action.
In cases where $WORKDIR will be unavailable, or a complete system needs to be drained for maintenance, contractor and government director-level management are notified. In cases involving an entire system, the HPC Help Desk emails users of the downtime schedule and schedule for returning the system to production.
2.10. Data import and export policy
This policy outlines the methods available to users for moving files into and out of the AFRL DSRC environment. Users accept sole responsibility for transferring their data and for validating it after the transfer.
For cross center data transfers, AFRL is fully compliant with Baseline Configuration (BC) policy FY06-14 (Cross-Center File Transfers). For transfers between AFRL and remote locations, see the guidance below.
2.10.1. Network file transfers
The preferred transfer method is over the network using the encrypted (Kerberos) file transfer programs scp or sftp. In cases where large numbers of files (> 1000) and/or large amounts of data (> 100 GB) must be transferred, users should contact the HPC Help Desk for assistance in the process.
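For example, after obtaining a Kerberos ticket (see Section 4), a transfer from your workstation might look like the following sketch; the hostname and username shown are placeholders.
```
# Copy a local file to your home directory on an AFRL DSRC system
scp results.tar.gz username@system.afrl.hpc.mil:

# Or open an interactive sftp session and use put/get
sftp username@system.afrl.hpc.mil
```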
2.10.2. Reading/Writing media
There are currently no facilities or provisions for importing or exporting user data on tape via the mass storage/archival system.
AFRL staff accept hard disk media for transferring large amounts of data. Disks are scanned and then mounted on local systems so AFRL staff can load or unload the data. The disk is then returned to the user. Please contact the HPC Help Desk for assistance with this process.
2.11. Account sharing policy
Users are responsible for all passwords, accounts, YubiKeys, and associated PINs issued to them. Users must not share their passwords, accounts, YubiKeys, or PINs with any other individual for any reason. Doing so is a violation of the contract users are required to sign to obtain access to DoD High Performance Computing Modernization Program (HPCMP) computational resources.
Upon discovery/notification of a violation of the above policy, the following actions are taken:
- The account (i.e., username) is disabled. No further logins are permitted.
- All account assets are frozen. File and directory permissions are set so no other users can access the account assets.
- Any executing jobs are permitted to complete; however, any jobs residing in input queues are deleted.
- The Service/Agency Approval Authority (S/AAA) who authorized the account is notified of the policy violation and the actions taken.
Upon the first occurrence of a violation of the above policy, the S/AAA has the authority to request the account be re-enabled. Upon the occurrence of a second or subsequent violation of the above policy, the account is only re-enabled if the user's supervisory chain of command, S/AAA, and the High Performance Computing Modernization Office (HPCMO) all agree the account should be re-enabled.
The disposition of account assets is determined by the S/AAA. The S/AAA can:
- Request account assets be transferred to another account.
- Request account assets be returned to the user.
- Request account assets be deleted, and the account closed.
If there are associate investigators who need access to AFRL DSRC computer resources, we encourage them to apply for an account. Separate account holders may access common project data as authorized by the project leader.
3. Available resources
3.1. HPC systems
The AFRL DSRC unclassified HPC systems are accessible through the Defense Research and Engineering Network (DREN) to all active users. Our current HPC systems include:
Raider is a Penguin Computing TrueHPC system located at the AFRL DSRC. It has 1,480 standard compute nodes, 8 large-memory nodes, 24 visualization nodes, 32 MLA nodes, and 64 high clock nodes (a total of 199,680 compute cores). It has 447 TB of memory and is rated at 9 peak PFLOPS.
See the Systems page for more information about Raider.
Warhawk is an HPE Cray EX system located at the AFRL DSRC. It has 1,024 standard compute nodes, 4 large-memory nodes, 24 1-GPU visualization nodes, and 40 2-GPU Machine-Learning nodes (a total of 1,092 compute nodes or 139,776 compute cores). It has 564 TB of memory and is rated at 6.86 peak PFLOPS.
See the Systems page for more information about Warhawk.
3.2. Data storage
3.2.1. File systems
Each HPC system has several file systems available for storing user data. Your personal directories on these file systems are commonly referenced via the $HOME, $WORKDIR, $CENTER, and $ARCHIVE_HOME environment variables. Other file systems may be available as well.
Environment Variable | Description |
---|---|
$HOME | Your home directory on the system |
$WORKDIR | Your temporary work directory on a high-capacity, high-speed scratch file system used by running jobs |
$CENTER | Your short-term (120-day) storage directory on the Center-Wide File System (CWFS) |
$ARCHIVE_HOME | Your archival directory on the archive server |
For details about the specific file systems on each system, see the system user guides on the AFRL DSRC Documentation page.
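As a sketch of how these variables are typically used together (the application and file names below are hypothetical), a job might stage input from $HOME to $WORKDIR, run in scratch space, and copy results worth keeping to $CENTER before the purge cycle removes them.
```
cd $WORKDIR
mkdir -p case1 && cd case1
cp $HOME/inputs/case1.inp .         # stage input from your home directory
./my_solver case1.inp > case1.out   # run in high-speed scratch space
cp case1.out $CENTER/               # keep results on the 120-day CWFS
```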
3.2.2. Archive system
All our HPC systems have access to an online archival system, $ARCHIVE_HOST, which provides long-term storage for users' files on a petascale robotic tape library system. A 100-TB disk cache sits in front of the tape file system and temporarily holds files while they are being transferred to or from tape.
For information on using the archive server, see the AFRL DSRC Archive Guide.
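As a sketch, if an archive utility of the kind described in the Archive Guide is available on your system, moving files to and from $ARCHIVE_HOME might look like the commands below; the subcommand names are assumptions, so verify the exact interface in the AFRL DSRC Archive Guide.
```
archive ls                         # list files under $ARCHIVE_HOME (assumed subcommand)
archive put case1_results.tar.gz   # copy a file from the current directory to the archive
archive get case1_results.tar.gz   # retrieve a previously archived file
```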
3.3. Computing environment
To ensure a consistent computing environment and user experience on all HPCMP HPC systems, all systems follow a standard configuration baseline. For more information on the policies defining the baseline configuration, see the Baseline Configuration Compliance Matrix. All systems run variants of the Linux operating system, but the computing environment varies by vendor and architecture due to vendor-specific enhancements.
3.3.1. Software
Each HPC system hosts a large variety of compiler environments, math libraries, programming tools, and third-party analysis applications which are available via loadable software modules. A list of software is available on the Software page, or for more up-to-date software information, use the module commands on the HPC systems. Specific details of the computing environment on each HPC system are discussed in the system user guides, available on the AFRL DSRC Documentation page.
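For example, the module commands can be used to discover and manage software in your session; the module name below is only a placeholder.
```
module avail                  # list software modules available on the system
module load compiler/example  # load a module (placeholder name)
module list                   # show modules currently loaded
module unload compiler/example
```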
To request additional software or to request access to restricted software, please contact the HPC Help Desk at help@helpdesk.hpc.mil.
3.3.2. Bring your own code
While all HPCMP HPC systems offer a wide variety of open-source, commercial, and government software, there are times when we do not support the application codes and tools needed for a specific project. The following information describes a convenient way to use your own software on our systems.
Our HPC systems provide you with adequate file space to store your codes. Data stored in your home directory ($HOME) is backed up on a periodic basis. If you need more home directory space, you may submit a request to the HPC Help Desk at help@helpdesk.hpc.mil. For more details on home directories, see Baseline Configuration (BC) policy FY12-01 (Minimum Home Directory Size and Backup Schedule).
If you need to share an application among multiple users, BC policy FY10-07 (Common Location to Maintain Codes) explains how to create a common location on the $PROJECTS_HOME file system where applications and codes can be placed without using home directories or scrubbed scratch space. To request a new "project directory," please provide the following information to the HPC Help Desk:
- Desired DSRC system where a project directory is being requested.
- POC Information: Name of the sponsor of the project directory, username, and contact information.
- Short Description of Project: Short summary of the project describing the need for a project directory.
- Desired Directory Name: This is the name of the directory created under $PROJECTS_HOME.
- Is the code/data in the project directory restricted (e.g., ITAR)?
- Desired Directory Owner: The username to be assigned ownership of the directory.
- Desired Directory Group: The group name to be assigned to the directory (new group names must be eight characters or less).
- Additional users to be added to the group.
If the POC for the project directory ceases being an account holder on the system, project directories are handled according to the user data retention policies of the center.
Once the project directory is created, you can install software (custom or open source) in it. Then, depending on requirements, you can set file and/or directory permissions to allow any combination of group read, write, and execute privileges. Since the POC fully owns this directory, he/she can even use different groups on subdirectories to provide finer-grained permissions.
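As an illustrative sketch (the directory and group names are hypothetical), group ownership and permissions on a project directory can be managed with standard commands:
```
# Give the project group read/execute access and remove access for others;
# the setgid bit makes new files inherit the directory's group
chgrp -R myprojgrp $PROJECTS_HOME/MyProject
chmod -R g+rX,o-rwx $PROJECTS_HOME/MyProject
chmod g+s $PROJECTS_HOME/MyProject
```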
Users are expected to ensure that any software or data placed on HPCMP systems is protected according to any external restrictions on the data. Users are also responsible for ensuring no unauthorized or malicious software is introduced to the HPCMP environment.
For installations involving restricted software, it is your responsibility to set up group permissions on the directories and protect the data. It is crucially important to note that there are users on the HPCMP systems who are not authorized to access restricted data. You may not run servers or use software that communicates to a remote system without prior authorization.
If you need help porting or installing your code, the HPC Help Desk provides a "Code Assist" team that specializes in helping users with installation and configuration issues for user-supplied codes. To get help, simply contact the HPC Help Desk and open a ticket.
Please contact the HPC Help Desk to discuss any special requirements.
3.3.3. Batch schedulers
Our HPC systems use various batch schedulers to manage user jobs and system resources. Basic instructions and examples for using the scheduler on each system can be found in the system user guides. More extensive information can be found in the Scheduler Guides. These documents are available on the AFRL DSRC Documentation page.
Schedulers place user jobs into different queues based on the project associated with the user account. Most users only have access to the debug, standard, transfer, HIE, and background queues, but other queues may be available to you depending on your project. For more information about the queues on a system, see the Scheduler Guides.
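As a sketch for a PBS-style scheduler (commands differ if your system runs a different scheduler; see the Scheduler Guides), submitting and monitoring a job might look like this:
```
qsub run_case1.pbs   # submit a job script; the queue is assigned by project unless specified
qstat -u $USER       # check the status of your queued and running jobs
qdel 123456          # delete a job by its job ID (ID shown is a placeholder)
```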
3.3.4. Advance Reservation Service (ARS)
Another way to schedule jobs is through the ARS. This service allows users to reserve resources for use at specific times and for specific durations. The ARS works in tandem with the batch scheduler to ensure your job runs at the scheduled time and that all required resources (i.e., nodes, licenses, etc.) are available when your job begins. For information on using the ARS, see the ARS User Guide.
3.4. HPC Portal
The HPC Portal provides a suite of custom web applications, allowing you to access a command line, manage files, and submit and manage jobs from a browser. It also supports pre/post-processing and data visualization by making DSRC-hosted desktop applications accessible over the web. For more information about the HPC Portal, see the HPC Portal page.
3.5. Secure Remote Desktop (SRD)
The SRD enables users to launch a GNOME desktop on an HPC system via a downloadable Java interface client. This desktop is then piped to the user's local workstation (Linux, Mac, or Windows) for display. Once the desktop is launched, you can run any software application installed on the HPC system. For information on using SRD or to download the client, see the Secure Remote Desktop page on the DAAC website.
3.6. Network connectivity
The AFRL DSRC is a primary node on the Defense Research and Engineering Network (DREN), which provides up to 10-Gb/sec service to DoD HPCMP centers nationwide across a 100-Gb/sec backbone. We connect to the DREN via a 10-Gb/sec circuit linking us to the DREN backbone.
The DSRC's local network consists of a 100-Gb/sec fault-tolerant backbone with up to 40-Gb/sec connections to the HPC and archive systems.
4. How to access our systems
The HPCMP uses a network authentication protocol called Kerberos to authenticate user access to our HPC systems. Before you can log in, you must download and install an HPCMP Kerberos client kit on your local system. For information about downloading and using these kits, visit the Kerberos & Authentication page and click on the tab for your platform. There you will find instructions for downloading and installing the kit, getting a ticket, and logging in.
After installing and configuring a Kerberos client kit, you can access our HPC systems via standard Kerberized commands, such as ssh. File transfers between local and remote systems can be accomplished via the scp, mpscp, or scampi commands. For additional information on using the Kerberos tools, see the Kerberos User Guide or review the tutorial video on Logging into an HPC System. Instructions for logging into each system can be found in the system user guides on the AFRL DSRC Documentation page.
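As an example, on a Linux or Mac workstation with the HPCMP Kerberos kit installed, a typical session might look like the sketch below; the hostname is a placeholder, and the exact commands provided by your kit are described in the Kerberos User Guide.
```
kinit                              # obtain a Kerberos ticket (the kit prompts for your credentials)
klist                              # verify that the ticket was granted
ssh username@system.afrl.hpc.mil   # log in with Kerberized ssh
```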
Another way to access the HPC systems is through the HPC Portal. For information on using the portal, visit the HPC Portal page. To log into the portal, click on the link for the center where your account is located.
5. How to get help
For almost any issue, the first place you should turn for help is the HPC Help Desk. You can email the HPC Help Desk at help@helpdesk.hpc.mil. You can also contact the HPC Help Desk via phone, DSN, or even traditional mail. Full contact information for the Help Desk is on the Technical and Customer Support page. The HPC Help Desk can assist with a wide array of technical issues related to your account and your use of our systems. The HPC Help Desk can also assist in connecting you with various special-purpose groups to address your particular need.
5.1. User Productivity Enhancement and Training (PET)
The PET initiative gives users access to computational experts in many HPC technology areas. These HPC application experts help HPC users become more productive on HPCMP supercomputers. The PET initiative also leverages the expertise of academia and industry in new technologies and provides training on HPC-related topics. Help is available in specific computational technology areas, covering a wide range of expertise including algorithm development and implementation, code porting and development, performance analysis, application and I/O optimization, accelerator programming, preprocessing and grid generation, workflows, in-situ visualization, and data analytics.
To learn more about PET, see the Advanced User Support page. To request PET assistance, send an email to PET@hpc.mil.
5.2. User Advocacy Group (UAG)
The UAG provides a forum for users of HPCMP resources to influence policies and practices of the Program; to facilitate the exchange of information between the user community and the HPCMP; to serve as an advocate for HPCMP users; and to advise the HPC Modernization Program Office on policy and operational matters related to the HPCMP.
To learn more about the UAG, see the User Advocacy Group page (PKI required). To contact the UAG, send an email to hpc-uag@hpc.mil.
5.3. Baseline Configuration Team (BCT)
The BCT defines a common set of capabilities and functions so users can work more productively and collaboratively when using the HPC resources at multiple computing centers. To accomplish this, the BCT passes policies which collectively create a configuration baseline for all HPC systems.
To learn more about the BCT and its policies, see the Baseline Configuration page. To contact the BCT, send an email to BCTinput@afrl.hpc.mil.
5.4. Computational Research and Engineering Acquisition Tools and Environments (CREATE)
The CREATE program enhances the productivity of the DoD acquisition engineering workforce by providing high-fidelity design and analysis tools with capabilities greater than those of today's tools, reducing the acquisition development and test cycle. CREATE projects provide enhanced engineering design tools for the DoD HPC community.
To learn more about CREATE, visit the CREATE page or contact the CREATE Program Office at create@hpc.mil. You may also access the CREATE Community site (Registration and PKI required).
5.5. Data Analysis and Assessment Center (DAAC)
The DAAC serves the needs of DoD HPCMP scientists to analyze an ever-increasing volume and complexity of data. Its mission is to put visualization and analysis tools and services into the hands of every user.
For more information about DAAC, visit the DAAC website. To request assistance from DAAC, send an email to support@daac.hpc.mil.