MHPCC DSRC Archive Guide
How to Use This Document
Archiving your code, data, intermediate products, and results is an essential part of operating at the DSRCs. The archive system is your only means of accessible long-term storage while using the center's supercomputing resources. Some file systems are never backed up (e.g., $WORKDIR), while others have size limits or are not backed up as often as you need (e.g., $HOME). This guide covers how to best make use of archival capabilities in a variety of situations to ensure their efficiency and availability to all users.
Section 1 of this document provides basic information about the archive server, the process of archiving your data, and why this capability is important to you.
Section 2 details important guidelines and precautions for the proper and efficient use of the archive server to ensure maximum availability for all users.
Finally, Section 3 describes methods for building automated processes to tie together your compute jobs with their data retrieval and archival requirements.
Table of Contents
- 1. Archival Basics
- 1.1. Why do I need to archive my data?
- 1.2. How does archival work?
- 1.3. Accessing the archive file system
- 2. Important Guidelines
- 2.1. Use compressed tar files
- 2.2. Do not overwhelm the archive system
- 2.3. Do not directly use files in the archive
- 2.4. Use manifests
- 2.5. Treat important data with appropriate caution
- 3. Data Staging
- 3.1. What is data staging?
- 3.2. Staging in Compute Queues (Not Supported)
- 3.3. Staging from the Command Line (Manual Staging)
- 3.4. Staging in Transfer Queue Jobs (Batch Staging)
1. Archival Basics
1.1. Why do I need to archive my data?
The short answer is to free up system resources and protect your data.
Your work directory, $WORKDIR, resides on a large temporary file system that is shared with other users. This file system is intended to temporarily hold data that is needed or generated by your jobs. Since user jobs often generate a lot of data, the file system would fill up very quickly if everyone was allowed to just leave their files there indefinitely. This would negatively impact everyone and make the system unusable. To protect the system, an automated purge cycle may run to free up disk space by deleting older or unused files. And, if file space becomes critically low, ALL FILES, regardless of age, are subject to deletion. To avoid this, we strongly encourage you to archive the data you want and keep your $WORKDIR clean by removing unnecessary files. Remember your $WORKDIR is not backed up; so, if your files are purged, and you didn't archive them, they are gone forever!
1.2. How does archival work?
The archive system ($ARCHIVE_HOST) provides a long-term storage area for your important data. It is extremely large, and your personal archive directory ($ARCHIVE_HOME) has no quota. Even so, you probably don't want to archive everything you generate.
When you archive a file, it's copied to your $ARCHIVE_HOME directory on the archive server's disk cache, where it waits to be written to tape by the system. The disk cache is a large temporary storage area for files moving to and from tape. A file in the cache is said to be "online," while a file on tape is "offline." Once your file is written to tape, it may remain "online" for a short time, but eventually it is removed from the disk cache to make room for other files in transit. Both online and offline files show up in a directory listing, but offline files need to be retrieved from tape before you can use them.
Retrieval from tape can take a while, so be patient; there's a lot going on in the background. First, the system must determine on which tape (or tapes) your data resides. These are then robotically pulled from the tape library, mounted in one of the limited number of tape drives (assuming not all of them are busy), and wound into position before retrieval can begin. Your wait time depends on how many files you are retrieving, how big they are, how many tapes they're on, how full the disk cache is, how many other archival jobs are running, the network load (tape-to-cache and cache-to-HPC), and many other factors. After a delay, your data is retrieved from tape and available for use.
1.3. Accessing the archive file system
The archive file system is NFS-mounted to each HPC system, allowing you to perform archival tasks in a familiar Linux environment using standard commands, such as cp, mkdir, chmod, etc. Files can be archived/retrieved simply by copying them to/from $ARCHIVE_HOME.
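For example, the following commands archive a file and later retrieve it using nothing but standard Linux commands. This is only a minimal illustration; the file and directory names are placeholders.

mkdir $ARCHIVE_HOME/my_project                   # Create a directory in your archive area
cp my_results.tgz $ARCHIVE_HOME/my_project/      # Archive a file
ls -l $ARCHIVE_HOME/my_project                   # List archived files, online or offline
cp $ARCHIVE_HOME/my_project/my_results.tgz .     # Retrieve the file (may wait on tape retrieval)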
Important! When interacting with archived files from the command line, please remember, while files may appear to be readily accessible on disk, they may actually be on tape. If so, any action which reads or modifies a file automatically triggers the retrieval of the file from tape, causing the action to take much longer to complete than normal. Such actions include opening, copying, moving, removing, editing, compressing, or tarring a file that is already on tape. See Section 2.3 for more information.
Because the archive file system is a critical resource for all users, please report any errors generated by the archive system to the HPC Help Desk immediately, as they may indicate an issue requiring administrative intervention to resolve.
2. Important Guidelines
These guidelines are important to help safeguard stability of the archive server and to minimize negative impact to all users. Failure to observe these guidelines may result in loss of archival privileges.
2.1. Use compressed tar files
Always tar and compress your files before archiving them. This reduces archival overhead and file size and shortens archival time. The sole exception to this general rule is binary data, which does not always compress well. If you have binary data, you should still combine multiple files using tar, but compression may not be advantageous.
Archival overhead refers to the complex set of time-consuming actions that occur every time you archive or retrieve a file. Some of these actions are described in Section 1.2, but there are others as well. So, if you archive 100 individual files, those time-consuming actions must be performed 100 times. This can really add up. But if you combine those 100 files into a single tar file, those time-consuming actions happen only once.
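As a minimal sketch (directory and file names below are placeholders), combining an entire results directory into a single compressed tar file before archiving looks like this:

cd $WORKDIR
tar -czvf run42_results.tgz run42_results/       # Combine and compress the whole directory
cp run42_results.tgz $ARCHIVE_HOME/              # One archived file instead of hundreds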
Technically the only limit on the size of an archived file is the size of the file system disk cache. That said, the size of your archival file matters for three important reasons. The larger the file (1) the higher the likelihood the entire file will be lost if a tape error occurs, (2) the longer it will take to stage back from tape, and (3) the more quickly the file will be removed from the disk after staging.
Be careful not to make your files too big. The optimal tar file size at the MHPCC DSRC is about 10 TB; at that size, the time required for file transfer and tape I/O is still reasonable. Larger files are more likely to require the library to load a tape with more free space, greatly increasing archival and retrieval times. For these reasons, 10 TB is also the maximum recommended file size, and you are strongly encouraged not to archive files at or above that size. If your file would exceed this threshold, split it into multiple smaller files.
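If a single tar file would approach this limit, one way to split it is to create the archive in fixed-size pieces with the standard split utility. This is a sketch only; the piece size and names are illustrative, not a site requirement.

# Create the compressed archive directly in ~2-TB pieces (run42_results.tgz.part_aa, _ab, ...)
tar -czvf - run42_results/ | split -b 2T - run42_results.tgz.part_
cp run42_results.tgz.part_* $ARCHIVE_HOME/my_project/

# After retrieving the pieces, reassemble and extract in one step
cat run42_results.tgz.part_* | tar -xzvf -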
Also note, using compressed tar files can improve the performance of the archive server for all users because they consume less space on the archive server's disk cache, which benefits everybody. They also reduce your transfer time when moving data to or from the disk cache.
2.2. Do not overwhelm the archive system
Although the archive system provides enormous capacity, it is, in fact, limited in two important ways. The most significant limit is the number of tape drives, which determines the number of tapes that can be read from or written to at once. The second limit is the size of the disk cache, which determines how much data can be online at once.
Attempting to archive or retrieve too many files at once can fill up the disk cache on the archive server, halting archival and staging for all users. Even if the cache does not reach capacity, it could still tie up all available tape drives, impacting other users. To avoid this possibility, if you need to retrieve more than about 20 TB of data or more than about 300 files at once, please contact the HPC Help Desk for assistance.
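Before a large retrieval, it helps to check how many files and how much data you are about to request. A simple directory listing reads only metadata, so it should not pull files back from tape. The path below is a placeholder.

ls -l $ARCHIVE_HOME/my_project | wc -l    # Rough count of files in the directory
ls -lh $ARCHIVE_HOME/my_project           # File sizes, whether online or offline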
2.3. Do not directly use files in the archive
A common mistake on systems with NFS-mounted archive file systems is that users attempt to access the contents of a file, forgetting that it may actually be on tape. Any attempt to read or use a file that is on tape automatically triggers its retrieval, for example a command like:

zcat *.tar.gz | tar -tv | grep search_term   # DON'T EVER DO THIS!!!

The intent of this command is to grep through the content listings of multiple compressed tar files for a search term. On a normal file system, this would be no big deal. But on an archive file system, since the zcat command reads the contents of compressed files, it requires the retrieval of every one of the compressed tar files (possibly many files), which could overwhelm the disk cache on the archive server. This is undesirable.
If you inadvertently do something like this, cancel the command immediately and contact the HPC Help Desk.
2.4. Use manifests
If you frequently need to search the contents of many tar files, creating and using a manifest file is easier on the archive system and faster for you. To create a manifest for a tar file:
tar -tf file.tar > file.manifest
This can be searched as follows:
grep search_term file.manifest    # Search within that file
# or
grep search_term *.manifest       # Search all manifest files in this directory
While it may be convenient to keep the manifest files in the same location as the tar files on the archive server, you could keep them in your home or permanent project directory. This would improve performance even further because you would not need to wait for the manifest files to migrate from tape. Since the manifest files are relatively small, they should not occupy too much space in these locations; however, it is advisable to back up your manifest files periodically.
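For example, you could generate the manifest at the same time you create the tar file, before anything is archived, and keep the manifests under your home directory. The names and paths here are only illustrative.

mkdir -p $HOME/manifests
tar -czvf run42_results.tgz run42_results/                             # Create the compressed tar file
tar -tzf run42_results.tgz > $HOME/manifests/run42_results.manifest   # Record its contents
cp run42_results.tgz $ARCHIVE_HOME/                                    # Archive only the tar file

# Later, search all manifests without touching the archive
grep search_term $HOME/manifests/*.manifest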
2.5. Treat important data with appropriate caution
The MHPCC DSRC cannot guarantee against unexpected data loss or corruption. Especially important or irreplaceable data should be stored in a second location at your local facility.
3. Data Staging
3.1. What is data staging?
Data staging is the process of ensuring your data is in the right place at the right time. Related terms are "staging in" or "pre-staging" and "staging out" or "post-job archival." Before a job can run, the input data must be "staged in" or "pre-staged." This simply means the data is copied from the archive server (or some other source) into a directory accessible by the job script. Archiving your output data after the job completes is called "post-job archival" or "staging out." "Staging out" may also refer to moving your output data to another location, like the Center-Wide File System ($CENTER), for further processing.
Staging may be performed manually, but since retrieving a file (especially a large file) from tape may take a while, ensuring your input data is in place before your job runs, and that it stays there until the job starts, isn't always as simple as it sounds. The following sections describe different approaches for staging your data.
3.2. Staging in Compute Queues (Not Supported)
WARNING! DO NOT attempt to stage data to or from the archive server in a job running in any compute queue. The staging attempt WILL FAIL and may consume a significant amount of your allocation before it does. Additionally, failed stage-out attempts may leave your data at risk.
Staging in batch jobs should only be performed in the transfer queue.
3.3. Staging from the Command Line (Manual Staging)
Manual staging is simply staging from the command line without using the transfer queue. For many users, this is the simplest way to do staging because small data sets can usually be transferred while you wait. (Your mileage may vary based on system load.) There are, however, a few things to consider before deciding to stage data manually.
- Check the size of your data first; if your data exceeds 120 GB, you may want to consider staging via the transfer queue. See Section 3.4 for additional details.
- If your login shell terminates before your transfer completes, your transfer will die. To avoid this, either use a transfer job (Section 3.4) or "background and detach" your transfer command, as follows:

nohup archive get myfile.tar.gz &
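Because the archive file system is NFS-mounted (Section 1.3), the same "background and detach" approach works with an ordinary cp. The file name below is a placeholder.

nohup cp $ARCHIVE_HOME/myfile.tar.gz $WORKDIR/ &    # Output messages go to nohup.out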
3.4. Staging in Transfer Queue Jobs (Batch Staging)
If any of the following apply, you should batch stage your data in the transfer queue:
- If you don't have time to wait for your data to stage
- If you want to submit a job as soon as the input data is staged
- If you want to archive your data as soon as a job completes
Note: Additional examples of all scripts in this guide are also found in the Sample Code Repositories ($SAMPLES_HOME) on the systems.
3.4.1. What is the transfer queue?
The transfer queue is a special-purpose queue for transferring or archiving files. It has access to $HOME, $ARCHIVE_HOME, $WORKDIR, and $CENTER. Jobs running in the transfer queue use non-computational cores and do not accrue time against your allocation.
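If you want to check the transfer queue before submitting, and assuming the standard Slurm client tools are available on the login nodes, you can look at its state and at your jobs in it:

sinfo -p transfer             # Show the state of the transfer partition
squeue -p transfer -u $USER   # Show your jobs currently in the transfer queue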
3.4.2. Staging in via the transfer queue (Pre-staging)
By pre-staging your data in a transfer queue job, you don't have to sit around and wait for your data to be staged before submitting your computational job. The following standalone script demonstrates retrieval of archived data from the archive server, placing it in a newly created directory in your $WORKDIR, whose name is based on the JOBID. Let's call this a "pre-staging job."
Note, the transfer queue is first-come, first-served, regardless of walltime. While you must supply a walltime, it is not used to schedule your transfer queue job. If you set it too low, however, your transfer may die before completion, so you should always request the maximum walltime for transfer queue jobs.
Slurm Example Script
#!/bin/bash
#SBATCH --partition=transfer
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 48:00:00
#SBATCH -A Your_Project_ID

mkdir ${WORKDIR}/my_job.${SLURM_JOB_ID}   # Create unique Job directory
cd ${WORKDIR}/my_job.${SLURM_JOB_ID}      # cd to unique Job directory

# Exit if the archive server is unavailable.
mount | grep archive
if [ $? -ne 0 ]; then
   echo "Exiting: `date` - Archive system not on-line!!"
   exit
fi

# Retrieve data from archive and extract
cp $ARCHIVE_HOME/my_input_data.tgz .
tar -xvf my_input_data.tgz
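To use it, save the script to a file (the name below is just an example) and submit it like any other batch job; the copy and extraction then run in the transfer queue rather than in your login session:

sbatch pre_stage_job.sh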
3.4.3. Staging-out via the transfer queue
The term "staging out" refers to the process of dealing with the data that's left in your $WORKDIR after your computational job completes. This generally entails deletion of unneeded files and archival or transfer of important data, which can be time-consuming. Because of this, users can benefit from using the transfer queue for these activities. (Remember jobs in the transfer queue do not consume allocation.) The following standalone scripts demonstrates archival of output data to the archive server via the transfer queue. Let's call this a "stage out job."
Slurm Example Script
#!/bin/bash
#SBATCH --partition=transfer
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 48:00:00
#SBATCH -A Your_Project_ID

cd ${WORKDIR}   # cd to wherever your data is

# Exit if the archive server is unavailable.
mount | grep archive
if [ $? -ne 0 ]; then
   echo "Exiting: `date` - Archive system not on-line!!"
   exit
fi

mkdir $ARCHIVE_HOME/my_job.${SLURM_JOB_ID}   # Create unique Job directory

# Tar, zip, and archive data.
tar -czvf my_output_data.tgz my_output_data
cp my_output_data.tgz $ARCHIVE_HOME/my_job.${SLURM_JOB_ID}
3.4.4. Tying it all together
While the previous examples were standalone examples, the following technique creates a 3-step job chain that runs from stage-in to stage-out without any involvement from you. This can be advantageous if your workflow is already well-defined and proven and does not require you to personally analyze your output prior to staging out.
Conceptually, the process looks like this: the stage-in job retrieves your input data from the archive server and then submits the compute job; the compute job runs in its usual queue and then submits the stage-out job; and the stage-out job archives (or transfers) your results.
If, however, your workflow does require an eyes-on analysis of the output data or if it requires post processing prior to analysis, you may want to use the stage out job instead to transfer your data to $CENTER, as demonstrated in Section 3.4.4.4 (below). You may still submit a transfer queue job later to archive data you want to keep.
3.4.4.1. Script 1 of 3 (Stage-In)
Slurm Example Script
This script contains the stage-in job and launches the compute job.
#!/bin/bash
#SBATCH --partition=transfer
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 48:00:00
#SBATCH -A Your_Project_ID

mkdir ${WORKDIR}/my_job.${SLURM_JOB_ID}   # Create unique Job directory
cd ${WORKDIR}/my_job.${SLURM_JOB_ID}      # cd to unique Job directory

# Exit if the archive server is unavailable.
mount | grep archive
if [ $? -ne 0 ]; then
   echo "Exiting: `date` - Archive system not on-line!!"
   exit
fi

# Retrieve data from archive and extract
cp $ARCHIVE_HOME/my_input_data.tgz .
tar -xvf my_input_data.tgz

# Submit compute job
sbatch ${WORKDIR}/my_compute_script

exit
3.4.4.2. Script 2 of 3 (Compute)
Slurm Example Script
This script contains the compute job and launches the stage-out job. Note the use of the $SLURM_SUBMIT_DIR environment variable. This variable is automatically set to the directory in which sbatch is executed in script 1. This script then cd's to that directory before launching its job.
#!/bin/bash
#SBATCH --partition=standard
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 96:00:00
#SBATCH -A Your_Project_ID
#SBATCH --no-requeue

# cd to the job directory that was created in the stage-in script (Script 1)
cd ${SLURM_SUBMIT_DIR}

## The following lines show launch commands for the Slurm systems at this center.
## Keep only the line for the system you're running on.

# Computation finished. Submit job to pack and archive data
sbatch ${WORKDIR}/my_stage-out_script

exit
3.4.4.3. Script 3 of 3 (Stage-out to $ARCHIVE_HOME)
Slurm Example Script
This script contains the stage-out job launched by Script 2. Note the use of the $SLURM_SUBMIT_DIR environment variable. Because Script 2 cd's to the job directory created by Script 1 before calling sbatch, this variable points to that same job directory. This script then cd's to that directory before attempting to stage data to $ARCHIVE_HOME.
#!/bin/bash
#SBATCH --partition=transfer
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 48:00:00
#SBATCH -A Your_Project_ID

# cd to the job directory that was created in the stage-in script (Script 1)
cd ${SLURM_SUBMIT_DIR}

# Exit if the archive server is unavailable.
mount | grep archive
if [ $? -ne 0 ]; then
   echo "Exiting: `date` - Archive system not on-line!!"
   exit
fi

mkdir $ARCHIVE_HOME/my_job.${SLURM_JOB_ID}   # Create unique Job directory

# Tar, zip, and archive data.
tar -czvf my_output_data.tgz my_output_data
cp my_output_data.tgz $ARCHIVE_HOME/my_job.${SLURM_JOB_ID}

exit
3.4.4.4. Alternate Script 3 of 3 (Stage-out to $CENTER)
Slurm Example Script
This script contains the stage-out job launched by Script 2. Note the use of the $SLURM_SUBMIT_DIR environment variable. Because Script 2 cd's to the job directory created by Script 1 before calling sbatch, this variable points to that same job directory. This script then cd's to that directory before attempting to stage data to $CENTER.
#!/bin/bash
#SBATCH --partition=transfer
#SBATCH --open-mode=append
#SBATCH -o your_filename.out
#SBATCH -t 48:00:00
#SBATCH -A Your_Project_ID

# cd to the job directory that was created in the stage-in script (Script 1)
cd ${SLURM_SUBMIT_DIR}

# Exit if the Center-Wide file system is unavailable.
# (Attempting to execute the directory path returns status 126 if it exists;
#  any other status means it is unavailable.)
$CENTER
if [ $? -ne 126 ]; then
   echo "Exiting: `date` - The Center-Wide file system is unavailable!!"
   exit 3
fi

tar cvzf my_output_data.tgz my_output_data
mkdir ${CENTER}/my_job.${SLURM_JOB_ID}   # Create unique Job directory
cp my_output_data.tgz ${CENTER}/my_job.${SLURM_JOB_ID}

exit