System Status User Notices
BC Project: FY06-10
Date of Policy: 16 Apr 2007
Last Updated: 29 Nov 2021 (see Revision Log)
HPCMP centers shall notify users of events or situations that affect significant numbers of their users. Such situations include, but are not limited to:
- Scheduled and unscheduled system downtime
- Emergency maintenance
- Disasters
- Degraded performance or functionality
- Any issue that a center deems urgent
For scheduled maintenance, users should be notified at least 4 days before the start of the downtime for each day of planned outage. For example, a planned 2-day outage should be announced at least 8 days in advance.
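The lead-time rule above is simple arithmetic, and a short sketch may help when scheduling announcements. The function names are illustrative only, and the sketch assumes whole calendar days; the policy does not say how partial days are counted.

```python
from datetime import date, timedelta

# From the policy text: 4 days of notice per day of planned outage.
NOTICE_DAYS_PER_OUTAGE_DAY = 4


def required_notice_days(outage_days: int) -> int:
    """Return the minimum notice period, in days, for a planned outage."""
    if outage_days < 1:
        raise ValueError("outage_days must be at least 1")
    return NOTICE_DAYS_PER_OUTAGE_DAY * outage_days


def latest_announcement_date(outage_start: date, outage_days: int) -> date:
    """Return the last calendar date on which the notice can still go out."""
    # Assumption: notice is counted in calendar days, not business days.
    return outage_start - timedelta(days=required_notice_days(outage_days))
```

For instance, `required_notice_days(2)` gives 8, matching the 2-day example in the policy.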
In the event of unscheduled/emergency system downtime or degraded performance/functionality, users should be notified immediately after the system is determined to be down or impacted. A follow-up notification containing more specifics, such as cause and expected return to service date/time, can be sent when these specifics become available. In the event of any type of downtime or degradation, a notification informing users of system availability should be sent out as soon as the system has been returned to production.
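The sequence described above (an immediate notice, an optional follow-up once specifics are known, and a return-to-production notice) could be tracked along the following lines. This is a minimal sketch, not a prescribed implementation; the `Incident` class, its field names, and the notice labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Incident:
    """Hypothetical record of an unscheduled downtime or degradation."""
    system: str
    cause: Optional[str] = None            # filled in once known
    expected_return: Optional[str] = None  # filled in once known
    restored: bool = False                 # True once back in production


def notices_due(incident: Incident, already_sent: Set[str]) -> List[str]:
    """Return the notice types that should be sent now, in order."""
    due = []
    # The initial notice goes out as soon as the problem is confirmed.
    if "initial" not in already_sent:
        due.append("initial")
    # The follow-up is optional and only makes sense once specifics exist.
    has_specifics = incident.cause is not None or incident.expected_return is not None
    if has_specifics and "follow_up" not in already_sent:
        due.append("follow_up")
    # The availability notice is sent once the system is back in production.
    if incident.restored and "restored" not in already_sent:
        due.append("restored")
    return due
```

Keeping the set of notices already sent separate from the incident record makes the check safe to run repeatedly without producing duplicate notices.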
Delivery should occur through multiple channels including Email, Message of the Day, and Web Posting to provide the widest possible coverage.
Email
Notification must be sent to:
- All enabled users of the system or center.
- All S/AAAs with projects on the system or at the center.
- HPC Help Desk.
Centers must establish and maintain lists of email addresses for users and S/AAAs, and keep this information readily available. Membership in these lists is mandatory.
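As an illustration of how the three recipient groups might be combined into a single mailing, the sketch below merges them and removes duplicates. The function name and the case-insensitive matching are assumptions, not part of the policy, and the addresses in the usage note are placeholders.

```python
from typing import Iterable, List


def build_recipient_list(
    enabled_users: Iterable[str],
    saaas: Iterable[str],
    help_desk: str,
) -> List[str]:
    """Merge the required recipient groups, keeping first-seen order."""
    seen = set()
    recipients = []
    for address in [*enabled_users, *saaas, help_desk]:
        cleaned = address.strip()
        key = cleaned.lower()  # assumption: compare addresses case-insensitively
        if key and key not in seen:
            seen.add(key)
            recipients.append(cleaned)
    return recipients
```

A person who is both an enabled user and an S/AAA then receives one copy of the notice; for example, passing `["alice@example.org"]` and `["Alice@example.org"]` yields a single entry for that address.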
Message of the Day (MOTD)
An entry is created in the applicable system's message of the day (MOTD).
Web Posting
A message is posted on the center's website (e.g., using the Web team's News and Maintenance Tool (NAMT)).
Centers shall follow established HPCMP and DoD security policies and procedures concerning information dissemination, especially regarding security bulletins and messages posted to public websites.
In the event of disaster or loss of connectivity, a center can ask the HPC Help Desk to send email and post NAMT messages on behalf of the center.
Message Content
To ensure all system issues that affect users are reported consistently, user notices will follow these conventions:
- Use the classification term DOWN to document when a system(s) is fully down.
- Use the classification term DEGRADED to document operational system issues that affect users but do not fall into the DOWN classification.
- Use the color YELLOW to denote DEGRADED.
- Implement the TEMPLATE for STANDARD USER NOTICES, which lists the subsystems that are and are not affected.
The following is an example template:
Impacted System: Users are currently experiencing problems with Narwhal.
Impact: DOWN or DEGRADED PERFORMANCE or FUNCTIONALITY
Date/Time Issue Reported: 11/3/2020 11:18:06 AM
Date/Time Issue Resolved:
- A network outage is blocking external access at this time. Users cannot log into the system. Users cannot submit new jobs to the queue.
- Previously submitted jobs will start. Existing jobs are still running. Job access to workspace and archive is still functional.
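A notice in the shape of the example template could be generated programmatically, which helps keep the classification terms consistent. The following is a minimal sketch under stated assumptions: the `UserNotice` class and its field names are hypothetical, and the impact line is restricted to the bare classification terms DOWN and DEGRADED.

```python
from dataclasses import dataclass, field
from typing import List

# Classification terms named in the policy.
VALID_IMPACTS = ("DOWN", "DEGRADED")


@dataclass
class UserNotice:
    """Hypothetical container for the fields of the example template."""
    system: str
    impact: str
    reported: str
    resolved: str = ""  # left empty until the issue is resolved
    details: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the notice as plain text in the template's layout."""
        if self.impact not in VALID_IMPACTS:
            raise ValueError(f"impact must be one of {VALID_IMPACTS}")
        lines = [
            f"Impacted System: Users are currently experiencing problems with {self.system}.",
            f"Impact: {self.impact}",
            f"Date/Time Issue Reported: {self.reported}",
            f"Date/Time Issue Resolved: {self.resolved}".rstrip(),
        ]
        # One bullet per statement about affected or unaffected subsystems.
        lines += [f"- {detail}" for detail in self.details]
        return "\n".join(lines)
```

Rejecting any other impact value at render time prevents ad hoc terms such as "OFFLINE" or "SLOW" from reaching users.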
Revision Log

Date | Revision |
---|---|
29 Nov 2021 | BC Team Audit |
24 Feb 2021 | Added definition of "Degraded" and added the template message |
26 Apr 2018 | BC Team Audit |
02 Jun 2016 | BC Team Audit |
17 Apr 2014 | BC Team Audit |
12 Sep 2012 | Emphasized importance of communicating with users whenever unscheduled maintenance or problem occurs |
23 Mar 2012 | BC Team Audit |
13 Nov 2008 | Added file containing blocked ports at participating sites |
29 Nov 2007 | Changed link to Kirby, added other systems |