System Status User Notices

BC Project: FY06-10
Date of Policy: 16 Apr 2007
Last Updated: 29 Nov 2021 (see Revision Log)

HPCMP centers shall notify users of events or situations which affect significant numbers of their users. Such situations include, but are not limited to:

  • Scheduled and unscheduled system downtime
  • Emergency maintenance
  • Disasters
  • Degraded performance or functionality
  • Any issue that a center deems urgent

For scheduled maintenance, users should be notified at least 4 days before the start of the downtime for each day of planned outage. For example, a planned 2-day outage should be announced at least 8 days in advance.
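The lead-time rule above is simple arithmetic. As a minimal sketch, assuming outages are counted in whole days (the policy does not say how partial days are handled), the latest acceptable announcement date could be computed as follows. The function and constant names are illustrative and are not part of the policy.

```python
from datetime import date, timedelta

# From the policy: at least 4 days of notice for each day of planned outage.
NOTICE_DAYS_PER_OUTAGE_DAY = 4


def latest_announcement_date(outage_start: date, outage_days: int) -> date:
    """Return the latest date on which a scheduled outage may be announced.

    Assumption: outage_days is a whole number of days; rounding of
    partial days is not specified by the policy.
    """
    if outage_days < 1:
        raise ValueError("outage_days must be at least 1")
    lead_time = timedelta(days=NOTICE_DAYS_PER_OUTAGE_DAY * outage_days)
    return outage_start - lead_time
```

With this sketch, a 2-day outage starting on 29 Nov 2021 would need to be announced no later than 21 Nov 2021.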

In the event of unscheduled/emergency system downtime or degraded performance/functionality, users should be notified immediately after the system is determined to be down or impacted. A follow-up notification containing more specifics, such as cause and expected return to service date/time, can be sent when these specifics become available. In the event of any type of downtime or degradation, a notification informing users of system availability should be sent out as soon as the system has been returned to production.

Delivery should occur through multiple channels including Email, Message of the Day, and Web Posting to provide the widest possible coverage.

Email

Notification must be sent to:

  • All enabled users of the system or center.
  • All S/AAAs with projects on the system or at the center.
  • HPC Help Desk.

Centers must establish and maintain lists of email addresses for users and S/AAAs, and keep this information readily available. Membership in these lists is mandatory.

Message of the Day (MOTD)

An entry must be created in the applicable system's message of the day (MOTD).

Web Posting

A message must be posted on the center's website (e.g., using the Web team's News and Maintenance Tool (NAMT)).

Centers shall follow established HPCMP and DoD security policies and procedures concerning information dissemination, especially regarding security bulletins and messages posted to public websites.

In the event of disaster or loss of connectivity, a center can ask the HPC Help Desk to send email and post NAMT messages on behalf of the center.

Message Content

To ensure that all system issues affecting users are reported consistently, user notices will follow these conventions:

  • Use the classification term DOWN to document when one or more systems are fully down.
  • Use the classification term DEGRADED to document operational system issues that affect users but do not fall into the DOWN classification.
  • Use the color YELLOW to denote DEGRADED.
  • Implement a TEMPLATE for STANDARD USER NOTICES that lists which subsystems are and are not affected. The following is an example template:

    Impacted System: Users are currently experiencing problems with Narwhal.

    Impact: DOWN or DEGRADED PERFORMANCE or FUNCTIONALITY

    Date/Time Issue Reported: 11/3/2020 11:18:06 AM

    Date/Time Issue Resolved:

    1. A network outage is blocking external access at this time. Users cannot log into the system. Users cannot submit new jobs to the queue.
    2. Previously submitted jobs will start, and existing jobs are still running. Job access to workspace and archive is still functional.
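One way a center might keep notices consistent with the example template is to generate them from structured fields. The following is a minimal sketch only: the UserNotice class, its field names, and the restriction of the impact field to the two classification terms DOWN and DEGRADED are assumptions made for illustration, not an HPCMP tool or a NAMT interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Classification terms named in the policy.
VALID_IMPACTS = ("DOWN", "DEGRADED")


@dataclass
class UserNotice:
    """Hypothetical container for the fields of the example template."""

    system: str
    impact: str
    reported: str
    resolved: Optional[str] = None  # left blank until the issue is resolved
    details: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the notice as plain text in the template's field order."""
        if self.impact not in VALID_IMPACTS:
            raise ValueError(f"impact must be one of {VALID_IMPACTS}")
        lines = [
            "Impacted System: Users are currently experiencing problems "
            f"with {self.system}.",
            f"Impact: {self.impact}",
            f"Date/Time Issue Reported: {self.reported}",
            f"Date/Time Issue Resolved: {self.resolved or ''}".rstrip(),
        ]
        # Numbered statements of what is and is not affected.
        lines += [f"{i}. {text}" for i, text in enumerate(self.details, start=1)]
        return "\n".join(lines)
```

Calling render() on a notice with no resolved time leaves the "Date/Time Issue Resolved:" field empty, as in the example template, so the same object can be reissued once the system returns to production.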

Revision Log

Date         Revision
29 Nov 2021  BC Team Audit
24 Feb 2021  Added definition of "Degraded" and added the template message
26 Apr 2018  BC Team Audit
02 Jun 2016  BC Team Audit
17 Apr 2014  BC Team Audit
12 Sep 2012  Emphasized importance of communicating with users whenever unscheduled maintenance or problem occurs
23 Mar 2012  BC Team Audit
13 Nov 2008  Added file containing blocked ports at participating sites
29 Nov 2007  Changed link to Kirby, added other systems