Prism - U.S. - Dallas Infrastructure outage – Incident details

U.S. - Dallas Infrastructure outage

Resolved
Major outage
Started 5 months ago
Lasted about 1 hour

Affected

US - Dallas Infrastructure

Operational from 6:46 AM to 7:42 AM

Updates
  • Resolved

    DAL-1-PG Outage Postmortem

    All times in MST

    Background:

    After observing rising Java Runtime error rates, likely caused by faulty hardware (potentially a bad RAM stick), we began monitoring our Dallas premium node DAL-1-PG more thoroughly. We had already scheduled maintenance to perform a memory swap.

    11:45 – Our alerting systems notified us of an outage on the node.

    11:46 – IPMI logs showed that the node had crashed and failed to automatically reboot due to file system errors.

    11:48 - 12:00 – We verified the integrity of previous backups and considered two recovery options: restoring the filesystem from offsite snapshots to another server and redirecting IPs, or engaging our hardware vendor to replace a potentially faulty drive. Unsure whether a bad RAM stick had caused the storage corruption or vice versa, we opted to run fsck and repair the RAID array.

    ~12:30 – The RAID array repair completed.

    12:30 - 12:35 – We validated the integrity of client files.

    12:40 – The node successfully booted up and services were restored.
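For readers who want to check the downtime arithmetic, the timeline above can be worked through with a minimal Python sketch. The `TIMELINE` list and the `minutes_between` helper are illustrative names introduced here, not part of any tooling described in this report, and the "~12:30" entry is treated as exactly 12:30 for the calculation.

```python
from datetime import datetime

# Timestamps copied from the postmortem timeline (all times MST).
# "~12:30" is approximate in the report; it is assumed to be 12:30 here.
TIMELINE = [
    ("11:45", "alert received"),
    ("11:46", "crash confirmed via IPMI logs"),
    ("12:30", "RAID array repair completed (approximate)"),
    ("12:40", "node booted, services restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Total outage: first alert to services restored.
total_outage = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])

# Elapsed minutes from the first alert to each later milestone.
elapsed = [
    (label, minutes_between(TIMELINE[0][0], stamp))
    for stamp, label in TIMELINE[1:]
]

if __name__ == "__main__":
    print(f"Total outage: {total_outage} minutes")
    for label, mins in elapsed:
        print(f"  +{mins:>2} min: {label}")
```

Under these assumptions the alert-to-recovery window comes to 55 minutes, consistent with the "about 1 hour" duration shown at the top of the incident.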

    We do not currently have spare hardware on hand to seamlessly transfer all workloads in the event of a complete failure, and we are not yet certain which component caused the issue. Diagnostic tools such as memtest can take eight hours or more on high-capacity systems, so running them would require an entire spare machine to carry the workload in the meantime. At this time, our monitoring indicates that everything is operating normally. However, if we detect any further issues, we will arrange to have a new node shipped overnight.

  • Investigating

    Node DAL-1-PG has gone offline, likely because of a memory failure.