U.S. - Dallas Infrastructure outage - Incident details - Prism

DAL-1-PG Outage Postmortem

All times in MST

Background:

After observing increasing Java Runtime error rates, likely caused by faulty hardware (potentially a bad RAM stick), we began monitoring our Dallas premium node DAL-1-PG more thoroughly. We had already scheduled maintenance to perform a memory swap.

11:45 – Our alerting systems notified us of an outage on the node.

11:46 – IPMI logs showed that the node had crashed and failed to automatically reboot due to file system errors.

11:48 - 12:00 – We verified the integrity of previous backups and considered recovery options: restoring the filesystem from offsite snapshots to another server and redirecting IPs, or engaging our hardware vendor to replace a potentially faulty drive. Because we could not tell whether a bad RAM stick had caused the storage corruption or vice versa, we chose to repair the existing RAID array in place with fsck.

~12:30 – The RAID array repair completed.

12:30 - 12:35 – Validated the integrity of client files.

12:40 – The node successfully booted up and services were restored.
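
For readers curious what a file integrity check like the one at 12:30 can look like, here is a minimal sketch. It is not the exact procedure used during this incident. It assumes a manifest of SHA-256 checksums was recorded before the crash (for example alongside a backup). The function names and the manifest layout are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare files under root with a manifest recorded earlier (hypothetical layout)."""
    current = build_manifest(root)
    return {
        # Listed in the manifest but no longer present on disk.
        "missing": sorted(set(manifest) - set(current)),
        # Present in both, but the contents no longer match.
        "changed": sorted(
            name for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
        # On disk but not in the manifest (for example, files created after it was recorded).
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

An empty "missing" and "changed" list after a repair suggests the files match their pre-crash state. Entries under "changed" may also be legitimate writes made after the manifest was taken, so they call for a manual look and are not proof of corruption.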

We do not currently have spare hardware on hand to transfer all workloads seamlessly in the event of a complete failure, and we are still not certain which component caused the issue. Diagnostic tools such as memtest can take eight hours or more on high-capacity systems, so running them without extended downtime would require an entire spare machine ready to take over. Our monitoring currently indicates that everything is operating normally. If we detect any further issues, we will arrange to have a new node shipped overnight.
