Sunday Outage Sunday

July 14, 2008 Ian Hamilton Technology

A hardware failure yesterday took the Topaz-hosted journals offline from 4pm to 10pm PST.

James sent an email late yesterday afternoon reporting site errors on the PLoS journal websites. Soon after his email, the IT team started receiving SMS alerts. I assumed that something had gone wrong with the Topaz framework and started looking through the relevant log files, but couldn’t find anything. I spent the requisite amount of time banging my head against the “site error” wall without success, then called Russ for assistance. After a bit of digging through the server logs, he found the culprit: a drive had failed on the Mulgara server. This drive is part of a RAID 5 configuration, so we didn’t lose any data, but we also mysteriously lost the connection from the Mulgara server to the DAS array (the disk storage for the Mulgara data).
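For anyone wondering why a single dead drive in a RAID 5 set doesn't cost any data, here is a minimal sketch of the parity idea. It is an illustration only: the four-drive layout and the block contents are made up, and a real array rotates parity across the drives and does all of this in the controller or driver rather than in Python.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical four-drive array: three data blocks plus parity.
data = [b"PLoS", b"data", b"safe"]
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Because the parity block is the XOR of the data blocks, XOR-ing everything that survives gives back the missing block. The same trick cannot recover two lost drives, which is why a degraded array needs watching until the rebuild finishes.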

We restarted the server but couldn’t confirm that it was rebuilding the RAID array correctly, so I drove down to the colo, confirmed the drive failure, and babysat the server until the platform was healthy. We’ll swap out the defective drive on Wednesday during the migration to a pre-release of Topaz 0.9.

In case you missed it, the title is a reference to U2’s “Sunday Bloody Sunday.”