jm + power + via:fanf   2

A SPOF UPS. There was a similar AZ-wide outage in one of the Amazon DUB datacenters with a similar root cause, if I recall correctly -- supposedly redundant dual UPS systems were in fact interdependent, in that case, and power supply switchover wasn't clean enough to avoid affecting the servers.
Minutes later power was restored was resumed in what one source described as “uncontrolled fashion.” Instead of gradual restore, all power was restored at once resulting in a power surge.   BA CEO Cruz told BBC Radio this power surge  caused network hardware to fail. Also server hardware was damaged because of the power surge.

It seems as if the UPS was the single point of failure for power feed of the IT equipment in Boadicea House . The Times is reporting that the same UPS was powering both Heathrow based datacenters. Which could be a double single point of failure if true (I doubt it is)

The broken network  stopped the exchange of messages between different BA systems and application. Without messaging, there is no exchange of information between various applications. BA is using Progress Software’s Sonic [enterprise service bus].

(via Tony Finch)
postmortems  ba  airlines  outages  fail  via:fanf  datacenters  ups  power  progress  esb  j2ee 
may 2017 by jm
Google's Pegasus
a power-management subsystem for warehouse-scale computing farms. "It adjusts the power-performance settings of servers so that the overall workload barely meets its latency constraints for user queries."
pegasus  power-management  power  via:fanf  google  latency  scaling 
june 2014 by jm

Copy this bookmark: