jm + datacenters   11

WHAT WENT WRONG IN BRITISH AIRWAYS DATACENTER IN MAY 2017?
A SPOF UPS. There was a similar AZ-wide outage in one of the Amazon DUB datacenters with a similar root cause, if I recall correctly -- supposedly redundant dual UPS systems were in fact interdependent, in that case, and power supply switchover wasn't clean enough to avoid affecting the servers.
Minutes later power was restored in what one source described as “uncontrolled fashion.” Instead of a gradual restore, all power was restored at once, resulting in a power surge. BA CEO Cruz told BBC Radio this power surge caused network hardware to fail. Server hardware was also damaged by the power surge.

It seems as if the UPS was the single point of failure for the power feed of the IT equipment in Boadicea House. The Times is reporting that the same UPS was powering both Heathrow-based datacenters, which could be a double single point of failure if true (I doubt it is).

The broken network stopped the exchange of messages between different BA systems and applications. Without messaging, there is no exchange of information between various applications. BA is using Progress Software’s Sonic [enterprise service bus].


(via Tony Finch)
postmortems  ba  airlines  outages  fail  via:fanf  datacenters  ups  power  progress  esb  j2ee 
26 days ago by jm
'Jupiter rising: A decade of Clos topologies and centralized control in Google’s datacenter networks'
Love the 'decade of' dig at FB and Amazon -- 'we were doing it first' ;)

Great details on how Google have built out and improved their DC networking. Includes a hint that they now use DCTCP (datacenter-optimized TCP congestion control) on their internal hosts....
datacenter  google  presentation  networks  networking  via:irldexter  ops  sre  clos-networks  fabrics  switching  history  datacenters 
october 2016 by jm
Facebook's datacenter fabric
FB goes public with its take on the Clos network-based datacenter network architecture
networking  scaling  facebook  clos-networks  fabrics  datacenters  network-architecture 
november 2014 by jm
Inside a Chinese Bitcoin Mine
The mining operation resides on an old, repurposed factory floor, and contains 2500 machines hashing away at 230 Gh/s, each. (That’s 230 billion calculations per second, per unit). [...] The operators told me that the power bill of this specific operation is in excess of ¥400,000 per month [..] about $60,000 USD.
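For a sense of scale, the quoted figures can be combined in a quick back-of-envelope calculation. This is only a sketch: the variable names are made up for illustration, the exchange rate is merely the one implied by the two quoted bill figures, and since the bill is described as "in excess of" ¥400,000 the per-machine cost is a lower bound.

```python
# Figures as quoted in the excerpt above; nothing here is measured independently.
MACHINES = 2500
HASHRATE_PER_MACHINE_GHS = 230      # Gh/s per unit
MONTHLY_BILL_CNY = 400_000          # "in excess of", so a lower bound
MONTHLY_BILL_USD = 60_000           # the article's own approximate conversion

# Aggregate hashrate of the whole floor, converted from Gh/s to Th/s.
total_ths = MACHINES * HASHRATE_PER_MACHINE_GHS / 1000

# Exchange rate implied by the two quoted bill figures (not a market rate).
implied_cny_per_usd = MONTHLY_BILL_CNY / MONTHLY_BILL_USD

# Lower bound on monthly power cost per machine, in USD.
usd_per_machine = MONTHLY_BILL_USD / MACHINES

print(f"total hashrate: {total_ths:.0f} Th/s")
print(f"implied rate: {implied_cny_per_usd:.2f} CNY per USD")
print(f"power cost per machine: at least ${usd_per_machine:.0f}/month")
```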
currency  china  economics  bitcoin  power  environment  green  mining  datacenters 
august 2014 by jm
Shutterbits replacing hardware load balancers with local BGP daemons and anycast
Interesting approach. Potentially risky, though -- heavy use of anycast on a large-scale datacenter network could increase the scale of the OSPF graph, which scales exponentially. This can have major side effects on OSPF reconvergence time, which creates an interesting class of network outage in the event of OSPF flapping.

Having said that, an active/passive failover LB pair will already announce a single anycast virtual IP anyway, so, assuming there are a similar number of anycast IPs in the end, it may not have any negative side effects.

There's also the inherent limitation noted in the second-to-last paragraph; 'It comes down to what your hardware router can handle for ECMP. I know a Juniper MX240 can handle 16 next-hops, and have heard rumors that a software update will bump this to 64, but again this is something to keep in mind'. Taking a leaf from the LB design, and using BGP to load-balance across a smaller set of haproxy instances, would seem like a good approach to scale up.
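To illustrate the ECMP limit, here is a minimal sketch of hash-based next-hop selection with a cap on the number of usable next-hops. It is an assumption-laden toy: real routers use vendor-specific hash functions, and the `pick_next_hop` helper, the addresses, and the use of SHA-256 are all hypothetical. The only figure taken from the quote is the limit of 16.

```python
import hashlib

MAX_ECMP_NEXT_HOPS = 16  # limit quoted above for a Juniper MX240


def pick_next_hop(flow, next_hops, limit=MAX_ECMP_NEXT_HOPS):
    """Pick a next-hop for a flow by hashing its 5-tuple.

    Only the first `limit` next-hops are usable, mimicking a router
    that cannot install more ECMP paths than its hardware supports.
    """
    usable = next_hops[:limit]
    if not usable:
        raise ValueError("no next-hops available")
    # A stable hash keeps every packet of one flow on the same backend.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(usable)
    return usable[index]


# Hypothetical pool of 24 haproxy instances announcing the same anycast VIP.
haproxy_nodes = [f"10.0.0.{i}" for i in range(1, 25)]

# Hypothetical flow: (src IP, src port, dst IP, dst port, protocol).
flow = ("192.0.2.10", 51514, "198.51.100.1", 443, "tcp")
chosen = pick_next_hop(flow, haproxy_nodes)
print(chosen)
```

With 24 announcers and a limit of 16, eight instances would never receive traffic in this toy model, which is why keeping the set of BGP-announcing haproxy instances at or below the router's ECMP limit looks attractive.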
scalability  networking  performance  load-balancing  bgp  exabgp  ospf  anycast  routing  datacenters  scaling  vips  juniper  haproxy  shutterstock 
may 2014 by jm
Linode announces new instance specs
'TL;DR: SSDs + Insane network + Faster processors + Double the RAM + Hourly Billing'
hosting  linode  ssd  performance  linux  ops  datacenters 
april 2014 by jm
F.B.I. Seizes Web Servers, Knocking Sites Offline
law enforcement fail. "the agents took entire server racks, perhaps because they mistakenly thought that “one enclosure is = to one server,” [DigitalOne's CEO] said in an e-mail."
search-and-seizure  law-enforcement  fbi  fail  datacenters  racks  digitalone  usa  hosting 
june 2011 by jm
Kraken
a Cray XT5 supercomputer at Oak Ridge Nat Labs -- check out that amazing skin! I've never seen a skinned datacenter before
datacenters  kraken  oak-ridge  squid  art  cray  supercomputing  from delicious
march 2010 by jm
What Second Life can teach your datacenter about scaling Web apps
good scaling advice from Linden Lab's Ian Wilkes (who doesn't seem to have a blog, sadly)
linden  ian-wilkes  scaling  datacenters  scalability  deployment  ops  services  from delicious
february 2010 by jm
