jm + sev1

Stephanie Dean on event management and incident response
I asked my ex-Amazon mates on Twitter about good docs on incident response practices outside the "iron curtain", and they pointed me at this blog (which I didn't realise existed).

Stephanie Dean was the front-line ops manager for Amazon for many years, during the period when they basically *fixed* their availability problems. She has since moved on to Facebook, Demonware, and Twitter. She really knows her stuff, and this blog is FULL of great details on how they ran (and still run) front-line ops teams at Amazon.
ops  incident-response  outages  event-management  amazon  stephanie-dean  techops  tos  sev1 
october 2014 by jm
Turbocharging Solr Index Replication with BitTorrent
Etsy are now replicating their multi-GB search index across the search farm using BitTorrent. Why not multicast? 'multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.' Fun!
etsy  multicast  sev1  bittorrent  search  solr  rsync  scaling  outages 
february 2012 by jm
