jm + outages   58

Air Canada near-miss: Air traffic controllers make split-second decisions in a culture of "psychological safety" — Quartz
“’Just culture’ as a term emerged from air traffic control in the late 1990s, as concern was mounting that air traffic controllers were unfairly cited or prosecuted for incidents that happened to them while they were on the job,” Sidney Dekker, a professor, writer, and director of the Safety Science Innovation Lab at Griffith University in Australia, explains to Quartz in an email. Eurocontrol, the intergovernmental organization that focuses on the safety of airspace across Europe, has “adopted a harmonized ‘just culture’ that it encourages all member countries and others to apply to their air traffic control organizations.”

[...] One tragic example of what can happen when companies don’t create a culture where employees feel empowered to raise questions or admit mistakes came to light in 2014, when an investigation into a faulty ignition switch that caused more than 100 deaths at GM Motors revealed a toxic culture of denying errors and deflecting blame within the firm. The problem was later attributed to one engineer who had not disclosed an obvious issue with the flawed switch, but many employees spoke of extreme pressure to put costs and delivery times before all other considerations, and to hide large and small concerns.

(via JG)
just-culture  atc  air-traffic-control  management  post-mortems  outages  reliability  air-canada  disasters  accidents  learning  psychological-safety  work 
16 days ago by jm
OVH suffer 24-hour outage (The Register)
Choice quotes:

‘At 6:48pm, Thursday, June 29, in Room 3 of the P19 datacenter, due to a crack on a soft plastic pipe in our water-cooling system, a coolant leak causes fluid to enter the system';
‘This process had been tested in principle but not at a 50,000-website scale’
postmortems  ovh  outages  liquid-cooling  datacenters  dr  disaster-recovery  ops 
4 weeks ago by jm
A SPOF UPS. There was a similar AZ-wide outage in one of the Amazon DUB datacenters with a similar root cause, if I recall correctly -- supposedly redundant dual UPS systems were in fact interdependent, in that case, and power supply switchover wasn't clean enough to avoid affecting the servers.
Minutes later power was resumed in what one source described as “uncontrolled fashion.” Instead of gradual restore, all power was restored at once resulting in a power surge. BA CEO Cruz told BBC Radio this power surge caused network hardware to fail. Also server hardware was damaged because of the power surge.

It seems as if the UPS was the single point of failure for power feed of the IT equipment in Boadicea House. The Times is reporting that the same UPS was powering both Heathrow based datacenters. Which could be a double single point of failure if true (I doubt it is)

The broken network stopped the exchange of messages between different BA systems and applications. Without messaging, there is no exchange of information between various applications. BA is using Progress Software’s Sonic [enterprise service bus].

(via Tony Finch)
postmortems  ba  airlines  outages  fail  via:fanf  datacenters  ups  power  progress  esb  j2ee 
11 weeks ago by jm
S3 2017-02-28 outage post-mortem
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.  
s3  postmortem  aws  post-mortem  outages  cms  ops 
march 2017 by jm
"I caused an outage" thread on twitter
Anil Dash: "What was the first time you took the website down or broke the build? I’m thinking of all the inadvertent downtime that comes with shipping."

Sample response: 'Pushed a fatal error in lib/display.php to all of FB’s production servers one Friday night in late 2005. Site loaded blank pages for 20min.'
outages  reliability  twitter  downtime  fail  ops  post-mortem 
march 2017 by jm
Fault Domains and the Vegas Rule | Expedia Engineering Blog
I like this concept -- analogous to AWS' AZs -- limit blast radius of an outage by explicitly defining dependency scopes
aws  az  fault-domains  vegas-rule  blast-radius  outages  reliability  architecture 
february 2017 by jm
PagerDuty Incident Response Documentation
This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process).

This is a really good set of processes -- quite similar to what we used in Amazon for high-severity outage response.
ops  process  outages  pagerduty  incident-response  incidents  on-call 
january 2017 by jm
Simple testing can prevent most critical failures
Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:
Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.

(Interestingly, the latter was a particular favourite approach of some misplaced "fail fast"/"crash-only software design" dogma in Amazon. I wasn't a fan)
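
A minimal Python sketch of the contrast, not taken from the paper: `FlakyStore`, `TransientStoreError` and both `save_order_*` functions are hypothetical names made up for illustration. The first handler shows the "catch everything, log, carry on" pattern from the list above; the second catches only the error it knows how to handle and lets anything else propagate.

```python
import logging

logger = logging.getLogger("orders")


class TransientStoreError(Exception):
    """Hypothetical error type for a retryable storage failure."""


class FlakyStore:
    """Hypothetical in-memory store that fails its first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.written = []

    def write(self, order):
        if self.failures > 0:
            self.failures -= 1
            raise TransientStoreError("store unavailable")
        self.written.append(order)


def save_order_bad(store, order):
    # Anti-pattern: a catch-all handler that only logs, so the caller
    # is told the write succeeded even when it did not.
    try:
        store.write(order)
    except Exception:
        logger.error("write failed")  # TODO: handle this properly
    return True


def save_order_better(store, order, retries=3):
    # Catch only the error we know how to handle, retry a bounded number
    # of times, and report failure honestly. Unexpected exceptions are
    # not caught here and reach the caller.
    for attempt in range(1, retries + 1):
        try:
            store.write(order)
            return True
        except TransientStoreError:
            logger.warning("write attempt %d/%d failed", attempt, retries)
    return False
```

With a store that fails once, `save_order_bad` returns `True` although nothing was written, which is the kind of thing a simple code review can spot.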
fail-fast  crash-only-software  coding  design  bugs  code-review  review  outages  papers  logging  errors  exceptions 
october 2016 by jm
A Loud Sound Just Shut Down a Bank's Data Center for 10 Hours | Motherboard
The purpose of the drill was to see how the data center's fire suppression system worked. Data centers typically rely on inert gas to protect the equipment in the event of a fire, as the substance does not chemically damage electronics, and the gas only slightly decreases the temperature within the data center.

The gas is stored in cylinders, and is released at high velocity out of nozzles uniformly spread across the data center. According to people familiar with the system, the pressure at ING Bank's data center was higher than expected, and produced a loud sound when rapidly expelled through tiny holes (think about the noise a steam engine releases). The bank monitored the sound and it was very loud, a source familiar with the system told us. “It was as high as their equipment could monitor, over 130dB”.

Sound means vibration, and this is what damaged the hard drives. The HDD cases started to vibrate, and the vibration was transmitted to the read/write heads, causing them to go off the data tracks. “The inert gas deployment procedure has severely and surprisingly affected several servers and our storage equipment,” ING said in a press release.
ing  hardware  outages  hard-drives  fire  fire-suppression  vibration  data-centers  storage 
september 2016 by jm
Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
august 2016 by jm
Google Cloud Status
Ouch, multi-region outage:
At 14:50 Pacific Time on April 11th, our engineers removed an unused GCE IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network. By itself, this sort of change was harmless and had been performed previously without incident. However, on this occasion our network configuration management software detected an inconsistency in the newly supplied configuration. The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management. In attempting to resolve this inconsistency the network management software is designed to ‘fail safe’ and revert to its current configuration rather than proceeding with the new configuration. However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

One of our core principles at Google is ‘defense in depth’, and Google’s networking systems have a number of safeguards to prevent them from propagating incorrect or invalid configurations in the event of an upstream failure or bug. These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
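
The canary-then-progressive-rollout safeguard described above can be sketched roughly as follows. This is an illustrative Python sketch, not Google's implementation: `deploy` and `healthy` are hypothetical callables supplied by the caller, and `batch_fraction` is an assumed parameter. The point is that a failed health check raises, so the verdict cannot be silently lost before the push continues.

```python
def progressive_rollout(config, sites, deploy, healthy, batch_fraction=0.25):
    """Deploy `config` canary-first and return the sites that received it.

    `deploy(site, config)` and `healthy(site)` are hypothetical callables.
    A RuntimeError is raised as soon as any health check fails, so a bad
    verdict always stops the push.
    """
    if not sites:
        return []
    canary, rest = sites[0], sites[1:]
    deploy(canary, config)
    if not healthy(canary):
        raise RuntimeError(f"canary {canary} unhealthy; rollout aborted")
    done = [canary]
    # Push to only a fraction of the sites at a time.
    batch_size = max(1, int(len(sites) * batch_fraction))
    for start in range(0, len(rest), batch_size):
        batch = rest[start:start + batch_size]
        for site in batch:
            deploy(site, config)
        bad = [s for s in batch if not healthy(s)]
        if bad:
            raise RuntimeError(f"unhealthy after deploy: {bad}; rollout halted")
        done.extend(batch)
    return done
```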
multi-region  outages  google  ops  postmortems  gce  cloud  ip  networking  cascading-failures  bugs 
april 2016 by jm
Alarm design: From nuclear power to WebOps
Imagine you are an operator in a nuclear power control room. An accident has started to unfold. During the first few minutes, more than 100 alarms go off, and there is no system for suppressing the unimportant signals so that you can concentrate on the significant alarms. Information is not presented clearly; for example, although the pressure and temperature within the reactor coolant system are shown, there is no direct indication that the combination of pressure and temperature mean that the cooling water is turning into steam. There are over 50 alarms lit in the control room, and the computer printer registering alarms is running more than 2 hours behind the events.

This was the basic scenario facing the control room operators during the Three Mile Island (TMI) partial nuclear meltdown in 1979. The Report of the President’s Commission stated that, “Overall, little attention had been paid to the interaction between human beings and machines under the rapidly changing and confusing circumstances of an accident” (p. 11). The TMI control room operator on the day, Craig Faust, recalled for the Commission his reaction to the incessant alarms: “I would have liked to have thrown away the alarm panel. It wasn’t giving us any useful information”. It was the first major illustration of the alarm problem, and the accident triggered a flurry of human factors/ergonomics (HF/E) activity.

A familiar topic for this ex-member of the Amazon network monitoring team...
ergonomics  human-factors  ui  ux  alarms  alerts  alerting  three-mile-island  nuclear-power  safety  outages  ops 
november 2015 by jm
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
There was a breakdown in communication between the developer who requested the index migration and the database operator who deleted the old index. Instead of working on the migration together, they communicated in an implicit way through flawed tooling. The dashboard that surfaced the migration request was missing important context: the reason for the requested deletion, the dependency on another index’s creation, and the criticality of the index for API traffic. Indeed, the database operator didn’t have a way to check whether the index had recently been used for a query.

Good demo of how the Etsy-style chatops deployment approach would have helped avoid this risk.
stripe  postmortem  outages  databases  indexes  deployment  chatops  deploy  ops 
october 2015 by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but: tl;dr: monitoring oversight, followed by a transient network glitch triggering IPC timeouts, which increased load due to lack of circuit breakers, creating a cascading failure
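
For reference, a minimal circuit breaker looks something like the Python sketch below. This is a generic illustration of the pattern, not anything from the AWS write-up; the class name, thresholds and the injectable `clock` are all assumptions. After `max_failures` consecutive errors it stops calling the dependency for `reset_after` seconds, instead of piling retries onto something that is already struggling.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single further failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```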
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
september 2015 by jm
Call me Maybe: Chronos
Chronos (the Mesos distributed scheduler) comes out looking pretty crappy here
aphyr  mesos  chronos  cron  scheduling  outages  ops  jepsen  testing  partitions  cap 
august 2015 by jm
Introducing Nurse: Auto-Remediation at LinkedIn
Interesting to hear about auto-remediation in prod -- we built a (very targeted) auto-remediation system in Amazon on the Network Monitoring team, but this is much bigger in focus
nurse  auto-remediation  outages  linkedin  ops  monitoring 
august 2015 by jm
Mikhail Panchenko's thoughts on the July 2015 CircleCI outage
an excellent followup operational post on CircleCI's "database is not a queue" outage
database-is-not-a-queue  mysql  sql  databases  ops  outages  postmortems 
july 2015 by jm
Call me maybe: Aerospike
'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive–on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'
aerospike  outages  cap  testing  jepsen  aphyr  databases  storage  reliability 
may 2015 by jm
Bigcommerce Status Page blasts IBM Softlayer Object Storage service
This is pretty heavy stuff:
Bigcommerce engineers have been very pro-active in working with our storage provider, IBM Softlayer, in finding solutions. Unfortunately, it takes two parties to come to a solution. In this case, IBM Softlayer intentionally let their Object Storage cluster fall into disrepair and chose not to scale it. This has impacted Bigcommerce, IBM and many other Softlayer customers. Our engineers placed too much trust in IBM Softlayer and that's on us. However, the catastrophic failures to see metrics and rapidly scale capacity, the decisions to let hard drives sit at 90% utilization for weeks and months, the cascading failures of an undersized cluster of 52 nodes for the busiest data center in their business speaks to IBM Softlayer’s lack of concern for their customers. We found this out 3 days ago.

(via Oisin)
softlayer  bigcommerce  outages  shambles  ibm  fail  object-storage  storage  iaas  cloud 
april 2015 by jm
When S3's eventual consistency is REALLY eventual
a consistency outage in S3 last year, resulting in about 40 objects failing read-after-write consistency for a duration of about 23 hours
s3  eventual-consistency  aws  consistency  read-after-writes  bugs  outages  stackdriver 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
You Cannot Have Exactly-Once Delivery
Cut out and keep:
Within the context of a distributed system, you cannot have exactly-once message delivery. Web browser and server? Distributed. Server and database? Distributed. Server and message queue? Distributed. You cannot have exactly-once delivery semantics in any of these situations.
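
The usual workaround (my gloss, not part of the quote) is to accept at-least-once delivery and make the processing idempotent, so redelivered messages are harmless. A minimal Python sketch, with `IdempotentConsumer` and its fields as hypothetical names:

```python
class IdempotentConsumer:
    """Applies each message at most once, keyed by a sender-chosen ID."""

    def __init__(self):
        self.seen = set()   # IDs of messages already applied
        self.balance = 0    # the state the messages mutate

    def handle(self, message_id, amount):
        # Returns True if the message was applied, False if it was a
        # duplicate delivery that was ignored.
        if message_id in self.seen:
            return False
        self.balance += amount
        self.seen.add(message_id)
        return True
```

In a real system the "seen" record and the side effect have to be committed atomically (e.g. in the same database transaction), otherwise a crash between the two brings the problem straight back.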
distributed  distcomp  exactly-once-delivery  networking  outages  network-partitions  byzantine-generals  reference 
march 2015 by jm
Apple Appstore STATUS_CODE_ERROR causes worldwide service problems
Particularly notable for this horrific misfeature, noted by jgc:
I can't commit code at CloudFlare because we use two-factor auth for the VPN (and everything else) and non-Apple apps on my iPhone are asking for my iTunes password. Tried airplane mode and apps simply don't load at all!

That is a _disastrous_ policy choice by Apple. Does this mean Apple can shut down third-party app operation on iOS devices worldwide should they feel like it?
2fa  authy  apps  ios  apple  ownership  itunes  outages  appstore  fail  jgc 
march 2015 by jm
2015-02-19 GCE outage
40 minutes of multi-zone network outage for majority of instances.

'The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information. The cause of this interruption is still under active investigation. Cached route information provided a defense in depth against missing updates, but GCE VM egress traffic started to be dropped as the cached routes expired.'

I wonder if Google Pimms fired the alarms for this ;)
google  outages  gce  networking  routing  pimms  multi-az  cloud 
february 2015 by jm
Why You Shouldn’t Use ZooKeeper for Service Discovery
In CAP terms, ZooKeeper is CP, meaning that it’s consistent in the face of partitions, not available. For many things that ZooKeeper does, this is a necessary trade-off. Since ZooKeeper is first and foremost a coordination service, having an eventually consistent design (being AP) would be a horrible design decision. Its core consensus algorithm, Zab, is therefore all about consistency. For coordination, that’s great. But for service discovery it’s better to have information that may contain falsehoods than to have no information at all. It is much better to know what servers were available for a given service five minutes ago than to have no idea what things looked like due to a transient network partition. The guarantees that ZooKeeper makes for coordination are the wrong ones for service discovery, and it hurts you to have them.

Yes! I've been saying this for months -- good to see others concurring.
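
A rough Python sketch of the "stale information beats no information" behaviour, assuming a hypothetical `lookup(service)` callable that raises `ConnectionError` when the registry is unreachable. None of these names come from ZooKeeper or Eureka.

```python
class CachingResolver:
    """Service lookup that falls back to the last known answer."""

    def __init__(self, lookup):
        self.lookup = lookup  # hypothetical registry client call
        self.cache = {}       # service name -> last known server list

    def resolve(self, service):
        try:
            servers = self.lookup(service)
        except ConnectionError:
            # Registry unreachable (e.g. a partition): serve the possibly
            # stale list if we have one, rather than failing outright.
            if service in self.cache:
                return self.cache[service]
            raise
        self.cache[service] = list(servers)
        return self.cache[service]
```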
architecture  zookeeper  eureka  outages  network-partitions  service-discovery  cap  partitions 
december 2014 by jm
Stellar/Ripple suffer a failure of their consensus system, resulting in a split-brain failure
Prof. Mazières’s research indicated some risk that consensus could fail, though we were not certain if the required circumstances for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the roll back of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B the replay wasn’t possible.
consensus  distcomp  stellar  ripple  split-brain  postmortems  outages  ledger-fork  payment 
december 2014 by jm
How Curiosity, Luck, and the Flip of a Switch Saved the Moon Program | Motherboard
"SCE to off?" someone said. The switch was so obscure that neither of his bosses knew what he was talking about. "What the hell's that," blurted out Gerald Carr, who was in charge of communicating with the capsule. The rookie flight director, Gerry Griffin, didn't know either.

Sixty seconds had passed since the initial lightning strike. No one else knew what to do. The call to abort was fast approaching. 

Finally, Carr reluctantly gave the order in a voice far cooler than the moment. "Apollo 12, Houston, try SCE to Auxiliary, over."
spaceflight  stories  apollo  sce-to-aux  power  lightning  weather  outages  simulation  training  nasa 
november 2014 by jm
Update on Azure Storage Service Interruption
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.

I'm really surprised MS deployment procedures allow a change to be rolled out globally across multiple regions on a single day. I suspect they soon won't.
change-management  cm  microsoft  outages  postmortems  azure  deployment  multi-region  flighting  azure-storage 
november 2014 by jm
Microsoft Azure 9-hour outage
'From 19 Nov, 2014 00:52 to 05:50 UTC a subset of customers using Storage, Virtual Machines, SQL Geo-Restore, SQL Import/export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsights, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services in West US and West Europe experienced connectivity issues. This incident has now been mitigated.'

There was knock-on impact until 11:00 UTC (storage in N Europe), 11:45 UTC (websites, West Europe), and 09:15 UTC (storage, West Europe), from the looks of things. Should be an interesting postmortem.
outages  azure  microsoft  ops 
november 2014 by jm
'Hosted Status Pages for Your Company'. We use these guys in $work, and their service is fantastic -- it's a line of javascript in the page template which will easily allow you to add a "service degraded" banner when things go pear-shaped, along with an external status site for when things get really messy. They've done a good clean job.
monitoring  server  status  outages  uptime  saas  infrastructure 
november 2014 by jm
Stephanie Dean on event management and incident response
I asked around my ex-Amazon mates on twitter about good docs on incident response practices outside the "iron curtain", and they pointed me at this blog (which I didn't realise existed).

Stephanie Dean was the front-line ops manager for Amazon for many years, over the time where they basically *fixed* their availability problems. She has since moved on to Facebook, Demonware, and Twitter. She really knows her stuff and this blog is FULL of great details of how they ran (and still run) front-line ops teams in Amazon.
ops  incident-response  outages  event-management  amazon  stephanie-dean  techops  tos  sev1 
october 2014 by jm
Game Day Exercises at Stripe: Learning from `kill -9`
We’ve started running game day exercises at Stripe. During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster. We were very surprised by this, but grateful to have found the problem in testing. This result and others from this exercise convinced us that game days like these are quite valuable, and we would highly recommend them for others.

Excellent post. Game days are a great idea. Also: massive Redis clustering fail
game-days  redis  testing  stripe  outages  ops  kill-9  failover 
october 2014 by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
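
The PIE arithmetic is simple enough to sketch in a few lines of Python. The formula and the 1-to-5 scale are from the post; the `ActionItem` class, the `prioritise` helper and the example items are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    probability: int  # probability of recurrence, 1 (low) to 5 (high)
    impact: int       # impact of recurrence, 1 to 5
    ease: int         # ease of addressing, 1 to 5

    def __post_init__(self):
        for name in ("probability", "impact", "ease"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def pie(self):
        # PIE = Probability * Impact * Ease, so scores range from 1 to 125.
        return self.probability * self.impact * self.ease


def prioritise(items):
    """Return the action items sorted by PIE score, highest first."""
    return sorted(items, key=lambda item: item.pie, reverse=True)
```

For example, a vague "make system X better" refactoring scored (2, 3, 1) gets a PIE of 6, while an easy, high-value alarm scored (4, 4, 5) gets 80 and sorts to the top.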
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
august 2014 by jm
The Network is Reliable - ACM Queue
Peter Bailis and Kyle Kingsbury accumulate a comprehensive, informal survey of real-world network failures observed in production. I remember that April 2011 EBS outage...
ec2  aws  networking  outages  partitions  jepsen  pbailis  aphyr  acm-queue  acm  survey  ops 
july 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
SmartStack vs. Consul
One of the SmartStack developers at AirBNB responds to Consul's comments. FWIW, we use SmartStack in Swrve and it works pretty well...
smartstack  airbnb  ops  consul  serf  load-balancing  availability  resiliency  network-partitions  outages 
may 2014 by jm
Adrian Cockcroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
'Testing applications under slow or flaky network conditions can be difficult and time consuming. Blockade aims to make that easier. A config file defines a number of docker containers and a command line tool makes introducing controlled network problems simple.'

Open-source release from Dell's Cloud Manager team (ex-Enstratius), inspired by aphyr's Jepsen. Simulates packet loss using "tc netem", so no ability to e.g. drop packets on certain flows or certain ports. Still, looks very usable -- great stuff.
testing  docker  networking  distributed  distcomp  enstratius  jepsen  network  outages  partitions  cap  via:lusis 
february 2014 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In your fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Introducing Chaos to C*
Autoremediation, ie. auto-replacement, of Cassandra nodes in production at Netflix
ops  autoremediation  outages  remediation  cassandra  storage  netflix  chaos-monkey 
october 2013 by jm
_Availability in Globally Distributed Storage Systems_ [pdf]
empirical BigTable and GFS failure numbers from Google are orders of magnitude higher than naïve independent-failure models. (via kragen)
via:kragen  failure  bigtable  gfs  statistics  outages  reliability 
september 2013 by jm
Interview with the Github Elasticsearch Team
good background on Github's Elasticsearch scaling efforts. Some rather horrific split-brain problems under load, and crashes due to OpenJDK bugs (sounds like OpenJDK *still* isn't ready for production). painful
elasticsearch  github  search  ops  scaling  split-brain  outages  openjdk  java  jdk  jvm 
september 2013 by jm
Information on Google App Engine's recent US datacenter relocations - Google Groups
or, really, 'why we had some glitches and outages recently'. A few interesting tidbits about GAE innards though (via Bill De hOra)
gae  google  app-engine  outages  ops  paxos  eventual-consistency  replication  storage  hrd 
august 2013 by jm
the infamous 2008 S3 single-bit-corruption outage
Neat, I didn't realise this was publicly visible. A single corrupted bit infected the S3 gossip network, taking down the whole S3 service in (iirc) one region:
We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether [gossip state] had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.

This is why you checksum all the things ;)
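A minimal Python sketch of action (d) from the quote, i.e. checksumming state messages and rejecting the ones that fail verification. This is not Amazon's implementation: the message layout, the `wrap`/`unwrap` names and the choice of SHA-256 (the quote mentions MD5 for customer objects) are all assumptions made for illustration.

```python
import hashlib
import json


def wrap(state: dict) -> bytes:
    """Serialise a state message and prefix it with a hex digest (hypothetical layout)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def unwrap(message: bytes) -> dict:
    """Verify the digest before trusting the payload; reject the message on mismatch."""
    digest, sep, payload = message.partition(b"\n")
    if not sep or hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("corrupt state message rejected")
    return json.loads(payload)
```

With this in place, a single flipped bit in the payload makes `unwrap` raise instead of handing bad state to the rest of the system. Note this only catches accidental corruption, not deliberate tampering.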
s3  aws  post-mortems  network  outages  failures  corruption  grey-failures  amazon  gossip 
june 2013 by jm
The network is reliable
Aphyr and Peter Bailis collect an authoritative list of known network partition and outage cases from published post-mortem data:

This post is meant as a reference point -- to illustrate that, according to a wide range of accounts, partitions occur in many real-world environments. Processes, servers, NICs, switches, local and wide area networks can all fail, and the resulting economic consequences are real. Network outages can suddenly arise in systems that are stable for months at a time, during routine upgrades, or as a result of emergency maintenance. The consequences of these outages range from increased latency and temporary unavailability to inconsistency, corruption, and data loss. Split-brain is not an academic concern: it happens to all kinds of systems -- sometimes for days on end. Partitions deserve serious consideration.

I honestly cannot understand people who didn't think this was the case. 3 years reading (and occasionally auto-cutting) Amazon's network-outage tickets as part of AWS network monitoring will do that to you I guess ;)
networking  outages  partition  cap  failure  fault-tolerance 
june 2013 by jm
ESB Networks | Power Check | Service Interruptions Map
real-time service outage information on a map, from Ireland's power network
esb  ireland  mapping  data  outages  service  power 
april 2013 by jm
GMail partial outage - Dec 10 2012 incident report [PDF]
TL;DR: a bad load balancer change was deployed globally, causing the impact. 21-minute time to detection. Single-location rollout is now on the cards
gmail  google  coe  incidents  postmortems  outages 
december 2012 by jm
Joyent Services Back After 8 Day Outage
Lest we forget. I think it was 10 days in total once everything was resolved
joyent  outages  bingodisk  strongspace  cloud  solaris  zfs 
july 2012 by jm
Microsoft's Azure Feb 29th, 2012 outage postmortem
'The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.' This caused cascading failures throughout the fleet. Ouch -- should have been spotted during code review
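The bug in the quote is easy to reproduce. A minimal Python sketch, assuming nothing about Azure's actual code: `naive_valid_to` mirrors the "add one to the year" logic, and `safe_valid_to` is one possible fix (falling back to February 28), not necessarily the one Microsoft shipped. Both function names are hypothetical.

```python
from datetime import date


def naive_valid_to(today: date) -> date:
    # Mirrors the quoted bug: keep month and day, add one to the year.
    # On February 29 the target date does not exist, so this raises ValueError.
    return today.replace(year=today.year + 1)


def safe_valid_to(today: date) -> date:
    # One possible fix (an assumption): fall back to February 28
    # when the following year has no February 29.
    try:
        return today.replace(year=today.year + 1)
    except ValueError:
        return today.replace(year=today.year + 1, day=28)
```

Calling `naive_valid_to(date(2012, 2, 29))` fails with a `ValueError`, which is the Python equivalent of the certificate creation failure described above.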
azure  dev  dates  leap-years  via:fanf  microsoft  outages  post-mortem  analysis  failure 
march 2012 by jm
Turbocharging Solr Index Replication with BitTorrent
Etsy now replicating their multi-GB search index across the search farm using BitTorrent. Why not Multicast? 'multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.' fun!
etsy  multicast  sev1  bittorrent  search  solr  rsync  scaling  outages 
february 2012 by jm
GitHub outage post-mortem
continuous-integration system was accidentally run against the production db. result: the entire production database got wiped. ouuuuch
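One cheap guard against this class of accident, sketched in Python. This is not taken from the GitHub post-mortem: the allowlist, the `assert_safe_to_wipe` name and the URL format are all hypothetical. The idea is that any destructive test setup refuses to run unless the database host is explicitly known to be disposable.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts a test run is permitted to wipe.
SAFE_TEST_HOSTS = {"localhost", "127.0.0.1"}


def assert_safe_to_wipe(database_url: str) -> None:
    """Raise before any destructive step if the target host is not allowlisted."""
    host = urlparse(database_url).hostname
    if host not in SAFE_TEST_HOSTS:
        raise RuntimeError(f"refusing to run destructive tests against {host!r}")
```

An allowlist fails closed: a new or misconfigured host is rejected by default, where a blocklist of known production hosts would let it through.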
ouch  github  outages  post-mortem  databases  testing  c-i  production  firewalls  from delicious
november 2010 by jm
Post-mortem for February 24th, 2010 outage - Google App Engine
extremely detailed; power outage in the primary DC resulted in a degraded fleet, and on-calls didn't have up-to-date on-call docs to respond correctly
google  gae  appengine  outages  post-mortems  multi-dc  reliability  distcomp  fleets  on-call  from delicious
march 2010 by jm
