peakscale + postmortem + aws (36 bookmarks)

Summary of the AWS Service Event in the Sydney Region
"The service disruption primarily affected EC2 instances and their associated Elastic Block Store (“EBS”) volumes running in a single Availability Zone. "
aws  postmortem 
june 2016 by peakscale
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem  aws  networks 
october 2015 by peakscale
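The detection logic in this excerpt boils down to watching AS paths for transit networks that should not be there. A minimal sketch of that kind of check, with illustrative ASNs, prefix, and update format (assumptions, not data from the report):

```python
# Flag BGP updates toward a prefix whose AS path contains transit ASNs outside
# the set normally seen for it. ASNs and prefixes here are illustrative only.

EXPECTED_ASNS = {3356, 16509}            # illustrative: usual upstream + origin ASNs

def suspicious_updates(updates):
    """Yield updates whose AS path contains ASNs outside the expected set."""
    for prefix, as_path in updates:
        unexpected = [asn for asn in as_path if asn not in EXPECTED_ASNS]
        if unexpected:
            yield prefix, as_path, unexpected

sample = [
    ("54.230.0.0/16", [3356, 16509]),              # looks normal
    ("54.230.0.0/16", [3356, 33083, 5580, 16509]), # two new ASNs appear mid-path
]
for prefix, path, odd in suspicious_updates(sample):
    print(f"possible route leak toward {prefix}: unexpected ASNs {odd} in path {path}")
```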
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
"But, on Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests"

"With a larger size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers. We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
postmortem  aws 
september 2015 by peakscale
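The mechanism in these two excerpts is a fixed time allowance on a membership fetch: if the metadata service cannot answer inside it, the storage server treats its view as stale and stops serving. A minimal sketch of that pattern; the names, timeout value, and stubbed metadata call are all assumed for illustration, not AWS internals:

```python
import time

class StorageServer:
    def __init__(self):
        self.membership = None
        self.serving = False

def refresh_membership(server, get_membership, allowance_s=2.0):
    """Fetch membership under a fixed time allowance; stop serving if it is exceeded."""
    start = time.monotonic()
    membership = get_membership()        # call into the (hypothetical) metadata service
    elapsed = time.monotonic() - start
    if elapsed > allowance_s:
        # The answer arrived too late to be trusted: remove ourselves from service,
        # which is the self-removal behaviour the excerpt describes.
        server.serving = False
    else:
        server.membership = membership
        server.serving = True

server = StorageServer()
refresh_membership(server, get_membership=lambda: ["node-a", "node-b"])
print(server.serving)    # True when the stub answers quickly
```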
EC2 Maintenance Update II
"I'd like to give you an update on the EC2 Maintenance announcement that I posted last week. Late yesterday (September 30th), we completed a reboot of less than 10% of the EC2 fleet to protect you from any security risks associated with the Xen Security Advisory (XSA-108)."

"This Xen Security Advisory was embargoed until a few minutes ago; we were obligated to keep all information about the issue confidential until it was published. The Xen community (in which we are active participants) has designed a two-stage disclosure process that operates as follows:"
aws  postmortem  security 
october 2014 by peakscale
A narrowly averted disaster
" BF, being a large operation with many many servers, has AWS contacts and resources that a little site like Stellar does not. Raymond got in contact with those resources and eventually an answer came back: regardless of the bad all-Adam-faves backup, you should be able to restore the database back to a point in time within the past 24 hours, perhaps even further. [I don’t want to ding Amazon here, because I love AWS and their support people were very helpful in resolving this matter, but the docs could be clearer on this point. "
postmortem  aws 
august 2013 by peakscale
Summary of the AWS Service Event in the US East Region
"service disruption which occurred last Friday night, June 29th, in one of our Availability Zones in the US East-1 Region. The event was triggered during a large scale electrical storm which swept through the Northern Virginia area"
postmortem  aws 
july 2013 by peakscale
Amazon S3 Availability Event: July 20, 2008
"Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system."

"On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests."
postmortem  aws  storage 
january 2013 by peakscale
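The gossip mechanism named in the excerpt is generic enough to sketch. The toy push-style round below is a textbook illustration, not S3's actual implementation: each node merges its view of server state with a couple of random peers per round, so news such as "server X failed" spreads quickly.

```python
import random

def gossip_round(views, fanout=2):
    """views: {node_id: {server_id: (version, state)}}. One synchronous push round."""
    nodes = list(views)
    for node in nodes:
        peers = random.sample([n for n in nodes if n != node],
                              k=min(fanout, len(nodes) - 1))
        for peer in peers:
            merged = dict(views[peer])
            for server, (version, state) in views[node].items():
                if server not in merged or merged[server][0] < version:
                    merged[server] = (version, state)   # keep the newer observation
            views[peer] = merged

views = {
    "n1": {"s42": (3, "failed")},      # n1 has newer news about server s42
    "n2": {"s42": (1, "healthy")},
    "n3": {},
}
for _ in range(3):                     # a few rounds spread the newer state everywhere
    gossip_round(views)
print(views["n3"].get("s42"))          # typically (3, 'failed') after a few rounds
```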
Summary of the Amazon SimpleDB Service Disruption
"In this event, multiple storage nodes became unavailable simultaneously in a single data center (after power was lost to the servers on which these nodes lived).

While SimpleDB can handle multiple simultaneous node failures, and has successfully endured larger infrastructure failures in the past without incident, the server failure pattern in this event resulted in a sudden and significant increase in load on the lock service as it rapidly de-registered the failed storage nodes from their respective replication groups.

This simultaneous volume resulted in elevated handshake latencies between healthy SimpleDB nodes and the lock service, and the nodes were not able to complete their handshakes prior to exceeding a set “handshake timeout” value."
postmortem  aws 
january 2013 by peakscale
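The failure mode here is a fixed handshake timeout that holds at normal load but is exceeded when a burst of de-registrations queues up behind the lock service. A back-of-envelope sketch with made-up numbers:

```python
# All numbers are assumed for illustration; they are not from the SimpleDB report.
HANDSHAKE_TIMEOUT_S = 1.0
SERVICE_TIME_S = 0.02          # lock service work per request (assumed)

def worst_case_latency(queued_requests):
    """Latency seen by the last request in a burst handled one at a time."""
    return queued_requests * SERVICE_TIME_S

for burst in (10, 40, 80):     # e.g. de-registrations arriving at once
    latency = worst_case_latency(burst)
    status = "ok" if latency <= HANDSHAKE_TIMEOUT_S else "handshake timeout exceeded"
    print(f"burst of {burst:3d} -> worst-case latency {latency:.2f}s ({status})")
```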
Summary of the October 22, 2012 AWS Service Event in the US-East Region
"The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers"
postmortem  aws  storage  monitoring  networks 
january 2013 by peakscale
Outage Post-Mortem | PagerDuty Blog
"yesterday the system did not perform as designed. While we’re looking forward to reading AWS’s official post mortem, our own investigation indicates that at least three nominally independent AZs in US-East-1 all simultaneously dropped from the Internet for 30 minutes. This left us with no hardware to accept incoming events, nor to dispatch notifications for events we’d already received."
postmortem  aws 
january 2013 by peakscale
All Services Up; Previous Service Outage Postmortem | Apigee
"our architecture is still somewhat vulnerable at the data center level. Last night Amazon Web Services had a power outage in an Availability Zone in the US East Availability Region that by sheer chance took down Apigee's entire ServiceNet cluster "

"Addendum: We did also discover that we weren't properly monitoring the service that handles our SSL traffic, resulting in a longer downtime of roughly four hours. "
postmortem  aws 
january 2013 by peakscale
Summary of the Amazon EC2, Amazon EBS, and Amazon RDS Service Event in the EU West Region
"disruption primarily affected EC2 instances, RDS instances, and a subset of EBS volumes in a single Availability Zone in the EU West Region."

"UPSs that provide a short period of battery power quickly drained and we lost power to almost all of the EC2 instances and 58% of the EBS volumes in that Availability Zone."
postmortem  aws 
january 2013 by peakscale
Outage Post Mortem – March 15 | PagerDuty Blog
"region-wide failure occurred early this morning, in which AWS suffered internet connectivity issues across all of its US-East-1 region at once."
postmortem  aws 
january 2013 by peakscale
Netflix Tech Blog: A Closer Look At The Christmas Eve Outage
Stung by AWS ELB outage. "Netflix is designed to handle failure of all or part of a single availability zone in a region as we run across three zones and operate with no loss of functionality on two.  We are working on ways of extending our resiliency to handle partial or complete regional outages."
aws  postmortem  loadbalancing 
january 2013 by peakscale
Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region
"The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval. We have reverted this incorrect configuration and all access to production ELB data will require a per-incident CM approval."
aws  postmortem  configuration  loadbalancing 
january 2013 by peakscale
The Netflix Tech Blog: Post-mortem of October 22, 2012 AWS degradation
"Netflix however, while not completely unscathed, handled the outage with very little customer impact. We did some things well and could have done some things better, and we'd like to share with you the timeline of the outage from our perspective and some of the best practices we used to minimize customer impact."
aws  postmortem 
october 2012 by peakscale
Summary of the October 22, 2012 AWS Service Event in the US-East Region
"While not noticed at the time, the DNS update did not successfully propagate to all of the internal DNS servers"
aws  postmortem 
october 2012 by peakscale
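The lesson in this one-liner is that a DNS change can land on some internal resolvers and not others without anyone noticing. A small propagation check is easy to sketch; the resolver IPs and record name below are placeholders, and dnspython is an assumed dependency (the report does not name any tooling):

```python
import dns.resolver   # pip install dnspython (>= 2.0 assumed)

RESOLVERS = ["10.0.0.2", "10.0.1.2", "10.0.2.2"]      # placeholder internal resolvers
NAME = "service.internal.example"                      # placeholder record

def answers_by_resolver(name, resolvers):
    """Query each resolver directly and collect its answer (or its error)."""
    results = {}
    for ip in resolvers:
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [ip]
        try:
            results[ip] = sorted(rr.address for rr in r.resolve(name, "A", lifetime=2.0))
        except Exception as exc:                       # timeout, NXDOMAIN, etc.
            results[ip] = f"error: {exc}"
    return results

results = answers_by_resolver(NAME, RESOLVERS)
if len({str(v) for v in results.values()}) > 1:
    print("resolvers disagree:", results)              # the condition that went unnoticed
```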
Datadog - Amazon hiccups, mayhem ensues
"1 .In the short term, in the few places where it’s still the case, we will do without shared block storage, which was at the root of this incident.

2. Datadog’s infrastructure is already distributed across multiple zones; that has served well in the past to survive a number of similar outages. We have already planned to go even further to increase our availability."
aws  postmortem 
october 2012 by peakscale
Applying 5 Whys to Amazon EC2 Outage
"Of several impairments and service disruptions caused by the outage, an hour-long unavailability of us-east-1 control plane is in my opinion the most important."

"Let’s apply 5 whys analysis to this impact. All answers below are direct quotes from the report, with my occasional notes where needed."
aws  postmortem 
october 2012 by peakscale
Heroku Status: Widespread Application Outage
"Starting last Thursday, Heroku suffered the worst outage in the nearly four years we've been operating. Large production apps using our dedicated database service may have experienced up to 16 hours of operational downtime. "
aws  postmortem  paas 
october 2012 by peakscale
The Netflix Tech Blog: Lessons Netflix Learned from the AWS Storm
"In our middle tier load-balancing, we had a cascading failure that was caused by a feature we had implemented to account for other types of failures."
postmortem  aws 
july 2012 by peakscale
Summary of the AWS Service Event in the US East Region
" The second form of impact was degradation of service “control planes” which allow customers to take action and create, remove, or change resources across the Region. While control planes aren’t required for the ongoing use of resources, they are particularly useful in outages where customers are trying to react to the loss of resources in one Availability Zone by moving to another."
postmortem  aws  ha  loadbalancing 
july 2012 by peakscale
The Netflix Tech Blog: Lessons Netflix Learned from the AWS Outage
Lessons Learned
* Create More Failures
* Automate Zone Fail Over and Recovery
* Multiple Region Support
* Avoid EBS Dependencies
aws  cloudgeneral  ha  postmortem 
april 2011 by peakscale
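"Create More Failures" is the Chaos Monkey idea: regularly kill instances in production so the zone fail-over paths stay exercised. A minimal sketch of that in boto3; this is not Netflix's actual tooling, the AZ is a placeholder, and DryRun is left on so nothing is actually terminated:

```python
import random
import boto3
from botocore.exceptions import ClientError

def terminate_random_instance(az, dry_run=True):
    """Pick one running instance in the given AZ and terminate it (DryRun by default)."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "availability-zone", "Values": [az]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, EC2 reports a *successful* permission check as an error.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim

# Needs AWS credentials; with dry_run=True no instance is actually terminated.
print(terminate_random_instance("us-east-1a"))
```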
Summary of the Amazon EC2 and Amazon RDS Service Disruption
"primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as “stuck” volumes. This caused instances trying to use these affected volumes to also get “stuck” when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring."
aws  postmortem 
april 2011 by peakscale
