LiveChat Status - Connection issues
"identified the issue"

"working on a fix"

"fix deployed"

"connection issues caused by memory overload"
postmortem 
october 2015
StatusPage.io Status - Site under heavy load. Pages may be slow or unresponsive.

"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
postmortem 
october 2015
Switch Status - Inbound Calling Experienced Intermittent Issues
"5 minute issue was caused by a major traffic spike that impacted some inbound call attempts. Further investigation into the nature of these calls is ongoing"
postmortem 
october 2015
Greenhouse Status - AWS US-East-1 Partial Outage
"Our cloud hosting provider (AWS) is currently experiencing significant service degradation. The knock-on effect is decreased response times and reliability for the Greenhouse application, API, and job boards. "
postmortem 
october 2015
VictorOps Status - Isolated Service Disruption
"our database cluster encountered an error condition that caused it to stop processing queries for approximately 30 minutes. During that time, our WebUI and mobile client access was unavailable, and we were unable to process and deliver alerts. We have identified configuration settings in the cluster that will prevent a recurrence of the error condition"
postmortem 
october 2015
Greenhouse Status - Outage
"a new release triggered a doubling of connections to our database--this caused the database to become saturated with connections, causing approximately 10 minutes of system-wide downtime"
postmortem 
october 2015
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem  aws  networks 
october 2015
June 15th Outage — HipChat Blog
" the recent Mac client release which had the much anticipated “multiple account” feature also had a subtle reconnection bug that only manifested under very high load"
postmortem 
october 2015
Mikhail Panchenko [discussion of a long past Flickr problem]
" Inserting data into RDMS indexes is relatively expensive, and usually involves at least some locking. Note that dequeueing jobs also involves an index update, so even marking jobs as in progress or deleting on completion runs into the same locks. So now you have contention from a bunch of producers on a single resource, the updates to which are getting more and more expensive and time consuming. Before long, you're spending more time updating the job index than you are actually performing the jobs. The "queue" essentially fails to perform one of its very basic functions."
postmortem 
october 2015
Kafkapocalypse: a postmortem on our service outage | Parse.ly
" Kafka is so efficient about its resource disk and CPU consumption, that we were running Kafka brokers on relatively modest Amazon EC2 nodes that did not have particularly high network capabilities. At some point, we were hitting operating system network limits and the brokers would simply become unavailable. These limits were probably enforced by Linux, Amazon’s Xen hypervisor, the host machine’s network hardware, or some combination.

The real problem here isn’t failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes. It was a classic Cascading Failure."
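A back-of-the-envelope model of that correlated failure, with made-up numbers rather than Parse.ly's: once every surviving broker inherits a failed peer's traffic, it crosses the same network ceiling and drops out too.

```python
# Illustrative numbers only: brokers running near their network ceiling
# cannot absorb a failed peer's traffic, so one failure takes out the rest.
total_traffic_mbps = 900    # aggregate consumer traffic (made up)
per_node_limit_mbps = 250   # per-broker network ceiling (made up)
brokers = 4                 # healthy: 900/4 = 225 Mb/s each, just under the limit

brokers -= 1                # one broker fails for an unrelated reason
while brokers > 0 and total_traffic_mbps / brokers > per_node_limit_mbps:
    print(f"{brokers} brokers left, {total_traffic_mbps / brokers:.0f} Mb/s each: over the limit, another drops")
    brokers -= 1
print(f"cluster ends with {brokers} brokers")   # the cascade runs all the way down
```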
postmortem 
october 2015
Travis CI Status - High queue times on OSX builds (.com and .org)
"When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200."
postmortem 
october 2015
Post-mortem -- S3 Outage | Status.io Blog
"Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken."
postmortem 
october 2015
Incident documentation/20150814-MediaWiki - Wikitech
"This bug was not caught on the beta cluster, because the code-path is exercised when converting text from one language variant to another, which does not happen frequently in that environment."
postmortem 
october 2015
GitLab.com outage on 2015-09-01 | GitLab
"We are still in the dark about the cause of the NFS slowdowns. We see no spikes of any kind of web requests around the slowdowns. The backend server only shows the ext4 errors mentioned above, which do not coincide with the NFS trouble, and no NFS error messages."
postmortem 
october 2015
Opbeat Status - We're experiencing another major database cluster connectivity issue
"During the master database outages, opbeat.com as well as our intake was unavailable."

" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
postmortem 
october 2015
Flying Circus Status - VM performance and stability issues
"After checking the virtualisation servers we saw that many of them had too many virtual machines assigned to them, consuming much more memory than the host actually had. "
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
postmortem 
october 2015
Faithlife Status - Leaf Switch Outage
"we lost all connectivity to the switch. At that point, it failed to continue passing traffic. This should not have been a problem since compute nodes have a link to each switch. The compute nodes should have recognized that link in the aggregation as “down” and discontinued use of that link. However, our compute nodes continued sending traffic to that link. To verify the compute nodes didn’t incorrectly see the downed link as “up”, we physically disconnected the links to the degraded switch. Unfortunately, the compute nodes still attempted to send traffic over that link. "
postmortem  networks 
october 2015
Outage report: 5 September 2015 - PythonAnywhere News
"t looked like a massive earlier spike in read activity across the filesystem had put some system processes into a strange state. User file storage is shared from the file storage system over to the servers where people's websites and consoles actually run over NFS. And the storage itself on the NFS servers uses DRBD for replication. As far as we could determine, a proportion (at least half) of the NFS processes were hanging, and the DRBD processes were running inexplicably slowly. This was causing access to file storage to frequently be slow, and to occasionally fail, for all servers using the file storage."
postmortem 
october 2015
Why did Stack Overflow time out? - Meta Stack Exchange
"A cascading failure of the firewall to apply properly leading keepalived VRRP unable to communicate properly made both load balancers think neither had a peer. This results in a bit of swapping as they fight for ARP. When NY-LB06 "won" that fight, the second failure came into play: the firewall module did not finish updating on the first puppet run meaning the server was fully ready to serve traffic (from a Layer 7/HAProxy standpoint), but was not accepting TCP connections from anyone yet."
postmortem 
october 2015
SNworks Status - Connectivity Issues
"The cache server is super important. Without it, our web servers can serve 10s of requests a second. With it, they can serve 1000s of requests per second.

Large news events + uncached pages = servers not being able to handle traffic demands."
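The arithmetic behind that equation: effective throughput is a weighted harmonic mix of the cached and uncached rates, so it collapses toward the uncached rate as soon as the hit ratio dips. Illustrative numbers only (the 2000/20 req/s figures are rough stand-ins for the "1000s" and "10s" in the quote):

```python
# Rough model of cache dependence: average time per request is a weighted mix
# of cheap (cached) and expensive (uncached) work.
def effective_rps(hit_ratio, cached_rps=2000.0, uncached_rps=20.0):
    seconds_per_request = hit_ratio / cached_rps + (1 - hit_ratio) / uncached_rps
    return 1 / seconds_per_request

for hit_ratio in (0.99, 0.95, 0.80, 0.0):
    print(f"hit ratio {hit_ratio:.0%}: ~{effective_rps(hit_ratio):.0f} req/s")
```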
postmortem 
october 2015
Customer.io Status - Extended outage since 11:41 pm EDT
" After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up.

An error prevented the cluster from returning to "operational"."
postmortem 
october 2015
Partial Photon Cloud outage on 04/30/2015 | Blog | Photon: Multiplayer Made Simple
"The root cause was an outage of Microsoft Azure’s Network Infrastructure"
postmortem 
october 2015
Simplero Status - Down right now
"The root cause was a disk that filled up on a secondary database server.

it happened while I was at event at the Empire State Building. At first I tried fixing it via my iPhone, but I had to realize that I couldn't, and instead hop on a train back home to my laptop."
postmortem 
october 2015
Divshot Status - Serious Platform Outage
"This morning around 7:20am Pacific time, several platform EC2 instances began failing and our load balancer began returning 503 errors. Ordinarily our scaling configuration would terminate and replace unhealthy instances, but for an as-yet-undetermined reason all instances became unhealthy and were not replaced. This caused a widespread outage that lasted for nearly two hours."
postmortem 
october 2015
More Details on Today's Outage [2010]
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid."
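A toy model of that feedback loop, with stand-in dictionaries for the cache and the persistent store (not Facebook's actual systems): when the store itself holds the bad value, every "repair" re-fetches the same bad value, so load multiplies instead of converging.

```python
# Toy model: the cache fix-up path assumes the persistent store is correct,
# so a bad value in the store turns every client read into a store query.
store = {"feature_flag": "INVALID"}     # persistent store holds a bad value
cache = {"feature_flag": "INVALID"}     # ...which was also pushed to the cache
store_queries = 0

def is_valid(value):
    return value != "INVALID"

def get_config(key):
    global store_queries
    value = cache.get(key)
    if not is_valid(value):
        store_queries += 1              # every client takes this slow path...
        value = store[key]              # ...and gets the same invalid value back
        cache[key] = value
    return value

for _ in range(1000):                   # 1000 client reads
    get_config("feature_flag")
print(store_queries)                    # 1000 -- the store is hammered, nothing converges
```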
postmortem 
october 2015
OHIO: Office of Information Technology |Anatomy of a network outage: Aug 20 - Sep 4, 2015
"If an IT system has a hidden weakness, fall opening will expose it.

This year, the 17,000 additional devices that students and staff brought with them to campus did just that, overwhelming our core network for over two weeks."
postmortem  networks 
october 2015
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
"Our automated tooling actually filed it as two separate change requests: one to add a new database index and a second to remove the old database index. Both of these change requests were reflected in our dashboard for database operators, intermingled with many other alerts. The dashboard did not indicate that the deletion request depended on the successful completion of the addition request"
postmortem 
october 2015
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
"But, on Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests"

"With a larger size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers. We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
postmortem  aws 
september 2015
Postmortem for July 27 outage of the Manta service - Blog - Joyent
"Clients experienced very high latency, and ultimately received 500-level errors in response to about 22% of all types of requests, including PUT, GET, and DELETE for both objects and directories. At peak, the error rate approached 27% of all requests, and for most of the window the error rate varied between 19 and 23%."
postmortem 
august 2015
Travis CI Status - Elevated wait times and timeouts for OSX builds
"We noticed an error pointing towards build VM boot timeouts at 23:21 UTC on the 20th. After discussion with our infrastructure provider, it was shown to us that our SAN (a NetApp appliance) was being overloaded due to a spike in disk operations per second."
postmortem 
july 2015
CircleCI Status - DB performance issue
"The degradation in DB performance was a special kind of non-linear, going from "everything is fine" to "fully unresponsive" within 2 minutes. Symptoms included a long list of queued builds, and each query taking a massive amount of time to run, along side many queries timing out."

"CircleCI is written in Clojure, a form of Lisp. One of the major advantages of this type of language is that you can recompile code live at run-time. We typically use Immutable Architecture, in which we deploy pre-baked machine images to put out new code. This works well for keeping the system in a clean state, as part of a continuous delivery model. Unfortunately, when things are on fire, it doesn't allow us to move as quickly as we would like.

This is where Clojure's live patching comes in. By connecting directly to our production machines and connecting to the Clojure REPL, we can change code live and in production. We've built tooling over the last few years to automate this across the hundreds of machines we run at a particular time.

This has saved us a number of times in the past, turning many potential crises into mere blips. Here, this allowed us to disable queries, fix bugs and otherwise swap out and disable undesirable code when needed."
postmortem 
july 2015
NYSE Blames Trading Outage on Software Upgrade | Traders Magazine Online News
"On Tuesday evening, the NYSE began the rollout of a software release in preparation for the July 11 industry test of the upcoming SIP timestamp requirement. As is standard NYSE practice, the initial release was deployed on one trading unit. As customers began connecting after 7am on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release. It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release."
postmortem 
july 2015
Elevated latency and error rate for Google Compute Engine API - Google Groups
"However, a software bug in the GCE control plane interacted
poorly with this change and caused API requests directed to us-central1-a
to be rejected starting at 03:21 PDT. Retries and timeouts from the failed
calls caused increased load on other API backends, resulting in higher
latency for all GCE API calls. The API issues were resolved when Google
engineers identified the control plane issue and corrected it at 04:59 PDT,
with the backlog fully cleared by 05:12 PDT. "
google  postmortem 
may 2015
Code Climate Status - Inaccurate Analysis Results
"Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present.

Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls.

When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server"
postmortem 
may 2015
Stack Exchange Network Status — Outage Postmortem: January 6th, 2015
"With no way to get our main IP addresses accessible to most users, our options were to either fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing using them for DDoS mitigation, and have separate ISP links in the NY datacenter which are dedicated to traffic from them.

We decided to turn on CloudFlare, which caused a different problem - caused by our past selves."
postmortem 
april 2015
DripStat — Post mortem of yesterday's outage
"1. RackSpace had an outage in their Northern Virginia region.
2. We were getting DDOS’d.
3. The hypervisor Rackspace deployed our cloud server on was running into issues and would keep killing our Java process.
We were able to diagnose 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all 3 happened at the same time did not help issues either."
postmortem 
april 2015
Blog - Tideways
"On Wednesday 6:05 Europe/Berlin time our Elasticsearch cluster went down when it ran OutOfMemory and file descriptors. One node of the cluster did not recover from this error anymore and the other responded to queries with failure.

The workers processing the performance and trace event log data with Beanstalk message queue stopped.
"
postmortem 
april 2015
CopperEgg Status - Probe widgets not polling data
"the primary of a redundant pair of data servers for one of our customer data clusters locked up hard in Amazon. An operations engineer responded to a pager alert and ensured failover had worked as designed; there was a brief period of probe delay on that cluster from the initial failover but service was only briefly interrupted and then the system was working fine.

The failed server had to be hard rebooted and when it was, its data was corrupted and the server had to be rebuilt, and was then set up to resync its data with the live server. A manual error was made during the rebuild and replication was set up in an infinite loop. "
postmortem 
april 2015
Freckle Time Tracking Status - Freckle is down
"The underlying reason why nginx didn't start was that DNS was not working properly—nginx checks SSL certificates and it couldn't resolve one of the hosts needed to verify our main SSL certificate. We don't know why DNS didn't resolve, but it's likely that to the large number of booted servers in the Rackspace datacenter there was a temporary problem with DNS resolution requests.)"
postmortem 
april 2015
Dead Man's Snitch — Postmortem: March 6th, 2015
"On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized."
postmortem 
april 2015
Travis CI Status - Slow .com build processing
"Two runaway TLS connections inside our primary RabbitMQ node that were causing high CPU usage on that node. Once this was found, we deemed the high channel count a red herring and instead started work on the stuck connections."
postmortem 
april 2015
Travis CI Status - Slow .com build processing
"We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start up cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why our current cluster wasn’t starting we’d perform emergency maintenance and spin up a new cluster."
postmortem 
april 2015
Balanced Partial Outage Post Mortem - 2015-03-15
Balanced experienced a partial outage that affected 25% of card processing transactions between 8:40AM and 9:42AM this morning due to a degraded machine which was not correctly removed from the load balancer.

The core of the issue was in our secure vault system, which handles storage and retrieval of sensitive card data. One of the machines stopped sending messages, which caused some requests to be queued up but not processed, but our automated health checks did not flag the machine as unhealthy.
postmortem 
april 2015
Postmortem: Storify downtime on March 2nd (with image) · storifydev · Storify
"The problem was that we had one dropped index in our application code. This meant that whenever the new primary took the lead, the application asked to build that index. It was happening in the background, so it was kind of ok for the primary. But as soon as the primary finished, all the secondaries started building it in the foreground, which meant that our application couldn't reach MongoDB anymore."
postmortem 
march 2015
GCE instances are not reachable
"ROOT CAUSE [PRELIMINARY]

The internal software system which programs GCE’s virtual network for VM
egress traffic stopped issuing updated routing information. The cause of
this interruption is still under active investigation. Cached route
information provided a defense in depth against missing updates, but GCE VM
egress traffic started to be dropped as the cached routes expired. "
postmortem 
february 2015
FAQ about the recent FBI raid (Pinboard Blog)
"Why did the FBI take a Pinboard server?

I don't know. As best I can tell, the FBI was after someone else whose server was in physical proximity to ours. "
postmortem 
february 2015
A Note on Recent Downtime (Pinboard Blog)
"Of course I was wrong about that, and my web hosts pulled the plug early in the morning on the 2nd. Bookmarks and archives were not affected, but I neglected to do a final sync of notes (notes in pinboard are saved as files). This meant about 20 users who created or edited notes between December 31 and Jan 2 lost those notes."
postmortem 
february 2015
Recent Bounciness And When It Will Stop (Pinboard Blog)
"Over the past week there have been a number of outages, ranging in length from a few seconds to a couple of hours. Until recently, Pinboard has had a good track record of uptime, and like my users I find this turn of events distressing.

I'd like to share what I know so far about the problem, and what steps I'm taking to fix it."
postmortem 
february 2015
Outage This Morning (Pinboard Blog)
"The root cause of the outage appears to have been a disk error. The server entered a state where nothing could write to disk, crashing the database. We were able to reboot the server, but then had to wait a long time for it to repair the filesystem."
postmortem 
february 2015
Second Outage (Pinboard Blog)
"The main filesystem on our web server suddenly went into read-only mode, crashing the database. Once again I moved all services to the backup machine while the main server went through its long disk check."
postmortem 
february 2015
API Outage (Pinboard Blog)
"Pinboard servers came under DDOS attack today and the colocation facility (Datacate) has insisted on taking the affected IP addresses offline for 48 hours. In my mind, this accomplishes the goal of the denial of service attack, but I am just a simple web admin.

I've moved the main site to a secondary server and will do the same for the API in the morning (European time) when there's less chance of me screwing it up. Until then the API will be unreachable."
postmortem 
february 2015
A Bad Privacy Bug (Pinboard Blog)
"
tl;dr: because of poor input validation and a misdesigned schema, bookmarks could be saved in a way that made them look private to the ORM, but public to the database. Testing failed to catch the error because it was done from a non-standard account..

There are several changes I will make to prevent this class of problem from recurring:

Coerce all values to the expected types at the time they are saved to the database, rather than higher in the call stack.

Add assertions to the object loader so it complains to the error log if it sees unexpected values.

Add checks to the templating code to prevent public bookmarks showing up under any circumstances on certain public-only pages.

Run deployment tests from a non-privileged account."
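A minimal sketch of the first remediation item, with hypothetical field names (the post doesn't show Pinboard's actual code or schema): coerce values to an unambiguous type at the persistence boundary so the ORM and the database can no longer disagree about visibility.

```python
# Hypothetical illustration of "coerce all values to the expected types at
# the time they are saved" -- not Pinboard's actual code.
def coerce_bookmark_fields(fields: dict) -> dict:
    coerced = dict(fields)
    # Whatever arrives from the call stack ("0", "", None, True, ...) becomes
    # an unambiguous 0 or 1 before it reaches the database, so the ORM's idea
    # of "private" and the database's can no longer diverge.
    coerced["private"] = 1 if fields.get("private") in (True, 1, "1") else 0
    return coerced

assert coerce_bookmark_fields({"private": "0"})["private"] == 0
assert coerce_bookmark_fields({"private": True})["private"] == 1
```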
postmortem 
february 2015
Facebook & Instagram API servers down
Not much detail there.

Config change? Security?

See: https://blog.thousandeyes.com/facebook-outage-deep-dive/

Also: "Facebook Inc. on Tuesday denied being the victim of a hacking attack and said its site and photo-sharing app Instagram had suffered an outage after it introduced a configuration change."
postmortem 
february 2015
Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption | Microsoft Azure Blog
"1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.
[...]
2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."
postmortem 
december 2014
Incident Report - DDoS Attack - DNSimple Blog
"A new customer signed up for our service and brought in multiple domains that were already facing a DDoS attack. The customer had already tried at least 2 other providers before DNSimple. Once the domains were delegated to us, we began receiving the traffic from the DDoS.

DNSimple was not the target of the attack, nor were any of our other customers.

The volume of the attack was approximately 25gb/s sustained traffic across our networks, with around 50 million packets per second. In this case, the traffic was sufficient enough to overwhelm the 4 DDoS devices we had placed in our data centers after a previous attack (there is also a 5th device, but it was not yet online in our network)."
postmortem 
december 2014
craigslist DNS Outage | craigslist blog
"At approximately 5pm PST Sunday evening the craigslist domain name service (DNS) records maintained at one of our domain registrars were compromised, diverting users to various non-craigslist sites.

This issue has been corrected at the source, but many internet service providers (ISPs) cached the false DNS information for several hours, and some may still have incorrect information."
postmortem 
november 2014
Update on Azure Storage Service Interruption | Microsoft Azure Blog
" Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting."
postmortem 
november 2014
Anatomy of a Crushing (Pinboard Blog)
"The bad news was that it had never occurred to me to test the database under write load.

Now, I can see the beardos out there shaking their heads. But in my defense, heavy write loads seemed like the last thing Pinboard would ever face. It was my experience that people approached an online purchase of six dollars with the same deliberation and thoughtfulness they might bring to bear when buying a new car. Prospective users would hand-wring for weeks on Twitter and send us closely-worded, punctilious lists of questions before creating an account.

The idea that we might someday have to worry about write throughput never occurred to me. If it had, I would have thought it a symptom of nascent megalomania. "
postmortem 
november 2014
The network nightmare that ate my week
"I have come to the conclusion that so much in IPv6 design and implementation has been botched by protocol designers and vendors (both ours and others) that it is simply unsafe to run IPv6 on a production network except in very limited geographical circumstances and with very tight central administration of hosts."
postmortem 
november 2014
Inherent Complexity of the Cloud: VS Online Outage Postmortem
"it appears that the outage is at least due in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated.  Adding to the confusion (and possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”"
postmortem 
november 2014
Stack Exchange Network Status — Outage Post-Mortem: August 25th, 2014
"a misleading comment in the iptables configuration led us to make a harmful change. The change had the effect of preventing the HAProxy systems from being able to complete a connection to our IIS web servers - the response traffic for those connections (the SYN/ACK packet) was suddenly being blocked."
postmortem 
november 2014
Morgue: Helping Better Understand Events by Building a Post Mortem Tool - Bethany Macri on Vimeo
"My talk will be about why myself and another engineer built an internal post mortem tool called Morgue and the effect that the tool has had on our organization. Morgue formalized and systematized the way [my company] as a whole runs post mortems by focusing both the leader and the attendees of the post mortem on the most important aspects of resolving and understanding the event in a consistent way. In addition, the tool has facilitated relations between Ops and Engineers by increasing the awareness of Ops’ involvement in an outage and also by making all of the post mortems easily available to anyone in the organization. Lastly, all of our developers have access to the Morgue repository and have continued to develop features for the tool as improvements for conducting a post mortem have been suggested."
postmortem 
november 2014
Contributors Section of Supermarket Disabled – Postmortem Meeting | Chef Blog
"At Chef, we conduct postmortem meetings for outages and issues with the site and services. Since Supermarket belongs to the community, and we are developing the application in the open, we would like to invite you, the community, to listen in or participate in public postmortem meetings for these outages."
postmortem 
november 2014
Apologies for the downtime, but we're coming back stronger.
"The old prototype machine had our AWS API access key and secret key. Once the hacker gained access to the keys, he created an IAM user, and generated a key-pair. He was then able to run an instance inside our AWS account using these credentials, and mount one of our backup disks. This backup was of one of our component services, used for production environment, and contained a config file with our database password. He also whitelisted his IP on our database security group, which is the AWS firewall."
postmortem  security 
november 2014
eatabit.com | Blog
"Ok, so we have a bug in our code (in the form of the extra whitespace charater). So why did this start all of a sudden? That bug would have existed for at least 9 months on the firmware in the field...

Cowboy. Who is runnig the Cowboy server that is silently blocking our (admittedly malformed) requests? Heroku? AWS? Cursory Google searches do not allude to either party's use of Cowboy. We have requests submitted to both parties and are waiting to hear back... Stay tuned."
postmortem 
october 2014
Slack: This was not normal. Really.
"13% of Slack’s users were disconnected from Slack during this window.
Those users all immediately attempted reconnecting simultaneously.
The massive number of simultaneous reconnections demanded more database capacity than we had readily available, which caused cascading connection failures."
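The standard damper for that reconnection stampede is jittered exponential backoff on the client; a generic sketch, not Slack's actual client code:

```python
# Generic jittered-exponential-backoff sketch. Spreading retries over a
# randomized, growing window keeps clients that were disconnected together
# from reconnecting together.
import random
import time

def reconnect(connect, base_delay=1.0, max_delay=300.0):
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            # "Full jitter": pick anywhere in [0, capped exponential window].
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```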
postmortem 
october 2014
EC2 Maintenance Update II
I'd like to give you an update on the EC2 Maintenance announcement that I posted last week. Late yesterday (September 30th), we completed a reboot of less than 10% of the EC2 fleet to protect you from any security risks associated with the Xen Security Advisory (XSA-108).

This Xen Security Advisory was embargoed until a few minutes ago; we were obligated to keep all information about the issue confidential until it was published. The Xen community (in which we are active participants) has designed a two-stage disclosure process that operates as follows:
aws  postmortem  security 
october 2014
freistil IT » Post mortem: Network issues last week
"our monitoring system started at about 10:10 UTC to alert us of network packet loss levels of 50% to 100% with a number of servers and a lot of failing service checks, which most of the times is a symptom of connectivity problems. We recognized quickly that most of the servers with bad connectivity were located in Hetzner datacenter #10. We also received Twitter posts from Hetzner customers whose servers were running in DC #10. This suggested a problem with a central network component, most probably a router or distribution switch."
networks  postmortem 
september 2014
Post Mortem - City Cloud
"In a few minutes two nodes of two different replicating pairs experienced network failures. Still not a problem due to Gluster redundancy but clearly a sign of something not being right. While in discussions with Gluster to identify the cause one more node experiences network failure. This time in one of the pairs that already has a node offline. This causes all data located on that pair to become unavailable. "
postmortem 
september 2014
Fog Creek System Status: May 5-6 Network Maintenance Post-Mortem
"During the process of rearchitecting our switch fabric's spanning tree (moving from a more control-centric per-vlan spanning tree to a faster-failover rapid spanning tree, ironically to keep downtime to a minimum), we suddenly lost access to our equipment."
postmortem  networks 
september 2014
Google App Engine Issues With Datastore OverQuota Errors Beginning August 5th, 2014 - Google Groups
SUMMARY:
On Tuesday 5 August and Wednesday 6 August 2014, some billed applications incorrectly received quota exceeded errors for a small number of requests. We sincerely apologize if your application was affected.

DETAILED DESCRIPTION OF IMPACT:
Between Tuesday 5 August 11:39 and Wednesday 6 August 19:05 US/Pacific, some applications incorrectly received quota exceeded errors. The incident predominantly affected Datastore API calls. On Tuesday 5 August, 0.2% of applications using the Datastore received some incorrect quota exceeded errors. On Wednesday 6 August, 0.8% of applications using the Datastore received some incorrect quota exceeded errors. On Tuesday 5 August, 0.001% of Datastore API calls failed with quota exceeded for affected applications. On Wednesday 6 August, 0.0005% of Datastore API calls failed for affected applications.

ROOT CAUSE:
The root cause of this incident was a transient failure of the component that handles quota checking. The component has been corrected.

REMEDIATION AND PREVENTION:
The incident was resolved when the issue that caused the transient errors went away. To prevent a recurrence of similar incidents, we have enabled additional logging in the affected components so that we can more quickly diagnose and resolve similar issues.
google  postmortem 
august 2014
The Upload Outage of July 29, 2014 « Strava Engineering
"Although the range of signed integers goes from -2147483648 to 2147483647, only the positive portion of that range is available for auto-incrementing keys. At 15:10, the upper limit was hit and insertions into the table started failing."
postmortem 
august 2014
BBC Online Outage on Saturday 19th July 2014
"At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail.

The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail."

"At almost the same time we had a second problem."
postmortem 
july 2014
The npm Blog — 2014-01-28 Outage Postmortem
"While making a change to simplify the Varnish VCL config on Fastly, we added a bug that caused all requests to go to Manta, including those that should have gone to CouchDB.

Since Manta doesn’t know how to handle requests like /pkgname, these all returned 403 Forbidden responses.

Because Fastly is configured to not cache error codes, this proliferation of 403 responses led to a thundering herd which took a bit of time to get under control.

With the help of the Fastly support team, we have identified the root cause and it is now well understood"
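One generic reading of the lesson is negative caching: remember error responses briefly so a misroute or bad config can't turn every request into an origin hit. A sketch of the idea, not the actual Fastly/VCL change:

```python
# Generic negative-caching sketch -- illustrative only. Briefly caching error
# responses keeps a burst of identical failures from all reaching the origin
# (the "thundering herd" described above).
import time

_error_cache = {}  # path -> (status_code, expires_at)

def fetch(path, origin_get, error_ttl_seconds=5.0):
    cached = _error_cache.get(path)
    if cached and cached[1] > time.time():
        return cached[0]                 # serve the remembered 4xx/5xx cheaply
    status = origin_get(path)
    if status >= 400:
        _error_cache[path] = (status, time.time() + error_ttl_seconds)
    return status
```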
postmortem 
july 2014
NY1 (Equinix) Power Issue Postmortem | DigitalOcean
2013-11-25
"When the redundancy failed and another UPS did not take over, it essentially meant that power was cut off to equipment. UPS7 then hard rebooted and was back online, which then resumed the flow of power to equipment; however, there was an interruption of several minutes in between."
postmortem 
july 2014
Stack Exchange Network Status — 2013-10-13 Outage PostMortem
" A further loss of communication between the 2 nodes while Oregon is offline results in a quorum loss from the point of view of both members.

To prevent a split-brain situation, the nodes enter an effective offline state when a loss of quorum occurs. When windows clustering observes a quorum loss, it initiates a state change of orphaned SQL resources (the availability groups the databases affected belong to). In the case of NY-SQL03 (the primary before the event), the databases were both not primary and not available since the AlwaysOn Availability Group was offline to prevent split brain"
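The quorum arithmetic behind that behavior: a node keeps serving only while it can see a strict majority of the cluster, so with Oregon already offline a further split leaves each NY node seeing only one of three.

```python
# Majority-quorum check for a three-node cluster (the scenario in the quote).
cluster_size = 3

def has_quorum(visible_nodes):          # count includes the node itself
    return visible_nodes > cluster_size // 2

print(has_quorum(3))   # all healthy                 -> True
print(has_quorum(2))   # Oregon offline              -> True
print(has_quorum(1))   # NY nodes also lose contact  -> False: both go offline
```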
postmortem 
july 2014
NY2 Network Upgrade Postmortem | DigitalOcean
2013
"On October 25th we observed a network interruption whereby the two core routers began flapping and their redundant protocol was not allowing either one to take over as the active device and push traffic out to our providers."
postmortem 
july 2014
What happened yesterday and what we are doing about it ‹ The Mailgun Blog
2013-09-20

"in this particular case it triggered a bug in Mailgun that slowed down our Riak clusters by overloading them with unnecessary requests and consuming excessive storage."
postmortem 
july 2014
2013-09-17 Outage Postmortem | AppNexus Tech Blog
" the data update that caused the problem was a delete on a rarely-changed in-memory object. The result of the processing of the update is to unlink the deleted object from other objects, and schedule the object’s memory for deletion at what is expected to be a safe time in the future. This future time is a time when any thread that could have been using the old version at the time of update would no longer be using it. There was a bug in the code that deleted the object twice, and when it finally executed, it caused the crash. "
postmortem 
july 2014