Requests to Google Cloud Storage (GCS) JSON API experienced elevated error rates for a period of 3 hours and 15 minutes
"A low-level software defect in an internal API service that handles GCS JSON requests caused infrequent memory-related process terminations. These process terminations increased as a result of a large volume in requests to the GCS Transfer Service, which uses the same internal API service as the GCS JSON API. This caused an increased rate of 503 responses for GCS JSON API requests for 3.25 hours."
postmortem  google 
10 days ago
What did OVH learn from 24-hour outage? Water and servers do not mix
Including an article because the original incident log is in French.
postmortem 
10 days ago
Google Cloud Status Dashboard
"At the time of incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic."
postmortem  google 
5 weeks ago
Update on the April 5th, 2017 Outage
"Within three minutes of the initial alerts, we discovered that our primary database had been deleted. Four minutes later we commenced the recovery process, using one of our time-delayed database replicas. Over the next four hours, we copied and restored the data to our primary and secondary replicas. The duration of the outage was due to the time it took to copy the data between the replicas and restore it into an active server."
postmortem 
april 2017
Google Cloud Status Dashboard
"On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes."
postmortem 
february 2017
The Travis CI Blog: The day we deleted our VM images
"In addition, our cleanup service had been briefly disabled to troubleshooting a potential race condition. Then we turned the automated cleanup back on. The service had a default hard coded amount of how many image names to query from our internal image catalog and it was set to 100.

When we started the cleanup service, the list of 100 image names, sorted by newest first, did not include our stable images, which were the oldest, did not get included in the results. Our cleanup service then promptly started deleting the older images from GCE, because its view of the world told it that those older images where no longer in use, i.e it looked like they were not in our catalog and all of our stable images got irrevocably deleted.

This immediately stopped builds from running. "
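
A rough sketch of the failure mode described above (the names and the limit of 100 are stand-ins, not Travis CI's actual code): a catalog query capped at N newest image names makes the oldest, still-in-use images look deletable.

DEFAULT_LIMIT = 100  # hard-coded default, as described in the post

def list_catalog_images(catalog, limit=DEFAULT_LIMIT):
    # Returns at most `limit` image names, newest first -- the truncated view.
    newest_first = sorted(catalog, key=lambda img: img["created"], reverse=True)
    return {img["name"] for img in newest_first[:limit]}

def delete_gce_image(name):
    print("deleting", name)  # stand-in for the real (irreversible) GCE call

def cleanup(gce_image_names, catalog):
    in_use = list_catalog_images(catalog)
    for name in gce_image_names:
        if name not in in_use:       # the oldest (stable) images look unused here
            delete_gce_image(name)

Paginating the catalog query, or refusing to delete more than a small fraction of images in a single run, guards against this class of bug.
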
postmortem 
september 2016
Google Cloud Status Dashboard
"While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced.

Additionally this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter."
postmortem  google 
august 2016
Stack Exchange Network Status — Outage Postmortem - July 20, 2016
"The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the homepage list, and that caused the expensive regular expression to be called on each home page view. "
postmortem 
july 2016
Summary of the AWS Service Event in the Sydney Region
"The service disruption primarily affected EC2 instances and their associated Elastic Block Store (“EBS”) volumes running in a single Availability Zone. "
aws  postmortem 
june 2016
Crates.io is down [fixed] - The Rust Programming Language Forum
OK, a quick post-mortem:

At 9:45 AM PST I got a ping that crates.io was down and started looking into it. Connections via the website and from the 'cargo' command were timing out. From Heroku's logs it looks like the timeouts began around 9:10 AM.

From looking at the logs it's clear that connections were timing out, and that a number of Postgres queries were blocked updating the download statistics. These queries were occupying all available connections.

After killing outstanding queries the site is working again. It's not clear yet what the original cause was.
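
A sketch of that manual remediation step on a recent PostgreSQL, assuming psycopg2; the DSN and the five-minute threshold are made up.

import psycopg2

conn = psycopg2.connect("dbname=cratesio")   # illustrative DSN
conn.autocommit = True
with conn.cursor() as cur:
    # List long-running, non-idle queries, then terminate them -- roughly the
    # "killing outstanding queries" step described above.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '5 minutes'
        ORDER BY runtime DESC
    """)
    for pid, runtime, query in cur.fetchall():
        print(pid, runtime, query)
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
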
postmortem 
june 2016
SNOW Status - Elevated Errors on SNOW Backend
"Todays outage was because of a mis-configuration in our Redis cluster, where we didn't automatically prune stale cache keys."
postmortem 
may 2016
Postmortem: A tale of how Discourse almost took us out.
"TL;DR

This morning we noticed that Sidekiq had 13K jobs, it quickly escalated to 14K and then 17K and kept growing, for reasons we do not understand yet. We know this was initially caused by a large backlog of emails that needed to be sent because of exceptions that were occurring due to this bug, this is when things got interesting, and got wildly out of control."
postmortem 
may 2016
Elastic Cloud Outage: Root Cause and Impact Analysis | Elastic
"What happened behind the scenes was that our Apache ZooKeeper cluster lost quorum, for the first time in more than three years. After recent maintenance, a heap space misconfiguration on the new nodes resulted in high memory pressure on the ZooKeeper quorum nodes, causing ZooKeeper to spend almost all CPU garbage collecting. When an auxiliary service that watches a lot of the ZooKeeper database reconnected, this threw ZooKeeper over the top, which in turn caused other services to reconnect – resulting in a thundering herd effect that exacerbated the problem."
postmortem 
may 2016
Connectivity issues with Cloud VPN in asia-east1 - Google Groups
"On Monday, 11 April, 2016, Google Compute Engine instances in all regions
lost external connectivity for a total of 18 minutes"
google  postmortem 
april 2016
Gliffy Online System Outage: Gliffy Support Desk
"On working to resolve the issue, an administrator accidentally deleted the production database."
postmortem 
march 2016
What Happened: Adobe Creative Cloud Update Bug
"Wednesday night we started getting support tickets relating to the .bzvol file being removed from computers. Normally the pop-up sends people to this (which we have since edited to highlight this current issue): bzvol webpage. The problem was, the folks on Mac kept reporting that our fix did not work and that they kept getting the error. Our support team contacted our lead Mac developer for help trying to troubleshoot and figure out what was causing this surge."
postmortem 
february 2016
January 28th Incident Report · GitHub
"Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. "

"We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code."
postmortem 
february 2016
Linode Blog » The Twelve Days of Crisis – A Retrospective on Linode’s Holiday DDoS Attacks
"Lesson one: don’t depend on middlemen
Lesson two: absorb larger attacks
Lesson three: let customers know what’s happening"
postmortem  networks 
january 2016
Outage postmortem (2015-12-17 UTC) : Stripe: Help & Support
"the retry feedback loop and associated performance degradation prevented us from accepting new API requests at all"
postmortem 
december 2015
Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness
"To add promotions in the same Elasticsearch, the CMS team decided to add a new doctype/mapping in Elasticsearch called promotions. The feature was tested locally by the developer and it worked fine. No issues were caught anywhere during testing and the code was pushed to production.

The feature came into use when our content team started curating the content past midnight for the sale starting the next day. Once they added the content and started the re-indexing procedure (which is a manual button click), our consumer app stopped working. As soon as the team started indexing the content around 4:30 AM, our app stopped working. Our search queries started returning NumberFormatException (add more details here) on our price field."
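
The write-up doesn't show the mapping, but the failure class is easy to sketch: with dynamic mapping, the first document to use a field name fixes its type, and a price that later arrives as a string breaks numeric queries on the same field. Index, type, and field names below are made up; the behavior assumes an older multi-type Elasticsearch reachable on localhost.

import requests

ES = "http://localhost:9200"

# Products are indexed first, so dynamic mapping infers `price` as a number.
requests.put(ES + "/catalog/products/1", json={"name": "shoe", "price": 999})

# A new "promotions" doctype reuses the same field name with a string value.
# Older multi-type Elasticsearch accepts this and poisons the shared field;
# current versions reject the conflicting mapping or document instead.
requests.put(ES + "/catalog/promotions/1", json={"name": "sale", "price": "50% off"})

# Numeric queries against `price` now fail, surfacing the kind of
# NumberFormatException described in the post.
r = requests.post(ES + "/catalog/_search",
                  json={"query": {"range": {"price": {"gte": 100}}}})
print(r.status_code, r.text[:200])
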
postmortem 
december 2015
400 errors when trying to create an external (L2) Load Balancer for GCE/GKE services - Google Groups
"a minor update to the Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing."
postmortem  google 
december 2015
Network Connectivity and Latency Issues in Europe
"On Tuesday, 10 November 2015, outbound traffic going through one of our
European routers from both Google Compute Engine and Google App Engine
experienced high latency for a duration of 6h43m minutes. If your service
or application was affected, we apologize — this is not the level of
quality and reliability we strive to offer you, and we have taken and are
taking immediate steps to improve the platform’s performance and
availability. "
postmortem  google  networks 
november 2015
Spreedly Status - 503 Service Unavailable
"The issue here was an unbounded queue. We'll address that by leveraging rsyslog's advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off it's queue, so we are going to find a way to be alerted as soon as that is the case."
postmortem 
october 2015
CircleCI Status - Load balancer misconfiguration
"A load balancer misconfiguration briefly prevented us from serving content. We caught the problem and are fixing"
postmortem 
october 2015
Voicebase Status - High API latency caused by multiple issues in AWS including SQS API errors
"The root cause appears to be a number of problem with AWS, including a very high failure rate with the Amazon SQS API, and with the Amazon DynamoDB service"
postmortem 
october 2015
Codeship Status - Intermittent Website Availability Issues
"Codeship's DNS provider has implemented new hardware and networking to overcome their ongoing denial of service attack. They report some name servers are coming back online, but they are still dealing with a partial DNS outage"
postmortem 
october 2015
Linode Status - Network Issues within London Datacenter
"An older generation switch was identified that had a malfunctioning transceiver module. Under normal conditions, the full 1+1 hardware redundancy within the London network fabric would have isolated this failure without any functional impact. However, this transceiver module had not failed completely; rather, the module was experiencing severe voltage fluctuation, causing it to 'flap' in an erratic manner."
postmortem  networks 
october 2015
Keen IO Status - Query Service Errors
"Our internal load balancer (HAProxy) got stuck in an ornery state, and it took us a while to realize it was the load balancer instead of the actual services causing the errors."
postmortem 
october 2015
AWeber Status - Isolated Malware Incident
"We have identified an isolated incident of a website that uses AWeber has been infected by malware. As a response, Google has marked all links from AWeber customers using click tracking (redirecting through clicks.aweber.com) as potential malware"
postmortem 
october 2015
Chargify Status - 4 minute outage
"For approximately 4 minutes we just experienced an unexpected outage due to a failure of thedatabase load balancer to recover from planned maintenance"
postmortem 
october 2015
LiveChat Status - Connection issues
"identified the issue"

"working on a fix"

"fix deployed"

"connection issues caused by memory overload"
postmortem 
october 2015
StatusPage.io Status - Site under heavy load. Pages may be slow or unresponsive.
META

"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
postmortem 
october 2015
Switch Status - Inbound Calling Experienced Intermittent Issues
"5 minute issue was caused by a major traffic spike that impacted some inbound call attempts. Further investigation into the nature of these calls is ongoing"
postmortem 
october 2015
Greenhouse Status - AWS US-East-1 Partial Outage
"Our cloud hosting provider (AWS) is currently experiencing significant service degradation. The knock-on effect is decreased response times and reliability for the Greenhouse application, API, and job boards. "
postmortem 
october 2015
VictorOps Status - Isolated Service Disruption
"our database cluster encountered an error condition that caused it to stop processing queries for approximately 30 minutes. During that time, our WebUI and mobile client access was unavailable, and we were unable to process and deliver alerts. We have identified configuration settings in the cluster that will prevent a recurrence of the error condition"
postmortem 
october 2015
Greenhouse Status - Outage
"a new release triggered a doubling of connections to our database--this caused the database to become saturated with connections, causing approximately 10 minutes of system-wide downtime"
postmortem 
october 2015
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem  aws  networks 
october 2015
June 15th Outage — HipChat Blog
" the recent Mac client release which had the much anticipated “multiple account” feature also had a subtle reconnection bug that only manifested under very high load"
postmortem 
october 2015
Mikhail Panchenko [discussion of a long past Flickr problem]
" Inserting data into RDMS indexes is relatively expensive, and usually involves at least some locking. Note that dequeueing jobs also involves an index update, so even marking jobs as in progress or deleting on completion runs into the same locks. So now you have contention from a bunch of producers on a single resource, the updates to which are getting more and more expensive and time consuming. Before long, you're spending more time updating the job index than you are actually performing the jobs. The "queue" essentially fails to perform one of its very basic functions."
postmortem 
october 2015
Kafkapocalypse: a postmortem on our service outage | Parse.ly
" Kafka is so efficient about its resource disk and CPU consumption, that we were running Kafka brokers on relatively modest Amazon EC2 nodes that did not have particularly high network capabilities. At some point, we were hitting operating system network limits and the brokers would simply become unavailable. These limits were probably enforced by Linux, Amazon’s Xen hypervisor, the host machine’s network hardware, or some combination.

The real problem here isn’t failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes. It was a classic Cascading Failure."
postmortem 
october 2015
Travis CI Status - High queue times on OSX builds (.com and .org)
"When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200."
postmortem 
october 2015
Post-mortem -- S3 Outage | Status.io Blog
"Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken."
postmortem 
october 2015
Incident documentation/20150814-MediaWiki - Wikitech
"This bug was not caught on the beta cluster, because the code-path is exercised when converting text from one language variant to another, which does not happen frequently in that environment."
postmortem 
october 2015
GitLab.com outage on 2015-09-01 | GitLab
"We are still in the dark about the cause of the NFS slowdowns. We see no spikes of any kind of web requests around the slowdowns. The backend server only shows the ext4 errors mentioned above, which do not coincide with the NFS trouble, and no NFS error messages."
postmortem 
october 2015
Opbeat Status - We're experiencing another major database cluster connectivity issue
"During the master database outages, opbeat.com as well as our intake was unavailable."

" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
postmortem 
october 2015
Flying Circus Status - VM performance and stability issues
"After checking the virtualisation servers we saw that many of them had too many virtual machines assigned to them, consuming much more memory than the host actually had. "
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
postmortem 
october 2015
Faithlife Status - Leaf Switch Outage
"we lost all connectivity to the switch. At that point, it failed to continue passing traffic. This should not have been a problem since compute nodes have a link to each switch. The compute nodes should have recognized that link in the aggregation as “down” and discontinued use of that link. However, our compute nodes continued sending traffic to that link. To verify the compute nodes didn’t incorrectly see the downed link as “up”, we physically disconnected the links to the degraded switch. Unfortunately, the compute nodes still attempted to send traffic over that link. "
postmortem  networks 
october 2015
Outage report: 5 September 2015 - PythonAnywhere News
"t looked like a massive earlier spike in read activity across the filesystem had put some system processes into a strange state. User file storage is shared from the file storage system over to the servers where people's websites and consoles actually run over NFS. And the storage itself on the NFS servers uses DRBD for replication. As far as we could determine, a proportion (at least half) of the NFS processes were hanging, and the DRBD processes were running inexplicably slowly. This was causing access to file storage to frequently be slow, and to occasionally fail, for all servers using the file storage."
postmortem 
october 2015
Why did Stack Overflow time out? - Meta Stack Exchange
"A cascading failure of the firewall to apply properly leading keepalived VRRP unable to communicate properly made both load balancers think neither had a peer. This results in a bit of swapping as they fight for ARP. When NY-LB06 "won" that fight, the second failure came into play: the firewall module did not finish updating on the first puppet run meaning the server was fully ready to serve traffic (from a Layer 7/HAProxy standpoint), but was not accepting TCP connections from anyone yet."
postmortem 
october 2015
SNworks Status - Connectivity Issues
"The cache server is super important. Without it, our web servers can serve 10s of requests a second. With it, they can serve 1000s of requests per second.

Large news events + uncached pages = servers not being able to handle traffic demands."
postmortem 
october 2015
Customer.io Status - Extended outage since 11:41 pm EDT
" After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up.

An error prevented the cluster from returning to "operational"."
postmortem 
october 2015
Partial Photon Cloud outage on 04/30/2015 | Blog | Photon: Multiplayer Made Simple
"The root cause was an outage of Microsoft Azure’s Network Infrastructure"
postmortem 
october 2015
Simplero Status - Down right now
"The root cause was a disk that filled up on a secondary database server.

it happened while I was at an event at the Empire State Building. At first I tried fixing it via my iPhone, but I had to realize that I couldn't, and instead hopped on a train back home to my laptop."
postmortem 
october 2015
Divshot Status - Serious Platform Outage
"This morning around 7:20am Pacific time, several platform EC2 instances began failing and our load balancer began returning 503 errors. Ordinarily our scaling configuration would terminate and replace unhealthy instances, but for an as-yet-undetermined reason all instances became unhealthy and were not replaced. This caused a widespread outage that lasted for nearly two hours."
postmortem 
october 2015
More Details on Today's Outage [2010]
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid."
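
A sketch of the feedback loop as described: the repair path assumes the persistent store is always right, so a bad stored value makes every client treat its cache as broken and re-query the store on each read. Function and variable names are illustrative.

def get_config(key, cache, store, is_valid):
    # The automated repair path: anything that looks invalid in the cache is
    # replaced from the persistent store. Safe for a transient cache problem;
    # a stampede when the stored value is the one that is invalid.
    value = cache.get(key)
    if value is not None and is_valid(value):
        return value
    value = store.get(key)   # every caller now hits the persistent store
    cache[key] = value       # and re-caches a value that may still be invalid
    return value
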
postmortem 
october 2015
OHIO: Office of Information Technology |Anatomy of a network outage: Aug 20 - Sep 4, 2015
"If an IT system has a hidden weakness, fall opening will expose it.

This year, the 17,000 additional devices that students and staff brought with them to campus did just that, overwhelming our core network for over two weeks."
postmortem  networks 
october 2015
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
"Our automated tooling actually filed it as two separate change requests: one to add a new database index and a second to remove the old database index. Both of these change requests were reflected in our dashboard for database operators, intermingled with many other alerts. The dashboard did not indicate that the deletion request depended on the successful completion of the addition request"
postmortem 
october 2015
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
"But, on Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests"

"With a larger size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers. We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
postmortem  aws 
september 2015
Postmortem for July 27 outage of the Manta service - Blog - Joyent
"Clients experienced very high latency, and ultimately received 500-level errors in response to about 22% of all types of requests, including PUT, GET, and DELETE for both objects and directories. At peak, the error rate approached 27% of all requests, and for most of the window the error rate varied between 19 and 23%."
postmortem 
august 2015
Travis CI Status - Elevated wait times and timeouts for OSX builds
"We noticed an error pointing towards build VM boot timeouts at 23:21 UTC on the 20th. After discussion with our infrastructure provider, it was shown to us that our SAN (a NetApp appliance) was being overloaded due to a spike in disk operations per second."
postmortem 
july 2015
CircleCI Status - DB performance issue
"The degradation in DB performance was a special kind of non-linear, going from "everything is fine" to "fully unresponsive" within 2 minutes. Symptoms included a long list of queued builds, and each query taking a massive amount of time to run, along side many queries timing out."

"CircleCI is written in Clojure, a form of Lisp. One of the major advantages of this type of language is that you can recompile code live at run-time. We typically use Immutable Architecture, in which we deploy pre-baked machine images to put out new code. This works well for keeping the system in a clean state, as part of a continuous delivery model. Unfortunately, when things are on fire, it doesn't allow us to move as quickly as we would like.

This is where Clojure's live patching comes in. By connecting directly to our production machines and connecting to the Clojure REPL, we can change code live and in production. We've built tooling over the last few years to automate this across the hundreds of machines we run at a particular time.

This has saved us a number of times in the past, turning many potential crises into mere blips. Here, this allowed us to disable queries, fix bugs and otherwise swap out and disable undesirable code when needed."
postmortem 
july 2015
NYSE Blames Trading Outage on Software Upgrade | Traders Magazine Online News
"On Tuesday evening, the NYSE began the rollout of a software release in preparation for the July 11 industry test of the upcoming SIP timestamp requirement. As is standard NYSE practice, the initial release was deployed on one trading unit. As customers began connecting after 7am on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release. It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release."
postmortem 
july 2015
Elevated latency and error rate for Google Compute Engine API - Google Groups
"However, a software bug in the GCE control plane interacted
poorly with this change and caused API requests directed to us-central1-a
to be rejected starting at 03:21 PDT. Retries and timeouts from the failed
calls caused increased load on other API backends, resulting in higher
latency for all GCE API calls. The API issues were resolved when Google
engineers identified the control plane issue and corrected it at 04:59 PDT,
with the backlog fully cleared by 05:12 PDT. "
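
The retry amplification called out here is usually blunted client-side with capped exponential backoff plus jitter; a generic sketch, not something the report prescribes:

import random
import time

def call_with_backoff(do_request, max_attempts=6, base=0.5, cap=30.0):
    # Capped exponential backoff with full jitter, so a fleet of clients does
    # not retry in lockstep and multiply load on an already-struggling backend.
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
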
google  postmortem 
may 2015
Code Climate Status - Inaccurate Analysis Results
"Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present.

Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls.

When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server"
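
A sketch of the pitfall as described: caching an object that embeds its Git service URL means a repository migration leaves stale endpoints in the cache; caching only stable identifiers and resolving the endpoint at call time avoids it. All names below are illustrative.

# Problematic shape (roughly what the post describes): the cached analysis
# object carries the Git RPC endpoint, so it survives a repository migration.
stale_entry = {"repo_id": 42, "blob_cache": "...", "git_service_url": "http://git-03.internal"}

# Safer shape: cache only stable identifiers and data; resolve the endpoint
# from the current routing table at call time so migrations take effect.
def git_service_url(repo_id, routing_table):
    return routing_table[repo_id]

cache_entry = {"repo_id": 42, "blob_cache": "..."}   # no endpoint baked in
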
postmortem 
may 2015
Stack Exchange Network Status — Outage Postmortem: January 6th, 2015
"With no way to get our main IP addresses accessible to most users, our options were to either fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing using them for DDoS mitigation, and have separate ISP links in the NY datacenter which are dedicated to traffic from them.

We decided to turn on CloudFlare, which caused a different problem - caused by our past selves."
postmortem 
april 2015
DripStat — Post mortem of yesterday's outage
"1. RackSpace had an outage in their Northern Virginia region.
2. We were getting DDOS’d.
3. The hypervisor Rackspace deployed our cloud server on was running into issues and would keep killing our Java process.
We were able to diagnose 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all 3 happened at the same time did not help issues either."
postmortem 
april 2015
Blog - Tideways
"On Wednesday 6:05 Europe/Berlin time our Elasticsearch cluster went down when it ran OutOfMemory and file descriptors. One node of the cluster did not recover from this error anymore and the other responded to queries with failure.

The workers processing the performance and trace event log data with Beanstalk message queue stopped.
"
postmortem 
april 2015