peakscale + postmortem   342

Incident 1290 | Heroku Status
"The routing layer that directs traffic from the internet to customer dynos has an extremely slow memory leak that has existed for some time. Typically, this memory leak has been mitigated by regular deploys of the router. Recently, however, deploys to this component have been less frequent. At some point, we crossed a tipping point and the memory leak was no longer automatically remediated by ongoing deployments.

This memory leak caused the processes in the routing layer to be killed and restarted. During this period of the processes restarting, they were unable to receive traffic and connections received EOF responses."
postmortem 
5 weeks ago by peakscale
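The dynamic described above, a slow leak that only matters once deploys become infrequent, can be made concrete with a tiny back-of-the-envelope model (all numbers below are hypothetical, not Heroku's):

```python
# Hypothetical numbers: a router process that leaks slowly and is restarted on
# every deploy. The leak only causes kills once the deploy interval exceeds the
# time the leak needs to exhaust memory.
LEAK_MB_PER_DAY = 8        # assumed leak rate
MEMORY_LIMIT_MB = 512      # assumed per-process limit

def days_until_oom(leak_mb_per_day=LEAK_MB_PER_DAY, limit_mb=MEMORY_LIMIT_MB):
    return limit_mb / leak_mb_per_day          # 64 days with these numbers

def leak_is_masked(deploy_interval_days):
    """True if a deploy (and therefore a restart) lands before the process OOMs."""
    return deploy_interval_days < days_until_oom()

print(leak_is_masked(14))   # True: frequent deploys keep remediating the leak
print(leak_is_masked(90))   # False: the tipping point has been crossed
```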
PagerDuty Status - Delayed Notifications
"Degraded performance of one of our Cassandra database clusters caused delays outside tolerance limits to the delivery of notifications and the dispatching of webhooks. The degradation in performance was triggered during the replacement of a failed virtual machine in the cluster. This maintenance was unplanned, as the failure of the host was unexpected.

The procedure used to replace the failed node triggered a chain reaction of load on other nodes in the cluster, which hampered this cluster’s ability to do its primary job of processing notifications."
postmortem 
5 weeks ago by peakscale
Google Cloud Networking Incident #17002
"Any GCE instance that was live-migrated between 13:56 PDT on Tuesday 29 August 2017 and 08:32 on Wednesday 30 August 2017 became unreachable via Google Cloud Network or Internal Load Balancing until between 08:56 and 14:18 (for regions other than us-central1) or 20:16 (for us-central1) on Wednesday. See https://goo.gl/NjqQ31 for a visual representation of the cumulative number of instances live-migrated over time.

Our internal investigation shows that, at peak, 2% of GCE instances were affected by the issue."
google  postmortem 
8 weeks ago by peakscale
Postmortem: 2017-04-11 Firewall Outage | Circonus
"We use a pair of firewall devices in an active/passive configuration with automatic failover should one of the devices become unresponsive. The firewall device in question went down, and automatic failover did not trigger for an unknown reason (we are still investigating). When we realized the problem, we killed off the bad firewall device, causing the secondary to promote itself to master and service to be restored."
postmortem 
august 2017 by peakscale
Requests to Google Cloud Storage (GCS) JSON API experienced elevated error rates for a period of 3 hours and 15 minutes
"A low-level software defect in an internal API service that handles GCS JSON requests caused infrequent memory-related process terminations. These process terminations increased as a result of a large volume in requests to the GCS Transfer Service, which uses the same internal API service as the GCS JSON API. This caused an increased rate of 503 responses for GCS JSON API requests for 3.25 hours."
postmortem  google 
july 2017 by peakscale
What did OVH learn from 24-hour outage? Water and servers do not mix
Including an article because the original incident log is in French.
postmortem 
july 2017 by peakscale
Google Cloud Status Dashboard
"At the time of incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic."
postmortem  google 
june 2017 by peakscale
Update on the April 5th, 2017 Outage
"Within three minutes of the initial alerts, we discovered that our primary database had been deleted. Four minutes later we commenced the recovery process, using one of our time-delayed database replicas. Over the next four hours, we copied and restored the data to our primary and secondary replicas. The duration of the outage was due to the time it took to copy the data between the replicas and restore it into an active server."
postmortem 
april 2017 by peakscale
Google Cloud Status Dashboard
"On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes."
postmortem 
february 2017 by peakscale
The Travis CI Blog: The day we deleted our VM images
"In addition, our cleanup service had been briefly disabled to troubleshooting a potential race condition. Then we turned the automated cleanup back on. The service had a default hard coded amount of how many image names to query from our internal image catalog and it was set to 100.

When we started the cleanup service, the list of 100 image names, sorted by newest first, did not include our stable images, which were the oldest. Our cleanup service then promptly started deleting the older images from GCE, because its view of the world told it that those older images were no longer in use, i.e. it looked like they were not in our catalog, and all of our stable images got irrevocably deleted.

This immediately stopped builds from running. "
postmortem 
september 2016 by peakscale
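A sketch of the failure mode as described, with hypothetical data structures (not Travis CI's actual code): a cleanup job that equates "not among the newest 100 catalog entries" with "not in use" will delete the oldest, most stable images as soon as the catalog grows past the hard-coded limit.

```python
# Hypothetical reconstruction of the buggy logic described in the postmortem.
QUERY_LIMIT = 100  # the hard-coded default mentioned above

def images_considered_in_use(catalog):
    """catalog: list of (name, created_at) tuples from the internal image catalog."""
    newest_first = sorted(catalog, key=lambda img: img[1], reverse=True)
    return {name for name, _ in newest_first[:QUERY_LIMIT]}  # truncated view of the world

def select_images_to_delete(gce_image_names, catalog):
    in_use = images_considered_in_use(catalog)
    # Bug: anything missing from the truncated view is treated as unused, which
    # includes the oldest (stable) images once the catalog exceeds 100 entries.
    return [name for name in gce_image_names if name not in in_use]
```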
Google Cloud Status Dashboard
"While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced.

Additionally this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter."
postmortem  google 
august 2016 by peakscale
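The "highly specific announcements took precedence" behaviour is ordinary longest-prefix matching: a more specific route wins regardless of where it is announced from. A small illustration with made-up prefixes and labels:

```python
import ipaddress

# Made-up prefixes and labels, purely to illustrate longest-prefix matching.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "normal path to Google's network",
    ipaddress.ip_network("203.0.113.64/26"): "single point of presence in the southwestern US",
}

def best_route(destination):
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]  # most specific wins

print(best_route("203.0.113.70"))   # single point of presence in the southwestern US
print(best_route("203.0.113.10"))   # normal path to Google's network
```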
Stack Exchange Network Status — Outage Postmortem - July 20, 2016
"The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the homepage list, and that caused the expensive regular expression to be called on each home page view. "
postmortem 
july 2016 by peakscale
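The underlying mechanism is catastrophic backtracking: a regular expression with ambiguous repetition can take exponential time on an input that almost matches. A minimal, generic example (this is not Stack Exchange's actual pattern):

```python
import re
import time

# A classic backtracking-prone pattern, not the one from the incident.
pattern = re.compile(r'^(a+)+$')

malformed = 'a' * 24 + '!'   # almost matches; each extra 'a' roughly doubles the work
start = time.time()
pattern.match(malformed)     # the engine tries ~2^23 ways to split the run of a's
print(f"match attempt took {time.time() - start:.1f}s on a {len(malformed)}-char input")
```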
Summary of the AWS Service Event in the Sydney Region
"The service disruption primarily affected EC2 instances and their associated Elastic Block Store (“EBS”) volumes running in a single Availability Zone. "
aws  postmortem 
june 2016 by peakscale
Crates.io is down [fixed] - The Rust Programming Language Forum
OK, a quick post-mortem:

At 9:45 AM PST I got a ping that crates.io was down and started looking into it. Connections via the website and from the 'cargo' command were timing out. From Heroku's logs it looks like the timeouts began around 9:10 AM.

From looking at logs it's clear that connections were timing out, and that a number of postgres queries were blocked updating the download statistics. These queries were occupying all available connections.

After killing outstanding queries the site is working again. It's not clear yet what the original cause was.
postmortem 
june 2016 by peakscale
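One common guard against a handful of blocked queries pinning every available connection (a general technique, not necessarily what crates.io adopted) is a server-side statement timeout, so a query stuck behind a lock is cancelled instead of holding its connection indefinitely. A psycopg2 sketch with assumed connection details and a hypothetical table:

```python
import psycopg2

# Assumed DSN; the table and column names below are hypothetical.
conn = psycopg2.connect("dbname=crates user=app")
with conn.cursor() as cur:
    # Cancel any statement that runs longer than 5 seconds rather than letting
    # it sit on a lock and occupy the connection.
    cur.execute("SET statement_timeout = '5s'")
    cur.execute("UPDATE crates SET downloads = downloads + 1 WHERE id = %s", (42,))
conn.commit()
```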
SNOW Status - Elevated Errors on SNOW Backend
"Todays outage was because of a mis-configuration in our Redis cluster, where we didn't automatically prune stale cache keys."
postmortem 
may 2016 by peakscale
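Automatically pruning stale cache keys usually just means writing them with a TTL so Redis expires them on its own. A minimal redis-py sketch (key names are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Write cache entries with an expiry so stale keys are pruned automatically
# instead of relying on a separate cleanup step.
r.set("cache:user:123:profile", '{"name": "example"}', ex=3600)  # expires after 1 hour

# A read after expiry simply misses and falls back to the source of truth.
value = r.get("cache:user:123:profile")
```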
Postmortem: A tale of how Discourse almost took us out.
"TL;DR

This morning we noticed that Sidekiq had 13K jobs; it quickly escalated to 14K and then 17K and kept growing, for reasons we do not understand yet. We know this was initially caused by a large backlog of emails that needed to be sent because of exceptions that were occurring due to this bug. This is when things got interesting and got wildly out of control."
postmortem 
may 2016 by peakscale
Elastic Cloud Outage: Root Cause and Impact Analysis | Elastic
"What happened behind the scenes was that our Apache ZooKeeper cluster lost quorum, for the first time in more than three years. After recent maintenance, a heap space misconfiguration on the new nodes resulted in high memory pressure on the ZooKeeper quorum nodes, causing ZooKeeper to spend almost all CPU garbage collecting. When an auxiliary service that watches a lot of the ZooKeeper database reconnected, this threw ZooKeeper over the top, which in turn caused other services to reconnect – resulting in a thundering herd effect that exacerbated the problem."
postmortem 
may 2016 by peakscale
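The thundering-herd half of this, every watcher reconnecting at the same instant and pushing the cluster back over the edge, is commonly mitigated by reconnecting with exponential backoff plus jitter. A generic sketch, not Elastic's client code:

```python
import random
import time

def reconnect_with_backoff(connect, base=1.0, max_delay=60.0):
    """Retry `connect` with exponential backoff and full jitter, so clients that
    all lost their sessions at the same moment do not all reconnect at once."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            delay = random.uniform(0, min(max_delay, base * (2 ** attempt)))
            time.sleep(delay)
            attempt += 1
```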
Connectivity issues with Cloud VPN in asia-east1 - Google Groups
"On Monday, 11 April, 2016, Google Compute Engine instances in all regions
lost external connectivity for a total of 18 minutes"
google  postmortem 
april 2016 by peakscale
Gliffy Online System Outage : Gliffy Support Desk
"On working to resolve the issue, an administrator accidentally deleted the production database."
postmortem 
march 2016 by peakscale
What Happened: Adobe Creative Cloud Update Bug
"Wednesday night we started getting support tickets relating to the .bzvol file being removed from computers. Normally the pop-up sends people to this (which we have since edited to highlight this current issue): bzvol webpage. The problem was, the folks on Mac kept reporting that our fix did not work and that they kept getting the error. Our support team contacted our lead Mac developer for help trying to troubleshoot and figure out what was causing this surge."
postmortem 
february 2016 by peakscale
January 28th Incident Report · GitHub
"Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. "

"We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code."
postmortem 
february 2016 by peakscale
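The second quote describes a hard dependency on the cache in the boot path. The usual alternative is a soft dependency: probe the cache at startup, but let the application boot in a degraded mode when it is unreachable. A hypothetical sketch, not GitHub's code:

```python
import logging
import redis

def connect_cache(url="redis://localhost:6379/0", timeout=0.5):
    """Return a Redis client, or None if the cache is unreachable at boot.
    Callers treat a missing cache as 'always miss' instead of failing to start."""
    try:
        client = redis.Redis.from_url(url, socket_connect_timeout=timeout)
        client.ping()
        return client
    except redis.exceptions.ConnectionError:
        logging.warning("cache unavailable at boot; starting in degraded mode")
        return None

cache = connect_cache()

def get_setting(key, fallback):
    if cache is None:
        return fallback
    return cache.get(key) or fallback
```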
Linode Blog » The Twelve Days of Crisis – A Retrospective on Linode’s Holiday DDoS Attacks
"Lesson one: don’t depend on middlemen
Lesson two: absorb larger attacks
Lesson three: let customers know what’s happening"
postmortem  networks 
january 2016 by peakscale
Outage postmortem (2015-12-17 UTC) : Stripe: Help & Support
"the retry feedback loop and associated performance degradation prevented us from accepting new API requests at all"
postmortem 
december 2015 by peakscale
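A retry feedback loop forms when clients answer degradation with even more traffic. The usual client-side guard (a sketch of the general technique, not Stripe's implementation) is a small, capped number of retries with exponential backoff and jitter, so a degraded API sees at most a constant factor of extra load:

```python
import random
import time

import requests

def post_with_bounded_retries(url, payload, max_attempts=3):
    """Retry server errors a bounded number of times with jittered backoff,
    then give up, instead of retrying indefinitely and compounding the load."""
    for attempt in range(max_attempts):
        resp = requests.post(url, json=payload, timeout=5)
        if resp.status_code < 500:
            return resp
        time.sleep(random.uniform(0, 2 ** attempt))
    return resp
```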
Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness
"To add promotions in the same Elasticsearch, the CMS team decided to add a new doctype/mapping in Elasticsearch called promotions. The feature was tested locally by the developer and it worked fine. No issues were caught anywhere during testing and the code was pushed to production.

The feature came into use when our content team started curating the content past midnight for the sale starting the next day. Once they added the content and started the re-indexing procedure (which is a manual button click), our consumer app stopped working. As soon as the team started indexing the content around 4:30 AM, our app stopped working. Our search queries started returning NumberFormatException (add more details here) on our price field."
postmortem 
december 2015 by peakscale
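The failure pattern here is Elasticsearch's dynamic mapping inferring a field's type from whatever document arrives first, after which values of another type (the price field above) fail to parse. A common mitigation is an explicit mapping with dynamic mapping disabled; a sketch against a recent Elasticsearch version, with hypothetical index and field names, using the plain REST API:

```python
import requests

# Hypothetical index and fields. "dynamic": "strict" rejects unexpected fields
# at index time instead of silently changing how existing fields are typed.
mapping = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "double"},
        },
    }
}
requests.put("http://localhost:9200/products", json=mapping, timeout=10)
```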
400 errors when trying to create an external (L2) Load Balancer for GCE/GKE services - Google Groups
"a minor update to the Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing."
postmortem  google 
december 2015 by peakscale
Network Connectivity and Latency Issues in Europe
"On Tuesday, 10 November 2015, outbound traffic going through one of our
European routers from both Google Compute Engine and Google App Engine
experienced high latency for a duration of 6h43m minutes. If your service
or application was affected, we apologize — this is not the level of
quality and reliability we strive to offer you, and we have taken and are
taking immediate steps to improve the platform’s performance and
availability. "
postmortem  google  networks 
november 2015 by peakscale
Spreedly Status - 503 Service Unavailable
"The issue here was an unbounded queue. We'll address that by leveraging rsyslog's advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off it's queue, so we are going to find a way to be alerted as soon as that is the case."
postmortem 
october 2015 by peakscale
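The general lesson, an unbounded queue eventually takes its host down with it, is independent of rsyslog. A minimal sketch of a bounded in-process queue that keeps the must-log messages, drops and counts the rest, and gives you a counter to alert on (not rsyslog configuration):

```python
import queue

log_queue = queue.Queue(maxsize=10_000)   # bounded: memory use has a hard ceiling
dropped = 0

def enqueue_log(line, critical=False):
    """Critical lines block until there is room; everything else is dropped and
    counted when the queue is full, so an alert can fire on the counter."""
    global dropped
    if critical:
        log_queue.put(line)          # must always be logged
        return
    try:
        log_queue.put_nowait(line)
    except queue.Full:
        dropped += 1
```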
CircleCI Status - Load balancer misconfiguration
"A load balancer misconfiguration briefly prevented us from serving content. We caught the problem and are fixing"
postmortem 
october 2015 by peakscale
Voicebase Status - High API latency caused by multiple issues in AWS including SQS API errors
"The root cause appears to be a number of problem with AWS, including a very high failure rate with the Amazon SQS API, and with the Amazon DynamoDB service"
postmortem 
october 2015 by peakscale
Codeship Status - Intermittent Website Availability Issues
"Codeship's DNS provider has implemented new hardware and networking to overcome their ongoing denial of service attack. They report some name servers are coming back online, but they are still dealing with a partial DNS outage"
postmortem 
october 2015 by peakscale
Linode Status - Network Issues within London Datacenter
"An older generation switch was identified that had a malfunctioning transceiver module. Under normal conditions, the full 1+1 hardware redundancy within the London network fabric would have isolated this failure without any functional impact. However, this transceiver module had not failed completely; rather, the module was experiencing severe voltage fluctuation, causing it to 'flap' in an erratic manner."
postmortem  networks 
october 2015 by peakscale
Keen IO Status - Query Service Errors
"Our internal load balancer (HAProxy) got stuck in an ornery state, and it took us a while to realize it was the load balancer instead of the actual services causing the errors."
postmortem 
october 2015 by peakscale
AWeber Status - Isolated Malware Incident
"We have identified an isolated incident of a website that uses AWeber has been infected by malware. As a response, Google has marked all links from AWeber customers using click tracking (redirecting through clicks.aweber.com) as potential malware"
postmortem 
october 2015 by peakscale
Chargify Status - 4 minute outage
"For approximately 4 minutes we just experienced an unexpected outage due to a failure of thedatabase load balancer to recover from planned maintenance"
postmortem 
october 2015 by peakscale
LiveChat Status - Connection issues
"identified the issue"

"working on a fix"

"fix deployed"

"connection issues caused by memory overload"
postmortem 
october 2015 by peakscale
StatusPage.io Status - Site under heavy load. Pages may be slow or unresponsive.
META

"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
postmortem 
october 2015 by peakscale
Switch Status - Inbound Calling Experienced Intermittent Issues
"5 minute issue was caused by a major traffic spike that impacted some inbound call attempts. Further investigation into the nature of these calls is ongoing"
postmortem 
october 2015 by peakscale
Greenhouse Status - AWS US-East-1 Partial Outage
"Our cloud hosting provider (AWS) is currently experiencing significant service degradation. The knock-on effect is decreased response times and reliability for the Greenhouse application, API, and job boards. "
postmortem 
october 2015 by peakscale
VictorOps Status - Isolated Service Disruption
"our database cluster encountered an error condition that caused it to stop processing queries for approximately 30 minutes. During that time, our WebUI and mobile client access was unavailable, and we were unable to process and deliver alerts. We have identified configuration settings in the cluster that will prevent a recurrence of the error condition"
postmortem 
october 2015 by peakscale
Greenhouse Status - Outage
"a new release triggered a doubling of connections to our database--this caused the database to become saturated with connections, causing approximately 10 minutes of system-wide downtime"
postmortem 
october 2015 by peakscale
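The usual guard against a release doubling database connections is an explicit, capped connection pool per application instance, sized so the total stays under the database's limit. A SQLAlchemy sketch; the DSN and numbers are made up:

```python
from sqlalchemy import create_engine

# Made-up DSN and limits. With pool_size and max_overflow capped per instance,
# a misbehaving release exhausts its own pool (and fails fast) rather than
# saturating the database with connections.
engine = create_engine(
    "postgresql://app:secret@db.internal/appdb",
    pool_size=10,      # steady-state connections per instance
    max_overflow=5,    # brief bursts allowed, then callers wait
    pool_timeout=30,   # raise instead of queueing forever
)
```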
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem  aws  networks 
october 2015 by peakscale
June 15th Outage — HipChat Blog
" the recent Mac client release which had the much anticipated “multiple account” feature also had a subtle reconnection bug that only manifested under very high load"
postmortem 
october 2015 by peakscale
Mikhail Panchenko [discussion of a long past Flickr problem]
" Inserting data into RDMS indexes is relatively expensive, and usually involves at least some locking. Note that dequeueing jobs also involves an index update, so even marking jobs as in progress or deleting on completion runs into the same locks. So now you have contention from a bunch of producers on a single resource, the updates to which are getting more and more expensive and time consuming. Before long, you're spending more time updating the job index than you are actually performing the jobs. The "queue" essentially fails to perform one of its very basic functions."
postmortem 
october 2015 by peakscale
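The contention described, producers and workers all fighting over the same index and row locks, is why database-backed queues tend to either move to a purpose-built broker or at least dequeue with SELECT ... FOR UPDATE SKIP LOCKED, so workers claim different rows without blocking each other. A Postgres-flavoured sketch with a hypothetical jobs table:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=worker")   # assumed connection details

def claim_job(conn):
    """Claim one pending job without waiting on rows another worker has locked."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE jobs
               SET status = 'in_progress'
             WHERE id = (SELECT id FROM jobs
                          WHERE status = 'pending'
                          ORDER BY id
                          LIMIT 1
                            FOR UPDATE SKIP LOCKED)
            RETURNING id, payload
            """
        )
        row = cur.fetchone()
    conn.commit()
    return row
```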
Kafkapocalypse: a postmortem on our service outage | Parse.ly
" Kafka is so efficient about its resource disk and CPU consumption, that we were running Kafka brokers on relatively modest Amazon EC2 nodes that did not have particularly high network capabilities. At some point, we were hitting operating system network limits and the brokers would simply become unavailable. These limits were probably enforced by Linux, Amazon’s Xen hypervisor, the host machine’s network hardware, or some combination.

The real problem here isn’t failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes. It was a classic Cascading Failure."
postmortem 
october 2015 by peakscale
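The cascading-failure arithmetic is worth making explicit: if every broker already runs close to its network limit, the survivors cannot absorb a single failure. A back-of-the-envelope check with made-up numbers:

```python
def survivors_overloaded(brokers, utilization):
    """After one broker fails, its share of traffic spreads across the rest.
    Returns True if the surviving brokers would exceed their own limit."""
    return utilization * brokers / (brokers - 1) > 1.0

print(survivors_overloaded(brokers=5, utilization=0.70))  # False: headroom absorbs the failure
print(survivors_overloaded(brokers=5, utilization=0.90))  # True: one failure cascades
```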
Travis CI Status - High queue times on OSX builds (.com and .org)
"When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200."
postmortem 
october 2015 by peakscale
Post-mortem -- S3 Outage | Status.io Blog
"Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken."
postmortem 
october 2015 by peakscale
Incident documentation/20150814-MediaWiki - Wikitech
"This bug was not caught on the beta cluster, because the code-path is exercised when converting text from one language variant to another, which does not happen frequently in that environment."
postmortem 
october 2015 by peakscale
GitLab.com outage on 2015-09-01 | GitLab
"We are still in the dark about the cause of the NFS slowdowns. We see no spikes of any kind of web requests around the slowdowns. The backend server only shows the ext4 errors mentioned above, which do not coincide with the NFS trouble, and no NFS error messages."
postmortem 
october 2015 by peakscale
Opbeat Status - We're experiencing another major database cluster connectivity issue
"During the master database outages, opbeat.com as well as our intake was unavailable."

" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
postmortem 
october 2015 by peakscale
Flying Circus Status - VM performance and stability issues
"After checking the virtualisation servers we saw that many of them had too many virtual machines assigned to them, consuming much more memory than the host actually had. "
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
postmortem 
october 2015 by peakscale
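The missing constraint in the evacuation algorithm, never checking whether the target host can actually fit the VM, is easy to state as code. A simplified sketch with hypothetical host and VM records:

```python
def pick_host(vm, hosts):
    """Place a VM on the host with the most free memory, but only among hosts
    that can actually fit it; the reported bug amounts to skipping that filter."""
    candidates = [h for h in hosts if h["free_memory_mb"] >= vm["memory_mb"]]
    if not candidates:
        raise RuntimeError("no host has enough free memory; refusing to overcommit")
    return max(candidates, key=lambda h: h["free_memory_mb"])
```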
Faithlife Status - Leaf Switch Outage
"we lost all connectivity to the switch. At that point, it failed to continue passing traffic. This should not have been a problem since compute nodes have a link to each switch. The compute nodes should have recognized that link in the aggregation as “down” and discontinued use of that link. However, our compute nodes continued sending traffic to that link. To verify the compute nodes didn’t incorrectly see the downed link as “up”, we physically disconnected the links to the degraded switch. Unfortunately, the compute nodes still attempted to send traffic over that link. "
postmortem  networks 
october 2015 by peakscale
Outage report: 5 September 2015 - PythonAnywhere News
"t looked like a massive earlier spike in read activity across the filesystem had put some system processes into a strange state. User file storage is shared from the file storage system over to the servers where people's websites and consoles actually run over NFS. And the storage itself on the NFS servers uses DRBD for replication. As far as we could determine, a proportion (at least half) of the NFS processes were hanging, and the DRBD processes were running inexplicably slowly. This was causing access to file storage to frequently be slow, and to occasionally fail, for all servers using the file storage."
postmortem 
october 2015 by peakscale
Why did Stack Overflow time out? - Meta Stack Exchange
"A cascading failure of the firewall to apply properly leading keepalived VRRP unable to communicate properly made both load balancers think neither had a peer. This results in a bit of swapping as they fight for ARP. When NY-LB06 "won" that fight, the second failure came into play: the firewall module did not finish updating on the first puppet run meaning the server was fully ready to serve traffic (from a Layer 7/HAProxy standpoint), but was not accepting TCP connections from anyone yet."
postmortem 
october 2015 by peakscale
SNworks Status - Connectivity Issues
"The cache server is super important. Without it, our web servers can serve 10s of requests a second. With it, they can serve 1000s of requests per second.

Large news events + uncached pages = servers not being able to handle traffic demands."
postmortem 
october 2015 by peakscale
Customer.io Status - Extended outage since 11:41 pm EDT
" After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up.

An error prevented the cluster from returning to "operational"."
postmortem 
october 2015 by peakscale
Partial Photon Cloud outage on 04/30/2015 | Blog | Photon: Multiplayer Made Simple
"The root cause was an outage of Microsoft Azure’s Network Infrastructure"
postmortem 
october 2015 by peakscale
Simplero Status - Down right now
"The root cause was a disk that filled up on a secondary database server.

It happened while I was at an event at the Empire State Building. At first I tried fixing it via my iPhone, but I had to realize that I couldn't, and instead hopped on a train back home to my laptop."
postmortem 
october 2015 by peakscale
Divshot Status - Serious Platform Outage
"This morning around 7:20am Pacific time, several platform EC2 instances began failing and our load balancer began returning 503 errors. Ordinarily our scaling configuration would terminate and replace unhealthy instances, but for an as-yet-undetermined reason all instances became unhealthy and were not replaced. This caused a widespread outage that lasted for nearly two hours."
postmortem 
october 2015 by peakscale
More Details on Today's Outage [2010]
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid."
postmortem 
october 2015 by peakscale
OHIO: Office of Information Technology |Anatomy of a network outage: Aug 20 - Sep 4, 2015
"If an IT system has a hidden weakness, fall opening will expose it.

This year, the 17,000 additional devices that students and staff brought with them to campus did just that, overwhelming our core network for over two weeks."
postmortem  networks 
october 2015 by peakscale
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
"Our automated tooling actually filed it as two separate change requests: one to add a new database index and a second to remove the old database index. Both of these change requests were reflected in our dashboard for database operators, intermingled with many other alerts. The dashboard did not indicate that the deletion request depended on the successful completion of the addition request"
postmortem 
october 2015 by peakscale
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
"But, on Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests"

"With a larger size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers. We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
postmortem  aws 
september 2015 by peakscale
Postmortem for July 27 outage of the Manta service - Blog - Joyent
"Clients experienced very high latency, and ultimately received 500-level errors in response to about 22% of all types of requests, including PUT, GET, and DELETE for both objects and directories. At peak, the error rate approached 27% of all requests, and for most of the window the error rate varied between 19 and 23%."
postmortem 
august 2015 by peakscale