peakscale + postmortem 356
Google Cloud - issue with Google Cloud Global Loadbalancers returning 502s
On Tuesday, 17 July 2018, from 12:17 to 12:49 PDT, Google Cloud HTTP(S) Load Balancers returned 502s for some requests they received. The proportion of 502 return codes varied from 33% to 87% during the period. Automated monitoring alerted Google’s engineering team to the event at 12:19, and at 12:44 the team had identified the probable root cause and deployed a fix.
google
postmortem
5 weeks ago by peakscale
Google Cloud Status Dashboard
"Configuration changes being rolled out on the evening of the incident were not applied in the intended order. This resulted in an incomplete configuration change becoming live in some zones, subsequently triggering the failure of customer jobs. During the process of rolling back the configuration, another incorrect configuration change was inadvertently applied, causing the second batch of job failures."
postmortem
7 weeks ago by peakscale
Incident Management at Spotify | Labs
"A few weeks ago Spotify had one of the biggest incidents in the last few years. It caused a major outage for a big chunk of our European users. For a few hours the music playback experience was damaged. Our users would see high latency when playing music and some of them were unable to log in.
[...]
Two months before the big outage we had an incident connected with one of our smallest backend services: Popcount. Popcount (this is our internal name) is the service that takes care of storing the list of subscribers for each of our more than 1 billion playlists."
postmortem
7 weeks ago by peakscale
Github/2018-06-28 - Gentoo Wiki
"
An unknown entity gained control of an admin account for the Gentoo GitHub Organization and removed all access to the organization (and its repositories) from Gentoo developers. They then proceeded to make various changes to content. Gentoo Developers & Infrastructure escalated to GitHub support and the Gentoo Organization was frozen by GitHub staff. Gentoo has regained control of the Gentoo GitHub Organization and has reverted the bad commits and defaced content. "
postmortem
security
7 weeks ago by peakscale
Today we mitigated 1.1.1.1
"Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.
What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.
As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!"
postmortem
security
networks
12 weeks ago by peakscale
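The Cloudflare entry describes a classic trap: generic provisioning logic plus an out-of-band list of exceptions. A minimal sketch of what such an exception check can look like, using Python's standard ipaddress module (the two prefixes come from the post; the list and function names are hypothetical):

```python
import ipaddress

# Ranges the post calls out as special: the public DNS resolver prefixes.
# A list like this is exactly the kind of hardcoded gotcha the rework
# was meant to eliminate.
SPECIAL_RANGES = [
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("1.0.0.0/24"),
]

def overlaps_special_range(prefix: str) -> bool:
    """Return True if `prefix` overlaps any special range and therefore
    must not go through the generic provisioning path."""
    net = ipaddress.ip_network(prefix)
    return any(net.overlaps(special) for special in SPECIAL_RANGES)

assert overlaps_special_range("1.1.1.0/24")
assert not overlaps_special_range("104.16.0.0/16")
```

The incident, as the post tells it, is what happens when a check like this exists only as a manual exception and the new integration path never learns about it.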
Incident review: API and Dashboard outage on 10 October 2017 — GoCardless Blog
"On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard, lasting 1 hour and 50 minutes. Any requests made during that time failed, and returned an error.
The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.
This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours."
postmortem
may 2018 by peakscale
Google Cloud Status Dashboard
"On Wednesday 16 May 2018, Google Cloud Networking experienced loss of connectivity to external IP addresses located in us-east4 for a duration of 58 minutes."
postmortem
google
may 2018 by peakscale
February 28th DDoS Incident Report | GitHub Engineering
"On Wednesday, February 28, 2018 GitHub.com was unavailable from 17:21 to 17:26 UTC and intermittently unavailable from 17:26 to 17:30 UTC due to a distributed denial-of-service (DDoS) attack."
postmortem
security
march 2018 by peakscale
Epic Games' Fortnite
"Fortnite hit a new peak of 3.4 million concurrent players last Sunday… and that didn’t come without issues! The extreme load caused 6 different incidents between Saturday and Sunday, with a mix of partial and total service disruptions to Fortnite."
postmortem
february 2018 by peakscale
Incident 1290 | Heroku Status
"The routing layer that directs traffic from the internet to customer dynos has an extremely slow memory leak that has existed for some time. Typically, this memory leak has been mitigated by regular deploys of the router. Recently, however, deploys to this component have been less frequent. At some point, we crossed a tipping point and the memory leak was no longer automatically remediated by ongoing deployments.
This memory leak caused the processes in the routing layer to be killed and restarted. During this period of the processes restarting, they were unable to receive traffic and connections received EOF responses."
postmortem
october 2017 by peakscale
PagerDuty Status - Delayed Notifications
"Degraded performance of one of our Cassandra database clusters caused delays outside tolerance limits to the delivery of notifications and the dispatching of webhooks. The degradation in performance was triggered during the replacement of a failed virtual machine in the cluster. This maintenance was unplanned, as the failure of the host was unexpected.
The procedure used to replace the failed node triggered a chain reaction of load on other nodes in the cluster, which hampered this cluster’s ability to do its primary job of processing notifications."
postmortem
october 2017 by peakscale
Google Cloud Networking Incident #17002
"Any GCE instance that was live-migrated between 13:56 PDT on Tuesday 29 August 2017 and 08:32 on Wednesday 30 August 2017 became unreachable via Google Cloud Network or Internal Load Balancing until between 08:56 and 14:18 (for regions other than us-central1) or 20:16 (for us-central1) on Wednesday. See https://goo.gl/NjqQ31 for a visual representation of the cumulative number of instances live-migrated over time.
Our internal investigation shows that, at peak, 2% of GCE instances were affected by the issue."
google
postmortem
september 2017 by peakscale
Postmortem: 2017-04-11 Firewall Outage | Circonus
"We use a pair of firewall devices in an active/passive configuration with automatic failover should one of the devices become unresponsive. The firewall device in question went down, and automatic failover did not trigger for an unknown reason (we are still investigating). When we realized the problem, we killed off the bad firewall device, causing the secondary to promote itself to master and service to be restored."
postmortem
august 2017 by peakscale
Requests to Google Cloud Storage (GCS) JSON API experienced elevated error rates for a period of 3 hours and 15 minutes
"A low-level software defect in an internal API service that handles GCS JSON requests caused infrequent memory-related process terminations. These process terminations increased as a result of a large volume in requests to the GCS Transfer Service, which uses the same internal API service as the GCS JSON API. This caused an increased rate of 503 responses for GCS JSON API requests for 3.25 hours."
postmortem
google
july 2017 by peakscale
What did OVH learn from 24-hour outage? Water and servers do not mix
Including an article because the original incident log is in French.
postmortem
july 2017 by peakscale
Google Cloud Status Dashboard
"At the time of incident, Google engineers were upgrading the network topology and capacity of the region; a configuration error caused the existing links to be decommissioned before the replacement links could provide connectivity, resulting in a loss of connectivity for the asia-northeast1 region. Although the replacement links were already commissioned and appeared to be ready to serve, a network-routing protocol misconfiguration meant that the routes through those links were not able to carry traffic."
postmortem
google
june 2017 by peakscale
Update on the April 5th, 2017 Outage
"Within three minutes of the initial alerts, we discovered that our primary database had been deleted. Four minutes later we commenced the recovery process, using one of our time-delayed database replicas. Over the next four hours, we copied and restored the data to our primary and secondary replicas. The duration of the outage was due to the time it took to copy the data between the replicas and restore it into an active server."
postmortem
april 2017 by peakscale
Google Cloud Status Dashboard
"On Monday 30 January 2017, newly created Google Compute Engine instances, Cloud VPNs and network load balancers were unavailable for a duration of 2 hours 8 minutes."
postmortem
february 2017 by peakscale
The Travis CI Blog: The day we deleted our VM images
"In addition, our cleanup service had been briefly disabled to troubleshooting a potential race condition. Then we turned the automated cleanup back on. The service had a default hard coded amount of how many image names to query from our internal image catalog and it was set to 100.
When we started the cleanup service, the list of 100 image names, sorted by newest first, did not include our stable images, which were the oldest, did not get included in the results. Our cleanup service then promptly started deleting the older images from GCE, because its view of the world told it that those older images where no longer in use, i.e it looked like they were not in our catalog and all of our stable images got irrevocably deleted.
This immediately stopped builds from running. "
postmortem
september 2016 by peakscale
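The Travis CI failure compresses into a few lines of pseudologic: a catalog query capped at 100 results, sorted newest-first, treated as the authoritative list of images to keep. A hypothetical sketch (all names invented for illustration):

```python
def cleanup_images(catalog, gce):
    # BUG: the catalog query has a hardcoded limit and returns only the
    # 100 newest image names. The oldest images, including the stable
    # ones still in active use, fall outside this window.
    known_images = set(catalog.list_image_names(sort="newest", limit=100))

    for image in gce.list_images():
        # Anything absent from the truncated list looks unused, so the
        # old stable images get deleted along with genuinely stale ones.
        if image.name not in known_images:
            gce.delete_image(image.name)
```

One fix is to make "missing from a truncated page of results" distinguishable from "missing from the catalog", for example by paginating to exhaustion or by querying the catalog for each candidate name before deleting it.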
Google Cloud Status Dashboard
"While removing a faulty router from service, a new procedure for diverting traffic from the router was used. This procedure applied a new configuration that resulted in announcing some Google Cloud Platform IP addresses from a single point of presence in the southwestern US. As these announcements were highly specific they took precedence over the normal routes to Google's network and caused a substantial proportion of traffic for the affected network ranges to be directed to this one point of presence. This misrouting directly caused the additional latency some customers experienced.
Additionally this misconfiguration sent affected traffic to next-generation infrastructure that was undergoing testing. This new infrastructure was not yet configured to handle Cloud Platform traffic and applied an overly-restrictive packet filter."
postmortem
google
august 2016 by peakscale
Stack Exchange Network Status — Outage Postmortem - July 20, 2016
"The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our web servers. The post was in the homepage list, and that caused the expensive regular expression to be called on each home page view. "
postmortem
july 2016 by peakscale
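The mechanism behind the Stack Exchange outage is catastrophic backtracking: a pattern whose match time grows quadratically with input length, evaluated against a pathological post on every home-page render. A small self-contained demonstration (the pattern and input are illustrative, not the exact ones from the incident):

```python
import re
import time

# A trailing-whitespace pattern like this backtracks badly on a long run
# of spaces followed by one non-space character: the engine retries the
# match at every starting offset, and each attempt scans toward the end.
pattern = re.compile(r"\s+$")
pathological = " " * 20000 + "x"  # the spaces never reach end-of-string

start = time.perf_counter()
assert pattern.search(pathological) is None
print(f"one search took {time.perf_counter() - start:.1f}s")  # seconds, not microseconds
```

Replacing the scan with a non-backtracking equivalent (comparing against str.rstrip, for instance, or anchoring the pattern) brings this back to linear time.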
Summary of the AWS Service Event in the Sydney Region
"The service disruption primarily affected EC2 instances and their associated Elastic Block Store (“EBS”) volumes running in a single Availability Zone. "
aws
postmortem
june 2016 by peakscale
Crates.io is down [fixed] - The Rust Programming Language Forum
OK, a quick post-mortem:
At 9:45 AM PST I got a ping that crates.io was down and started looking into it. Connections via the website and from the 'cargo' command were timing out. From Heroku's logs it looks like the timeouts began around 9:10 AM.
From looking at logs it's clear that connections were timing out, and that a number of postgres queries were blocked updating the download statistics. These queries were occupying all available connections.
After killing outstanding queries the site is working again. It's not clear yet what the original cause was.
postmortem
june 2016 by peakscale
SNOW Status - Elevated Errors on SNOW Backend
"Todays outage was because of a mis-configuration in our Redis cluster, where we didn't automatically prune stale cache keys."
postmortem
may 2016 by peakscale
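A common guard against exactly this failure is to give every cache key a TTL at write time, so entries age out even when explicit pruning is misconfigured or never runs. A minimal sketch with the redis-py client (key name and value are hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
profile_json = '{"name": "Ada", "plan": "pro"}'  # hypothetical cached value

# Write-through with an expiry: even with no pruning job at all,
# the key disappears after an hour instead of going stale forever.
r.set("user:42:profile", profile_json, ex=3600)

# Keys written elsewhere can have a TTL attached after the fact.
r.expire("user:42:profile", 3600)
```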
Postmortem: A tale of how Discourse almost took us out.
"TL;DR
This morning we noticed that Sidekiq had 13K jobs; it quickly escalated to 14K and then 17K and kept growing, for reasons we do not understand yet. We know this was initially caused by a large backlog of emails that needed to be sent because of exceptions that were occurring due to this bug. This is when things got interesting, and got wildly out of control."
postmortem
may 2016 by peakscale
Elastic Cloud Outage: Root Cause and Impact Analysis | Elastic
"What happened behind the scenes was that our Apache ZooKeeper cluster lost quorum, for the first time in more than three years. After recent maintenance, a heap space misconfiguration on the new nodes resulted in high memory pressure on the ZooKeeper quorum nodes, causing ZooKeeper to spend almost all CPU garbage collecting. When an auxiliary service that watches a lot of the ZooKeeper database reconnected, this threw ZooKeeper over the top, which in turn caused other services to reconnect – resulting in a thundering herd effect that exacerbated the problem."
postmortem
may 2016 by peakscale
Connectivity issues with Cloud VPN in asia-east1 - Google Groups
"On Monday, 11 April, 2016, Google Compute Engine instances in all regions
lost external connectivity for a total of 18 minutes"
google
postmortem
april 2016 by peakscale
Gliffy Online System Outage : Gliffy Support Desk
"On working to resolve the issue, an administrator accidentally deleted the production database."
postmortem
march 2016 by peakscale
What Happened: Adobe Creative Cloud Update Bug
"Wednesday night we started getting support tickets relating to the .bzvol file being removed from computers. Normally the pop-up sends people to this (which we have since edited to highlight this current issue): bzvol webpage. The problem was, the folks on Mac kept reporting that our fix did not work and that they kept getting the error. Our support team contacted our lead Mac developer for help trying to troubleshoot and figure out what was causing this surge."
postmortem
february 2016 by peakscale
January 28th Incident Report · GitHub
"Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. "
"We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code."
postmortem
"We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code."
february 2016 by peakscale
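The Redis remark is worth dwelling on: a hard dependency in the boot path turns a cache outage into an application that cannot even start. A sketch of the difference between the two failure modes (hypothetical hostname and helper; not GitHub's code):

```python
import redis

cache = redis.Redis(host="cache.internal")  # hypothetical host

# Hard dependency: verifying the cache at boot means the app cannot
# start at all while Redis is down.
# cache.ping()  # would raise ConnectionError during a Redis outage

# Softer alternative: connect lazily and treat errors as cache misses,
# so the app boots and serves degraded responses instead of nothing.
def cache_get(key: str):
    try:
        return cache.get(key)
    except redis.exceptions.ConnectionError:
        return None  # degrade to a miss rather than fail the request
```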
Linode Blog » The Twelve Days of Crisis – A Retrospective on Linode’s Holiday DDoS Attacks
"Lesson one: don’t depend on middlemen
Lesson two: absorb larger attacks
Lesson three: let customers know what’s happening"
postmortem
networks
january 2016 by peakscale
Linode Status - An update from Linode about the recent DDoS attacks
Original update on Linode DDoS (there is a detailed followup)
postmortem
networks
january 2016 by peakscale
Outage postmortem (2015-12-17 UTC) : Stripe: Help & Support
"the retry feedback loop and associated performance degradation prevented us from accepting new API requests at all"
postmortem
december 2015 by peakscale
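A retry feedback loop forms when every failed call is retried immediately: the retries become load, load causes failures, failures cause more retries. The standard countermeasure is capped exponential backoff with jitter, sketched below (a generic pattern, not Stripe's implementation):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure (timeouts, 5xx responses)."""

def call_with_backoff(request, max_attempts=5, base=0.5, cap=30.0):
    """Retry with capped exponential backoff and full jitter, so a crowd
    of failing clients spreads its retries out instead of stampeding."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retry budget; surface the failure
            # Full jitter: sleep a random amount up to the capped backoff.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```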
Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness
"To add promotions in the same Elasticsearch, the CMS team decided to add a new doctype/mapping in Elasticsearch called promotions. The feature was tested locally by the developer and it worked fine. No issues were caught anywhere during testing and the code was pushed to production.
The feature came into use when our content team started curating the content past midnight for the sale starting the next day. Once they added the content and started the re-indexing procedure (which is a manual button click), our consumer app stopped working. As soon as the team started indexing the content around 4:30 AM, our app stopped working. Our search queries started returning NumberFormatException (add more details here) on our price field."
postmortem
december 2015 by peakscale
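The "flexibility" in the title is Elasticsearch's dynamic mapping: the first document seen for a new field silently decides its type, and a conflicting guess on a shared field like price can break queries for every consumer of the index. One mitigation is an explicit mapping with dynamic set to strict, so documents with unmapped fields are rejected at index time. A sketch over the JSON REST API (index and field names are illustrative, and the mapping syntax differs across Elasticsearch versions):

```python
import json
import urllib.request

mapping = {
    "mappings": {
        "dynamic": "strict",              # reject documents with unmapped fields
        "properties": {
            "price": {"type": "double"},  # pin the type that queries depend on
            "title": {"type": "text"},
        },
    }
}

req = urllib.request.Request(
    "http://localhost:9200/catalog",      # hypothetical index
    data=json.dumps(mapping).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)
```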
400 errors when trying to create an external (L2) Load Balancer for GCE/GKE services - Google Groups
"a minor update to the Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing."
postmortem
google
december 2015 by peakscale
Postmortem: Server compromised due to publicly accessible Redis — Kevin Chen
"My server was compromised through Redis and used as part of a DDOS"
postmortem
security
december 2015 by peakscale
Network Connectivity and Latency Issues in Europe
"On Tuesday, 10 November 2015, outbound traffic going through one of our
European routers from both Google Compute Engine and Google App Engine
experienced high latency for a duration of 6h43m minutes. If your service
or application was affected, we apologize — this is not the level of
quality and reliability we strive to offer you, and we have taken and are
taking immediate steps to improve the platform’s performance and
availability. "
postmortem
google
networks
november 2015 by peakscale
Spreedly Status - 503 Service Unavailable
"The issue here was an unbounded queue. We'll address that by leveraging rsyslog's advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off it's queue, so we are going to find a way to be alerted as soon as that is the case."
postmortem
october 2015 by peakscale
CircleCI Status - Load balancer misconfiguration
"A load balancer misconfiguration briefly prevented us from serving content. We caught the problem and are fixing"
postmortem
october 2015 by peakscale
Voicebase Status - High API latency caused by multiple issues in AWS including SQS API errors
"The root cause appears to be a number of problem with AWS, including a very high failure rate with the Amazon SQS API, and with the Amazon DynamoDB service"
postmortem
october 2015 by peakscale
Codeship Status - Intermittent Website Availability Issues
"Codeship's DNS provider has implemented new hardware and networking to overcome their ongoing denial of service attack. They report some name servers are coming back online, but they are still dealing with a partial DNS outage"
postmortem
october 2015 by peakscale
Linode Status - Network Issues within London Datacenter
"An older generation switch was identified that had a malfunctioning transceiver module. Under normal conditions, the full 1+1 hardware redundancy within the London network fabric would have isolated this failure without any functional impact. However, this transceiver module had not failed completely; rather, the module was experiencing severe voltage fluctuation, causing it to 'flap' in an erratic manner."
postmortem
networks
october 2015 by peakscale
Keen IO Status - Query Service Errors
"Our internal load balancer (HAProxy) got stuck in an ornery state, and it took us a while to realize it was the load balancer instead of the actual services causing the errors."
postmortem
october 2015 by peakscale
AWeber Status - Isolated Malware Incident
"We have identified an isolated incident of a website that uses AWeber has been infected by malware. As a response, Google has marked all links from AWeber customers using click tracking (redirecting through clicks.aweber.com) as potential malware"
postmortem
october 2015 by peakscale
Chargify Status - 4 minute outage
"For approximately 4 minutes we just experienced an unexpected outage due to a failure of thedatabase load balancer to recover from planned maintenance"
postmortem
october 2015 by peakscale
LiveChat Status - Connection issues
"identified the issue"
"working on a fix"
"fix deployed"
"connection issues caused by memory overload"
postmortem
"working on a fix"
"fix deployed"
"connection issues caused by memory overload"
october 2015 by peakscale
StatusPage.io Status - Site under heavy load. Pages may be slow or unresponsive.
META
"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
postmortem
"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
october 2015 by peakscale
Switch Status - Inbound Calling Experienced Intermittent Issues
"5 minute issue was caused by a major traffic spike that impacted some inbound call attempts. Further investigation into the nature of these calls is ongoing"
postmortem
october 2015 by peakscale
Greenhouse Status - AWS US-East-1 Partial Outage
"Our cloud hosting provider (AWS) is currently experiencing significant service degradation. The knock-on effect is decreased response times and reliability for the Greenhouse application, API, and job boards. "
postmortem
october 2015 by peakscale
VictorOps Status - Isolated Service Disruption
"our database cluster encountered an error condition that caused it to stop processing queries for approximately 30 minutes. During that time, our WebUI and mobile client access was unavailable, and we were unable to process and deliver alerts. We have identified configuration settings in the cluster that will prevent a recurrence of the error condition"
postmortem
october 2015 by peakscale
Greenhouse Status - Outage
"a new release triggered a doubling of connections to our database--this caused the database to become saturated with connections, causing approximately 10 minutes of system-wide downtime"
postmortem
october 2015 by peakscale
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem
aws
networks
october 2015 by peakscale
June 15th Outage — HipChat Blog
" the recent Mac client release which had the much anticipated “multiple account” feature also had a subtle reconnection bug that only manifested under very high load"
postmortem
october 2015 by peakscale
Mikhail Panchenko [discussion of a long past Flickr problem]
" Inserting data into RDMS indexes is relatively expensive, and usually involves at least some locking. Note that dequeueing jobs also involves an index update, so even marking jobs as in progress or deleting on completion runs into the same locks. So now you have contention from a bunch of producers on a single resource, the updates to which are getting more and more expensive and time consuming. Before long, you're spending more time updating the job index than you are actually performing the jobs. The "queue" essentially fails to perform one of its very basic functions."
postmortem
october 2015 by peakscale
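This is the classic argument against "a table with a status column" as a queue: enqueue, claim, and completion all update the same index pages and row locks. Where a relational queue is still the right trade-off, Postgres (9.5+) offers FOR UPDATE SKIP LOCKED so competing workers do not serialize on the same row. A sketch of that dequeue (schema and connection string are hypothetical):

```python
import psycopg2

DEQUEUE = """
UPDATE jobs
   SET status = 'in_progress'
 WHERE id = (
       SELECT id
         FROM jobs
        WHERE status = 'queued'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED  -- competing workers skip claimed rows
       )
RETURNING id, payload;
"""

conn = psycopg2.connect("dbname=app")  # hypothetical database
with conn, conn.cursor() as cur:       # commits the claim on exit
    cur.execute(DEQUEUE)
    job = cur.fetchone()               # None when no unclaimed work exists
```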
Kafkapocalypse: a postmortem on our service outage | Parse.ly
" Kafka is so efficient about its resource disk and CPU consumption, that we were running Kafka brokers on relatively modest Amazon EC2 nodes that did not have particularly high network capabilities. At some point, we were hitting operating system network limits and the brokers would simply become unavailable. These limits were probably enforced by Linux, Amazon’s Xen hypervisor, the host machine’s network hardware, or some combination.
The real problem here isn’t failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes. It was a classic Cascading Failure."
postmortem
october 2015 by peakscale
Travis CI Status - High queue times on OSX builds (.com and .org)
"When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200."
postmortem
october 2015 by peakscale
Post-mortem -- S3 Outage | Status.io Blog
"Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken."
postmortem
october 2015 by peakscale
Incident documentation/20150814-MediaWiki - Wikitech
"This bug was not caught on the beta cluster, because the code-path is exercised when converting text from one language variant to another, which does not happen frequently in that environment."
postmortem
october 2015 by peakscale
GitLab.com outage on 2015-09-01 | GitLab
"We are still in the dark about the cause of the NFS slowdowns. We see no spikes of any kind of web requests around the slowdowns. The backend server only shows the ext4 errors mentioned above, which do not coincide with the NFS trouble, and no NFS error messages."
postmortem
october 2015 by peakscale
Opbeat Status - We're experiencing another major database cluster connectivity issue
"During the master database outages, opbeat.com as well as our intake was unavailable."
" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
postmortem
" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
october 2015 by peakscale
Flying Circus Status - VM performance and stability issues
"After checking the virtualisation servers we saw that many of them had too many virtual machines assigned to them, consuming much more memory than the host actually had. "
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
postmortem
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
october 2015 by peakscale
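The placement bug generalizes well beyond virtualization: an optimizer that picks the "best" host without a hard feasibility check will happily overcommit. A hypothetical sketch of the missing constraint (all names invented):

```python
def place_vm(vm, hosts):
    """Pick a host for `vm`, refusing to overcommit memory. `vm` and each
    host are assumed to expose memory_mb / free_memory_mb attributes."""
    # The feasibility filter is the step the evacuation algorithm
    # reportedly lacked: consider only hosts that can hold the VM.
    candidates = [h for h in hosts if h.free_memory_mb >= vm.memory_mb]
    if not candidates:
        raise RuntimeError("no host has capacity; refusing to overcommit")
    # Among feasible hosts, prefer the one with the most headroom.
    return max(candidates, key=lambda h: h.free_memory_mb)
```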