January 28th Incident Report · GitHub
"Our early response to the event was complicated by the fact that many of our ChatOps systems were on servers that had rebooted. We do have redundancy built into our ChatOps systems, but this failure still caused some amount of confusion and delay at the very beginning of our response. "

"We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code."
postmortem 
february 2016
Linode Blog » The Twelve Days of Crisis – A Retrospective on Linode’s Holiday DDoS Attacks
"Lesson one: don’t depend on middlemen
Lesson two: absorb larger attacks
Lesson three: let customers know what’s happening"
postmortem  networks 
january 2016
Outage postmortem (2015-12-17 UTC) : Stripe: Help & Support
"the retry feedback loop and associated performance degradation prevented us from accepting new API requests at all"
postmortem 
december 2015
Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness
"To add promotions in the same Elasticsearch, the CMS team decided to add a new doctype/mapping in Elasticsearch called promotions. The feature was tested locally by the developer and it worked fine. No issues were caught anywhere during testing and the code was pushed to production.

The feature came into use when our content team started curating the content past midnight for the sale starting the next day. Once they added the content and started the re-indexing procedure (which is a manual button click), our consumer app stopped working. As soon as the team started indexing the content around 4:30 AM, our app stopped working. Our search queries started returning NumberFormatException (add more details here) on our price field."
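The underlying hazard is dynamic mapping: in Elasticsearch versions of that era, fields with the same name across doctypes in one index shared a mapping, so a new doctype could change how an existing field (like price) was interpreted. A hedged sketch of the defensive move, using the Python requests library and an invented index name (not the team's actual configuration):

    import requests

    # Declare the promotions doctype explicitly instead of letting dynamic mapping
    # guess, and keep shared field names on the numeric type the consumer app's
    # queries expect. Syntax follows the Elasticsearch 2.x mapping API.
    mapping = {
        "promotions": {
            "dynamic": "strict",                  # reject fields nobody declared
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "double"},      # must agree with the products doctype
            },
        }
    }

    resp = requests.put("http://localhost:9200/catalog/_mapping/promotions", json=mapping)
    resp.raise_for_status()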
postmortem 
december 2015
400 errors when trying to create an external (L2) Load Balancer for GCE/GKE services - Google Groups
"a minor update to the Compute Engine API inadvertently changed the case-sensitivity of the “sessionAffinity” enum variable in the target pool definition, and this variation was not covered by testing."
postmortem  google 
december 2015
Network Connectivity and Latency Issues in Europe
"On Tuesday, 10 November 2015, outbound traffic going through one of our
European routers from both Google Compute Engine and Google App Engine
experienced high latency for a duration of 6h43m minutes. If your service
or application was affected, we apologize — this is not the level of
quality and reliability we strive to offer you, and we have taken and are
taking immediate steps to improve the platform’s performance and
availability. "
postmortem  google  networks 
november 2015
Spreedly Status - 503 Service Unavailable
"The issue here was an unbounded queue. We'll address that by leveraging rsyslog's advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off it's queue, so we are going to find a way to be alerted as soon as that is the case."
postmortem 
october 2015
CircleCI Status - Load balancer misconfiguration
"A load balancer misconfiguration briefly prevented us from serving content. We caught the problem and are fixing"
postmortem 
october 2015
Voicebase Status - High API latency caused by multiple issues in AWS including SQS API errors
"The root cause appears to be a number of problem with AWS, including a very high failure rate with the Amazon SQS API, and with the Amazon DynamoDB service"
postmortem 
october 2015
Codeship Status - Intermittent Website Availability Issues
"Codeship's DNS provider has implemented new hardware and networking to overcome their ongoing denial of service attack. They report some name servers are coming back online, but they are still dealing with a partial DNS outage"
postmortem 
october 2015
Linode Status - Network Issues within London Datacenter
"An older generation switch was identified that had a malfunctioning transceiver module. Under normal conditions, the full 1+1 hardware redundancy within the London network fabric would have isolated this failure without any functional impact. However, this transceiver module had not failed completely; rather, the module was experiencing severe voltage fluctuation, causing it to 'flap' in an erratic manner."
postmortem  networks 
october 2015
Keen IO Status - Query Service Errors
"Our internal load balancer (HAProxy) got stuck in an ornery state, and it took us a while to realize it was the load balancer instead of the actual services causing the errors."
postmortem 
october 2015
AWeber Status - Isolated Malware Incident
"We have identified an isolated incident of a website that uses AWeber has been infected by malware. As a response, Google has marked all links from AWeber customers using click tracking (redirecting through clicks.aweber.com) as potential malware"
postmortem 
october 2015
Chargify Status - 4 minute outage
"For approximately 4 minutes we just experienced an unexpected outage due to a failure of thedatabase load balancer to recover from planned maintenance"
postmortem 
october 2015
LiveChat Status - Connection issues
"identified the issue"

"working on a fix"

"fix deployed"

"connection issues caused by memory overload"
postmortem 
october 2015
StatusPage.io Status - Site under heavy load. Pages may be slow or unresponsive.
META

"A large influx of traffic caused most of the web tier to become unresponsive for a period of a few minutes. This influx of traffic as stopped, and site functioning has returned to normal."
postmortem 
october 2015
Switch Status - Inbound Calling Experienced Intermittent Issues
"5 minute issue was caused by a major traffic spike that impacted some inbound call attempts. Further investigation into the nature of these calls is ongoing"
postmortem 
october 2015
Greenhouse Status - AWS US-East-1 Partial Outage
"Our cloud hosting provider (AWS) is currently experiencing significant service degradation. The knock-on effect is decreased response times and reliability for the Greenhouse application, API, and job boards. "
postmortem 
october 2015
VictorOps Status - Isolated Service Disruption
"our database cluster encountered an error condition that caused it to stop processing queries for approximately 30 minutes. During that time, our WebUI and mobile client access was unavailable, and we were unable to process and deliver alerts. We have identified configuration settings in the cluster that will prevent a recurrence of the error condition"
postmortem 
october 2015
Greenhouse Status - Outage
"a new release triggered a doubling of connections to our database--this caused the database to become saturated with connections, causing approximately 10 minutes of system-wide downtime"
postmortem 
october 2015
Route Leak Causes Amazon and AWS Outage
"The forwarding loss combined with the sudden appearance of these two ASNs in the BGP paths strongly suggested a BGP route leak by Axcelx. Looking at the raw BGP data showed the exact BGP updates that resulted in this leak."
postmortem  aws  networks 
october 2015
June 15th Outage — HipChat Blog
" the recent Mac client release which had the much anticipated “multiple account” feature also had a subtle reconnection bug that only manifested under very high load"
postmortem 
october 2015
Mikhail Panchenko [discussion of a long past Flickr problem]
" Inserting data into RDMS indexes is relatively expensive, and usually involves at least some locking. Note that dequeueing jobs also involves an index update, so even marking jobs as in progress or deleting on completion runs into the same locks. So now you have contention from a bunch of producers on a single resource, the updates to which are getting more and more expensive and time consuming. Before long, you're spending more time updating the job index than you are actually performing the jobs. The "queue" essentially fails to perform one of its very basic functions."
postmortem 
october 2015
Kafkapocalypse: a postmortem on our service outage | Parse.ly
" Kafka is so efficient about its resource disk and CPU consumption, that we were running Kafka brokers on relatively modest Amazon EC2 nodes that did not have particularly high network capabilities. At some point, we were hitting operating system network limits and the brokers would simply become unavailable. These limits were probably enforced by Linux, Amazon’s Xen hypervisor, the host machine’s network hardware, or some combination.

The real problem here isn’t failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes. It was a classic Cascading Failure."
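A quick headroom calculation shows why the failure was correlated rather than isolated: if every broker already sits near its network ceiling, the survivors of a single node loss inherit enough extra traffic to hit the limit themselves. Numbers below are illustrative, not Parse.ly's:

    brokers = 5
    per_broker_mbps = 800        # steady-state egress per broker
    nic_limit_mbps = 1000        # effective per-instance network ceiling

    after_one_failure = per_broker_mbps * brokers / (brokers - 1)
    print(f"{after_one_failure:.0f} Mbps per surviving broker "
          f"({after_one_failure / nic_limit_mbps:.0%} of the limit)")
    # 1000 Mbps -> the survivors are already at the ceiling, so the next failure
    # follows quickly: the classic cascading pattern described above.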
postmortem 
october 2015
Travis CI Status - High queue times on OSX builds (.com and .org)
"When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200."
postmortem 
october 2015
Post-mortem -- S3 Outage | Status.io Blog
"Immediately we realized that the static resources (images, scripts) hosted on Amazon S3 were sporadically failing to load. A quick manual test of the S3 connection confirmed it was broken."
postmortem 
october 2015
Incident documentation/20150814-MediaWiki - Wikitech
"This bug was not caught on the beta cluster, because the code-path is exercised when converting text from one language variant to another, which does not happen frequently in that environment."
postmortem 
october 2015
GitLab.com outage on 2015-09-01 | GitLab
"We are still in the dark about the cause of the NFS slowdowns. We see no spikes of any kind of web requests around the slowdowns. The backend server only shows the ext4 errors mentioned above, which do not coincide with the NFS trouble, and no NFS error messages."
postmortem 
october 2015
Opbeat Status - We're experiencing another major database cluster connectivity issue
"During the master database outages, opbeat.com as well as our intake was unavailable."

" We've reached out to AWS to understand what caused the connectivity issue in the first place, but they have been unable to find the cause."
postmortem 
october 2015
Flying Circus Status - VM performance and stability issues
"After checking the virtualisation servers we saw that many of them had too many virtual machines assigned to them, consuming much more memory than the host actually had. "
"Looking at the algorithm that performed the evacuation when maintenance was due, we found that it located virtual machines to the best possible server. What it did not do was to prohibit machines being placed on hosts that already have too many machines. "
postmortem 
october 2015
Faithlife Status - Leaf Switch Outage
"we lost all connectivity to the switch. At that point, it failed to continue passing traffic. This should not have been a problem since compute nodes have a link to each switch. The compute nodes should have recognized that link in the aggregation as “down” and discontinued use of that link. However, our compute nodes continued sending traffic to that link. To verify the compute nodes didn’t incorrectly see the downed link as “up”, we physically disconnected the links to the degraded switch. Unfortunately, the compute nodes still attempted to send traffic over that link. "
postmortem  networks 
october 2015
Outage report: 5 September 2015 - PythonAnywhere News
"t looked like a massive earlier spike in read activity across the filesystem had put some system processes into a strange state. User file storage is shared from the file storage system over to the servers where people's websites and consoles actually run over NFS. And the storage itself on the NFS servers uses DRBD for replication. As far as we could determine, a proportion (at least half) of the NFS processes were hanging, and the DRBD processes were running inexplicably slowly. This was causing access to file storage to frequently be slow, and to occasionally fail, for all servers using the file storage."
postmortem 
october 2015
Why did Stack Overflow time out? - Meta Stack Exchange
"A cascading failure of the firewall to apply properly leading keepalived VRRP unable to communicate properly made both load balancers think neither had a peer. This results in a bit of swapping as they fight for ARP. When NY-LB06 "won" that fight, the second failure came into play: the firewall module did not finish updating on the first puppet run meaning the server was fully ready to serve traffic (from a Layer 7/HAProxy standpoint), but was not accepting TCP connections from anyone yet."
postmortem 
october 2015
SNworks Status - Connectivity Issues
"The cache server is super important. Without it, our web servers can serve 10s of requests a second. With it, they can serve 1000s of requests per second.

Large news events + uncached pages = servers not being able to handle traffic demands."
postmortem 
october 2015
Customer.io Status - Extended outage since 11:41 pm EDT
" After bringing down the cluster, we upgraded FoundationDB and attempted to bring the cluster back up.

An error prevented the cluster from returning to "operational"."
postmortem 
october 2015
Partial Photon Cloud outage on 04/30/2015 | Blog | Photon: Multiplayer Made Simple
"The root cause was an outage of Microsoft Azure’s Network Infrastructure"
postmortem 
october 2015
Simplero Status - Down right now
"The root cause was a disk that filled up on a secondary database server.

It happened while I was at an event at the Empire State Building. At first I tried fixing it via my iPhone, but I had to realize that I couldn't, and instead hop on a train back home to my laptop."
postmortem 
october 2015
Divshot Status - Serious Platform Outage
"This morning around 7:20am Pacific time, several platform EC2 instances began failing and our load balancer began returning 503 errors. Ordinarily our scaling configuration would terminate and replace unhealthy instances, but for an as-yet-undetermined reason all instances became unhealthy and were not replaced. This caused a widespread outage that lasted for nearly two hours."
postmortem 
october 2015
More Details on Today's Outage [2010]
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid."
postmortem 
october 2015
OHIO: Office of Information Technology |Anatomy of a network outage: Aug 20 - Sep 4, 2015
"If an IT system has a hidden weakness, fall opening will expose it.

This year, the 17,000 additional devices that students and staff brought with them to campus did just that, overwhelming our core network for over two weeks."
postmortem  networks 
october 2015
Outage postmortem (2015-10-08 UTC) : Stripe: Help & Support
"Our automated tooling actually filed it as two separate change requests: one to add a new database index and a second to remove the old database index. Both of these change requests were reflected in our dashboard for database operators, intermingled with many other alerts. The dashboard did not indicate that the deletion request depended on the successful completion of the addition request"
postmortem 
october 2015
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
"But, on Sunday morning, a portion of the metadata service responses exceeded the retrieval and transmission time allowed by storage servers. As a result, some of the storage servers were unable to obtain their membership data, and removed themselves from taking requests"

"With a larger size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers. We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
postmortem  aws 
september 2015
Postmortem for July 27 outage of the Manta service - Blog - Joyent
"Clients experienced very high latency, and ultimately received 500-level errors in response to about 22% of all types of requests, including PUT, GET, and DELETE for both objects and directories. At peak, the error rate approached 27% of all requests, and for most of the window the error rate varied between 19 and 23%."
postmortem 
august 2015
Travis CI Status - Elevated wait times and timeouts for OSX builds
"We noticed an error pointing towards build VM boot timeouts at 23:21 UTC on the 20th. After discussion with our infrastructure provider, it was shown to us that our SAN (a NetApp appliance) was being overloaded due to a spike in disk operations per second."
postmortem 
july 2015
CircleCI Status - DB performance issue
"The degradation in DB performance was a special kind of non-linear, going from "everything is fine" to "fully unresponsive" within 2 minutes. Symptoms included a long list of queued builds, and each query taking a massive amount of time to run, along side many queries timing out."

"CircleCI is written in Clojure, a form of Lisp. One of the major advantages of this type of language is that you can recompile code live at run-time. We typically use Immutable Architecture, in which we deploy pre-baked machine images to put out new code. This works well for keeping the system in a clean state, as part of a continuous delivery model. Unfortunately, when things are on fire, it doesn't allow us to move as quickly as we would like.

This is where Clojure's live patching comes in. By connecting directly to our production machines and connecting to the Clojure REPL, we can change code live and in production. We've built tooling over the last few years to automate this across the hundreds of machines we run at a particular time.

This has saved us a number of times in the past, turning many potential crises into mere blips. Here, this allowed us to disable queries, fix bugs and otherwise swap out and disable undesirable code when needed."
postmortem 
july 2015
NYSE Blames Trading Outage on Software Upgrade | Traders Magazine Online News
"On Tuesday evening, the NYSE began the rollout of a software release in preparation for the July 11 industry test of the upcoming SIP timestamp requirement. As is standard NYSE practice, the initial release was deployed on one trading unit. As customers began connecting after 7am on Wednesday morning, there were communication issues between customer gateways and the trading unit with the new release. It was determined that the NYSE and NYSE MKT customer gateways were not loaded with the proper configuration compatible with the new release."
postmortem 
july 2015
Elevated latency and error rate for Google Compute Engine API - Google Groups
"However, a software bug in the GCE control plane interacted
poorly with this change and caused API requests directed to us-central1-a
to be rejected starting at 03:21 PDT. Retries and timeouts from the failed
calls caused increased load on other API backends, resulting in higher
latency for all GCE API calls. The API issues were resolved when Google
engineers identified the control plane issue and corrected it at 04:59 PDT,
with the backlog fully cleared by 05:12 PDT. "
google  postmortem 
may 2015
Code Climate Status - Inaccurate Analysis Results
"Also on May 8th, we deployed instrumentation and logging to track when our cached Git blob data did not match the actual contents on disk. We found no further mismatches on new analyses, supporting the theory that the issue was ephemeral and no longer present.

Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls.

When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server"
postmortem 
may 2015
Stack Exchange Network Status — Outage Postmortem: January 6th, 2015
"With no way to get our main IP addresses accessible to most users, our options were to either fail over to our DR datacenter in read-only mode, or to enable CloudFlare - we’ve been testing using them for DDoS mitigation, and have separate ISP links in the NY datacenter which are dedicated to traffic from them.

We decided to turn on CloudFlare, which caused a different problem - caused by our past selves."
postmortem 
april 2015
DripStat — Post mortem of yesterday's outage
"1. RackSpace had an outage in their Northern Virginia region.
2. We were getting DDOS’d.
3. The hypervisor Rackspace deployed our cloud server on was running into issues and would keep killing our java process.
We were able to diagnose 2 and 3 only after Rackspace recovered from their long load balancer outage. The fact that all 3 happened at the same time did not help issues either."
postmortem 
april 2015
Blog - Tideways
"On Wednesday 6:05 Europe/Berlin time our Elasticsearch cluster went down when it ran OutOfMemory and file descriptors. One node of the cluster did not recover from this error anymore and the other responded to queries with failure.

The workers processing the performance and trace event log data with Beanstalk message queue stopped.
"
postmortem 
april 2015
CopperEgg Status - Probe widgets not polling data
"the primary of a redundant pair of data servers for one of our customer data clusters locked up hard in Amazon. An operations engineer responded to a pager alert and ensured failover had worked as designed; there was a brief period of probe delay on that cluster from the initial failover but service was only briefly interrupted and then the system was working fine.

The failed server had to be hard rebooted and when it was, its data was corrupted and the server had to be rebuilt, and was then set up to resync its data with the live server. A manual error was made during the rebuild and replication was set up in an infinite loop. "
postmortem 
april 2015
Freckle Time Tracking Status - Freckle is down
"The underlying reason why nginx didn't start was that DNS was not working properly—nginx checks SSL certificates and it couldn't resolve one of the hosts needed to verify our main SSL certificate. We don't know why DNS didn't resolve, but it's likely that to the large number of booted servers in the Rackspace datacenter there was a temporary problem with DNS resolution requests.)"
postmortem 
april 2015
Dead Man's Snitch — Postmortem: March 6th, 2015
"On Friday, March 6th we had a major outage caused by a loss of historical data. During the outage we failed to alert on missed snitch check-ins and sent a large number of erroneous failure alerts for healthy snitches. It took 8 hours to restore or reconstruct all missing data and get our systems stabilized."
postmortem 
april 2015
Travis CI Status - Slow .com build processing
"Two runaway TLS connections inside our primary RabbitMQ node that were causing high CPU usage on that node. Once this was found, we deemed the high channel count a red herring and instead started work on the stuck connections."
postmortem 
april 2015
Travis CI Status - Slow .com build processing
"We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start up cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why our current cluster wasn’t starting we’d perform emergency maintenance and spin up a new cluster."
postmortem 
april 2015
Balanced Partial Outage Post Mortem - 2015-03-15
Balanced experienced a partial outage that affected 25% of card processing transactions between 8:40AM and 9:42AM this morning due to a degraded machine which was not correctly removed from the load balancer.

The core of the issue was in our secure vault system, which handles storage and retrieval of sensitive card data. One of the machines stopped sending messages, which caused some requests to be queued up but not processed, yet our automated health checks did not flag the machine as unhealthy.
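The lesson is that liveness checks are not enough when a machine is accepting work but not completing it. A minimal sketch of a progress-based health check, with invented thresholds (not Balanced's implementation):

    import time

    def healthy(queue_depth, last_completed_ts, max_depth=100, max_idle_seconds=30):
        # Unhealthy if work is piling up *and* nothing has completed recently,
        # which is exactly the "queued but not processed" state described above.
        backlog = queue_depth > max_depth
        stalled = time.time() - last_completed_ts > max_idle_seconds
        return not (backlog and stalled)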
postmortem 
april 2015
Postmortem: Storify downtime on March 2nd (with image) · storifydev · Storify
"The problem was that we had one dropped index in our application code. This meant that whenever the new primary took the lead, the application asked to build that index. It was happening in the background, so it was kind of ok for the primary. But as soon as the primary finished, all the secondaries started building it in the foreground, which meant that our application couldn't reach MongoDB anymore."
postmortem 
march 2015
GCE instances are not reachable
"ROOT CAUSE [PRELIMINARY]

The internal software system which programs GCE’s virtual network for VM
egress traffic stopped issuing updated routing information. The cause of
this interruption is still under active investigation. Cached route
information provided a defense in depth against missing updates, but GCE VM
egress traffic started to be dropped as the cached routes expired. "
postmortem 
february 2015
FAQ about the recent FBI raid (Pinboard Blog)
"Why did the FBI take a Pinboard server?

I don't know. As best I can tell, the FBI was after someone else whose server was in physical proximity to ours. "
postmortem 
february 2015
A Note on Recent Downtime (Pinboard Blog)
"Of course I was wrong about that, and my web hosts pulled the plug early in the morning on the 2nd. Bookmarks and archives were not affected, but I neglected to do a final sync of notes (notes in pinboard are saved as files). This meant about 20 users who created or edited notes between December 31 and Jan 2 lost those notes."
postmortem 
february 2015
Recent Bounciness And When It Will Stop (Pinboard Blog)
"Over the past week there have been a number of outages, ranging in length from a few seconds to a couple of hours. Until recently, Pinboard has had a good track record of uptime, and like my users I find this turn of events distressing.

I'd like to share what I know so far about the problem, and what steps I'm taking to fix it."
postmortem 
february 2015
Outage This Morning (Pinboard Blog)
"The root cause of the outage appears to have been a disk error. The server entered a state where nothing could write to disk, crashing the database. We were able to reboot the server, but then had to wait a long time for it to repair the filesystem."
postmortem 
february 2015
Second Outage (Pinboard Blog)
"The main filesystem on our web server suddenly went into read-only mode, crashing the database. Once again I moved all services to the backup machine while the main server went through its long disk check."
postmortem 
february 2015
API Outage (Pinboard Blog)
"Pinboard servers came under DDOS attack today and the colocation facility (Datacate) has insisted on taking the affected IP addresses offline for 48 hours. In my mind, this accomplishes the goal of the denial of service attack, but I am just a simple web admin.

I've moved the main site to a secondary server and will do the same for the API in the morning (European time) when there's less chance of me screwing it up. Until then the API will be unreachable."
postmortem 
february 2015
A Bad Privacy Bug (Pinboard Blog)
"
tl;dr: because of poor input validation and a misdesigned schema, bookmarks could be saved in a way that made them look private to the ORM, but public to the database. Testing failed to catch the error because it was done from a non-standard account.

There are several changes I will make to prevent this class of problem from recurring:

Coerce all values to the expected types at the time they are saved to the database, rather than higher in the call stack.

Add assertions to the object loader so it complains to the error log if it sees unexpected values.

Add checks to the templating code to prevent public bookmarks showing up under any circumstances on certain public-only pages.

Run deployment tests from a non-privileged account."
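The first two fixes translate directly into code. A minimal sketch with an invented schema (not Pinboard's): coerce the privacy flag to exactly what the column stores at save time, and complain loudly when the loader sees anything else:

    import logging

    log = logging.getLogger("bookmarks")

    def save_bookmark(db, url, private):
        # Coerce at the database boundary: only 0/1 (or True/False) are acceptable.
        if private not in (0, 1, True, False):
            raise ValueError(f"unexpected private flag: {private!r}")
        db.execute("INSERT INTO bookmarks (url, private) VALUES (?, ?)", (url, int(private)))

    def load_bookmark(row):
        if row["private"] not in (0, 1):
            # Assertion in the object loader: surface bad values in the error log.
            log.error("unexpected private flag %r for %s", row["private"], row["url"])
        return {"url": row["url"], "private": bool(row["private"])}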
postmortem 
february 2015
Facebook & Instagram API servers down
Not much detail there.

Config change? Security?

See: https://blog.thousandeyes.com/facebook-outage-deep-dive/

Also: "Facebook Inc. on Tuesday denied being the victim of a hacking attack and said its site and photo-sharing app Instagram had suffered an outage after it introduced a configuration change."
postmortem 
february 2015
Final Root Cause Analysis and Improvement Areas: Nov 18 Azure Storage Service Interruption | Microsoft Azure Blog
"1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.
[...]
2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."
postmortem 
december 2014
Incident Report - DDoS Attack - DNSimple Blog
"A new customer signed up for our service and brought in multiple domains that were already facing a DDoS attack. The customer had already tried at least 2 other providers before DNSimple. Once the domains were delegated to us, we began receiving the traffic from the DDoS.

DNSimple was not the target of the attack, nor were any of our other customers.

The volume of the attack was approximately 25gb/s sustained traffic across our networks, with around 50 million packets per second. In this case, the traffic was sufficient enough to overwhelm the 4 DDoS devices we had placed in our data centers after a previous attack (there is also a 5th device, but it was not yet online in our network)."
postmortem 
december 2014
craigslist DNS Outage | craigslist blog
"At approximately 5pm PST Sunday evening the craigslist domain name service (DNS) records maintained at one of our domain registrars were compromised, diverting users to various non-craigslist sites.

This issue has been corrected at the source, but many internet service providers (ISPs) cached the false DNS information for several hours, and some may still have incorrect information."
postmortem 
november 2014
Update on Azure Storage Service Interruption | Microsoft Azure Blog
" Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting."
postmortem 
november 2014
Anatomy of a Crushing (Pinboard Blog)
"The bad news was that it had never occurred to me to test the database under write load.

Now, I can see the beardos out there shaking their heads. But in my defense, heavy write loads seemed like the last thing Pinboard would ever face. It was my experience that people approached an online purchase of six dollars with the same deliberation and thoughtfulness they might bring to bear when buying a new car. Prospective users would hand-wring for weeks on Twitter and send us closely-worded, punctilious lists of questions before creating an account.

The idea that we might someday have to worry about write throughput never occurred to me. If it had, I would have thought it a symptom of nascent megalomania. "
postmortem 
november 2014
The network nightmare that ate my week
"I have come to the conclusion that so much in IPv6 design and implementation has been botched by protocol designers and vendors (both ours and others) that it is simply unsafe to run IPv6 on a production network except in very limited geographical circumstances and with very tight central administration of hosts."
postmortem 
november 2014
Inherent Complexity of the Cloud: VS Online Outage Postmortem
"it appears that the outage is at least due in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated.  Adding to the confusion (and possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”"
postmortem 
november 2014
Stack Exchange Network Status — Outage Post-Mortem: August 25th, 2014
"a misleading comment in the iptables configuration led us to make a harmful change. The change had the effect of preventing the HAProxy systems from being able to complete a connection to our IIS web servers - the response traffic for those connections (the SYN/ACK packet) was suddenly being blocked."
postmortem 
november 2014
Morgue: Helping Better Understand Events by Building a Post Mortem Tool - Bethany Macri on Vimeo
"My talk will be about why myself and another engineer built an internal post mortem tool called Morgue and the effect that the tool has had on our organization. Morgue formalized and systematized the way [my company] as a whole runs post mortems by focusing both the leader and the attendees of the post mortem on the most important aspects of resolving and understanding the event in a consistent way. In addition, the tool has facilitated relations between Ops and Engineers by increasing the awareness of Ops’ involvement in an outage and also by making all of the post mortems easily available to anyone in the organization. Lastly, all of our developers have access to the Morgue repository and have continued to develop features for the tool as improvements for conducting a post mortem have been suggested."
postmortem 
november 2014