jm + azure   8

10-hour Microsoft Azure outage in Europe
Service availability issue in North Europe

Summary of impact: From 17:44 on 19 Jun 2018 to 04:30 UTC on 20 Jun 2018 customers using Azure services in North Europe may have experienced connection failures when attempting to access resources hosted in the region. Customers leveraging a subset of Azure services may have experienced residual impact for a sustained period post-mitigation of the underlying issue. We are communicating with these customers directly in their Management Portal.

Preliminary root cause: Engineers identified that an underlying temperature issue in one of the datacenters in the region triggered an infrastructure alert, which in turn caused a structured shutdown of a subset of Storage and Network devices in this location to ensure hardware and data integrity.

Mitigation: Engineers addressed the temperature issue, and performed a structured recovery of the affected devices and the affected downstream services.


The specific services were: 'Virtual Machines, Storage, SQL Database, Key Vault, App Service, Site Recovery, Automation, Service Bus, Event Hubs, Data Factory, Backup, API Management, Log Analytics, Application Insights, Azure Batch, Azure Search, Redis Cache, Media Services, IoT Hub, Stream Analytics, Power BI, Azure Monitor, Azure Cosmos DB or Logic Apps in North Europe'. Holy cow
microsoft  outages  fail  azure  post-mortems  cooling-systems  datacenters 
27 days ago by jm
Cloudy Gamer: Playing Overwatch on Azure's new monster GPU instances
pretty amazing. full 60FPS, 2560x1600, everything on Epic quality, streaming from Azure, for $2 per hour
gaming  azure  games  cloud  gpu  overwatch  streaming 
october 2016 by jm
Update on Azure Storage Service Interruption
As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables. We typically call this “flighting,” as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.


I'm really surprised MS deployment procedures allow a change to be rolled out globally across multiple regions on a single day. I suspect they soon won't.
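The "flighting" process described above is what's now usually called a canary or staged rollout: apply the change to a small scope first, check health, and only then widen the blast radius. A minimal sketch of that gating logic (all names hypothetical, not Microsoft's actual pipeline):

```python
# Sketch of staged ("flighted") rollout gating: deploy to one scope
# at a time, smallest blast radius first, and halt the rollout as
# soon as a health check fails.

def staged_rollout(scopes, deploy, healthy):
    """Deploy to each scope in order; abort on the first failure.

    scopes  -- ordered list of deployment scopes
    deploy  -- callable applying the update to one scope
    healthy -- callable returning True if the scope looks good
    """
    completed = []
    for scope in scopes:
        deploy(scope)
        if not healthy(scope):
            return completed, scope  # halt here; this scope needs rollback
        completed.append(scope)
    return completed, None

# In the incident above, the bug only manifested in blob front ends,
# so a flight covering only Azure Tables could not catch it. A flight
# scope per service would have stopped the rollout earlier:
scopes = ["flight-tables", "flight-blobs", "region-1", "region-2"]
ok, failed = staged_rollout(
    scopes,
    deploy=lambda s: None,                  # stand-in for a real deploy
    healthy=lambda s: s != "flight-blobs",  # simulate the blob bug
)
# ok == ["flight-tables"], failed == "flight-blobs"
```

The design point is simply that a flight is only as good as its coverage: every code path being changed needs a flighted scope that exercises it.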
change-management  cm  microsoft  outages  postmortems  azure  deployment  multi-region  flighting  azure-storage 
november 2014 by jm
Microsoft Azure 9-hour outage
'From 19 Nov, 2014 00:52 to 05:50 UTC a subset of customers using Storage, Virtual Machines, SQL Geo-Restore, SQL Import/export, Websites, Azure Search, Azure Cache, Management Portal, Service Bus, Event Hubs, Visual Studio, Machine Learning, HDInsights, Automation, Virtual Network, Stream Analytics, Active Directory, StorSimple and Azure Backup Services in West US and West Europe experienced connectivity issues. This incident has now been mitigated.'

There was knock-on impact until 11:00 UTC (storage in N Europe), 11:45 UTC (websites, West Europe), and 09:15 UTC (storage, West Europe), from the looks of things. Should be an interesting postmortem.
outages  azure  microsoft  ops 
november 2014 by jm
Adrian Cockroft's Cloud Outage Reports Collection
The detailed summaries of outages from cloud vendors are comprehensive and the response to each highlights many lessons in how to build robust distributed systems. For outages that significantly affected Netflix, the Netflix techblog report gives insight into how to effectively build reliable services on top of AWS. [....] I plan to collect reports here over time, and welcome links to other write-ups of outages and how to survive them.
outages  post-mortems  documentation  ops  aws  ec2  amazon  google  dropbox  microsoft  azure  incident-response 
march 2014 by jm
Microsoft's Azure Feb 29th, 2012 outage postmortem
'The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.' This caused cascading failures throughout the fleet. Ouch -- should have been spotted during code review
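The bug class is easy to reproduce: naively bumping the year on Feb 29 produces a date that doesn't exist. A minimal sketch (function names mine, not Azure's):

```python
from datetime import date

def valid_to_naive(current: date) -> date:
    # The buggy approach from the postmortem: just add one to the year.
    # On 2012-02-29 this asks for 2013-02-29, which does not exist,
    # so date.replace raises ValueError.
    return current.replace(year=current.year + 1)

def valid_to_safe(current: date) -> date:
    # One safe policy: fall back to Feb 28 when the bumped date is
    # invalid (clamping to Mar 1 would be equally defensible).
    try:
        return current.replace(year=current.year + 1)
    except ValueError:
        return current.replace(year=current.year + 1, day=28)

valid_to_safe(date(2012, 2, 29))   # -> date(2013, 2, 28)
# valid_to_naive(date(2012, 2, 29)) raises ValueError
```

The failure surfaces only one day every four years, which is exactly why it sailed through testing and review.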
azure  dev  dates  leap-years  via:fanf  microsoft  outages  post-mortem  analysis  failure 
march 2012 by jm
CloudSplit – Real Time Cloud Analytics
interesting idea from Joe -- track your cloud-hosting spend in real-time
cloudsplit  hosting  amazon  ec2  azure  joe-drumgoole  analytics  real-time  from delicious
september 2009 by jm
