jm + uptime   10

Sorry
hosted status page / downtime banner service
banners  web  status  uptime  downtime  ops  reliability 
4 hours ago by jm
A server with 24 years of uptime
wow. Stratus fault-tolerant systems ftw.

'This is a fault tolerant server, which means that hardware components are redundant. Over the years, disk drives, power supplies and some other components have been replaced but Hogan estimates that close to 80% of the system is original.'

(via internetofshit, which this isn't)
stratus  fault-tolerance  hardware  uptime  records  ops 
january 2017 by jm
Is Pokémon GO down or not?
(a) 83.5% uptime over 24 hours. GOOD JOB

(b) excellent marketing by Datadog!
datadog  games  monitoring  pokemon-go  pokemon  uptime 
july 2016 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.


This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology,
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
A collection of postmortems
A well-maintained list with a potted description of each one (via HN)
postmortems  ops  uptime  reliability 
august 2015 by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
april 2015 by jm
Yelp Product & Engineering Blog | True Zero Downtime HAProxy Reloads
Using tc and qdisc to delay SYNs while haproxy restarts. Definitely feels like on-host NAT between 2 haproxy processes would be cleaner and easier though!
linux  networking  hacks  yelp  haproxy  uptime  reliability  tcp  tc  qdisc  ops 
april 2015 by jm
huptime
Nice trick -- wrap servers with a libc wrapper to intercept bind(2) and accept(2) calls, so that transparent restarts becode possible
linux  ops  servers  uptime  restarting  libc  bind  accept  sockets 
january 2015 by jm
StatusPage.io
'Hosted Status Pages for Your Company'. We use these guys in $work, and their service is fantastic -- it's a line of javascript in the page template which will easily allow you to add a "service degraded" banner when things go pear-shaped, along with an external status site for when things get really messy. They've done a good clean job.
monitoring  server  status  outages  uptime  saas  infrastructure 
november 2014 by jm
'If I was your cloud provider, I'd never let you down'
This is the thing that's put me off Joyent. They make claims like this one from October 2012:
We’ve given our other partners 99.9999% uptime.


This despite a 10-day outage of their BingoDisk and Strongspace storage services in January 2008, 1734 days previously (http://www.datacenterknowledge.com/archives/2008/01/21/joyent-services-back-after-8-day-outage/).

If you assume that is the only outage they've had since then, that works out as 99.4% uptime. Quite a few less nines...
joyent  marketing  uptime  two-nines  fail  strongdisk 
june 2013 by jm

Copy this bookmark:



description:


tags: