jm + reliability   61

Air Canada near-miss: Air traffic controllers make split-second decisions in a culture of "psychological safety" — Quartz
“’Just culture’ as a term emerged from air traffic control in the late 1990s, as concern was mounting that air traffic controllers were unfairly cited or prosecuted for incidents that happened to them while they were on the job,” Sidney Dekker, a professor, writer, and director of the Safety Science Innovation Lab at Griffith University in Australia, explains to Quartz in an email. Eurocontrol, the intergovernmental organization that focuses on the safety of airspace across Europe, has “adopted a harmonized ‘just culture’ that it encourages all member countries and others to apply to their air traffic control organizations.”

[...] One tragic example of what can happen when companies don’t create a culture where employees feel empowered to raise questions or admit mistakes came to light in 2014, when an investigation into a faulty ignition switch that caused more than 100 deaths at GM Motors revealed a toxic culture of denying errors and deflecting blame within the firm. The problem was later attributed to one engineer who had not disclosed an obvious issue with the flawed switch, but many employees spoke of extreme pressure to put costs and delivery times before all other considerations, and to hide large and small concerns.

(via JG)
just-culture  atc  air-traffic-control  management  post-mortems  outages  reliability  air-canada  disasters  accidents  learning  psychological-safety  work 
17 days ago by jm
terrible review for Solidity as a programming environment in HN
"Solidity/EVM is by far the worst programming environment I have ever encountered. It would be impossible to write even toy programs correctly in this language, yet it is literally called "Solidity" and used to program a financial system that manages hundreds of millions of dollars."

Via Tony Finch
blockchain  ethereum  programming  coding  via:fanf  funny  fail  floating-point  money  json  languages  bugs  reliability 
25 days ago by jm
Hyrum's Law
"With a sufficient number of users of an API, it doesn't matter what you promised in the contract, all observable behaviours of your interface will be depended upon by somebody."
laws  funny  apis  reliability  hyrum-wright  hyrums-law  compatibilty  interfaces 
4 weeks ago by jm
hosted status page / downtime banner service
banners  web  status  uptime  downtime  ops  reliability 
12 weeks ago by jm
Fireside Chat with Vint Cerf & Marc Andreessen (Google Cloud Next '17) - YouTube
In which Vint Cerf calls for regulatory oversight of software engineering. "It's a serious issue now"
vint-cerf  gcp  regulation  oversight  politics  law  reliability  systems 
may 2017 by jm
"I caused an outage" thread on twitter
Anil Dash: "What was the first time you took the website down or broke the build? I’m thinking of all the inadvertent downtime that comes with shipping."

Sample response: 'Pushed a fatal error in lib/display.php to all of FB’s production servers one Friday night in late 2005. Site loaded blank pages for 20min.'
outages  reliability  twitter  downtime  fail  ops  post-mortem 
march 2017 by jm
Fault Domains and the Vegas Rule | Expedia Engineering Blog
I like this concept -- analogous to AWS' AZs -- limit blast radius of an outage by explicitly defining dependency scopes
aws  az  fault-domains  vegas-rule  blast-radius  outages  reliability  architecture 
february 2017 by jm
Kelsey Hightower - healthz: Stop reverse engineering applications and start monitoring from the inside [video]
his Monitorama 2016 talk, talking about the "deep health checks" concept (which I implemented at Swrve earlier this year ;)
monitorama  health  deep-health-checks  healthz  testing  availability  reliability 
july 2016 by jm
Cross-Region Read Replicas for Amazon Aurora
Creating a read replica in another region also creates an Aurora cluster in the region. This cluster can contain up to 15 more read replicas, with very low replication lag (typically less than 20 ms) within the region (between regions, latency will vary based on the distance between the source and target). You can use this model to duplicate your cluster and read replica setup across regions for disaster recovery. In the event of a regional disruption, you can promote the cross-region replica to be the master. This will allow you to minimize downtime for your cross-region application. This feature applies to unencrypted Aurora clusters.
aws  mysql  databases  storage  replication  cross-region  failover  reliability  aurora 
june 2016 by jm
TIL: clock skew exists
good roundup of real-world clock skew links
clocks  clock-skew  ntp  realtime  time  bugs  distcomp  reliability  skew 
february 2016 by jm
How Completely Messed Up Practices Become Normal
on Normalization of Deviance, with a few anecdotes from Silicon Valley. “The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization.”
normalization-of-deviance  deviance  bugs  culture  ops  reliability  work  workplaces  processes  norms 
december 2015 by jm
Files Are Hard
This is basically terrifying. A catalog of race conditions and reliability horrors around the POSIX filesystem abstraction in Linux -- it's a wonder anything works.

'Where’s this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation, and they generously spend a lot of time responding to basic questions, but it’s hard for outsiders to troll [sic] through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted! I don’t mean to pick on filesystem devs. In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except devs respond with “filesystems don’t do that” instead of “networks don’t do that”.I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if I don’t have at least one bug in this post.'
filesystems  linux  unix  files  operating-systems  posix  fsync  osdi  papers  reliability 
december 2015 by jm
wangle/Codel.h at master · facebook/wangle
Facebook's open-source implementation of the CoDel queue management algorithm applied to server request-handling capacity in their C++ service bootstrap library, Wangle.
wangle  facebook  codel  services  capacity  reliability  queueing 
november 2015 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.

This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology,
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
Elasticsearch and data loss
"@alexbfree @ThijsFeryn [ElasticSearch is] fine as long as data loss is acceptable. . We lose ~1% of all writes on average."
elasticsearch  data-loss  reliability  data  search  aphyr  jepsen  testing  distributed-systems  ops 
october 2015 by jm
A collection of postmortems
A well-maintained list with a potted description of each one (via HN)
postmortems  ops  uptime  reliability 
august 2015 by jm
Hystrix-style Circuit Breakers and Bulkheads for Ruby/Rails, from Shopify
circuit-breaker  bulkhead  patterns  architecture  microservices  shopify  rails  ruby  networking  reliability  fallback  fail-fast 
june 2015 by jm
Please stop calling databases CP or AP
In his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).

I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap  databases  storage  distcomp  ca  ap  cp  zookeeper  consistency  reliability  networking 
may 2015 by jm
Call me maybe: Aerospike
'Aerospike offers phenomenal latencies and throughput -- but in terms of data safety, its strongest guarantees are similar to Cassandra or Riak in Last-Write-Wins mode. It may be a safe store for immutable data, but updates to a record can be silently discarded in the event of network disruption. Because Aerospike’s timeouts are so aggressive–on the order of milliseconds -- even small network hiccups are sufficient to trigger data loss. If you are an Aerospike user, you should not expect “immediate”, “read-committed”, or “ACID consistency”; their marketing material quietly assumes you have a magical network, and I assure you this is not the case. It’s certainly not true in cloud environments, and even well-managed physical datacenters can experience horrible network failures.'
aerospike  outages  cap  testing  jepsen  aphyr  databases  storage  reliability 
may 2015 by jm
Call me maybe: Elasticsearch 1.5.0
tl;dr: Elasticsearch still hoses data integrity on partition, badly
elasticsearch  reliability  data  storage  safety  jepsen  testing  aphyr  partition  network-partitions  cap 
may 2015 by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
april 2015 by jm
Making Pinterest — Learn to stop using shiny new things and love MySQL
'The third reason people go for shiny is because older tech isn’t advertised as aggressively as newer tech. The younger companies needs to differentiate from the old guard and be bolder, more passionate and promise to fulfill your wildest dreams. But most new tech sales pitches aren’t generally forthright about their many failure modes. In our early days, we fell into this third trap. We had a lot of growing pains as we scaled the architecture. The most vocal and excited database companies kept coming to us saying they’d solve all of our scalability problems. But nobody told us of the virtues of MySQL, probably because MySQL just works, and people know about it.'

It's true! -- I'm still a happy MySQL user for some use cases, particularly read-mostly relational configuration data...
mysql  storage  databases  reliability  pinterest  architecture 
april 2015 by jm
Large-scale cluster management at Google with Borg
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

(via Conall)
via:conall  clustering  google  papers  scale  to-read  borg  cluster-management  deployment  packing  reliability  redundancy 
april 2015 by jm
Yelp Product & Engineering Blog | True Zero Downtime HAProxy Reloads
Using tc and qdisc to delay SYNs while haproxy restarts. Definitely feels like on-host NAT between 2 haproxy processes would be cleaner and easier though!
linux  networking  hacks  yelp  haproxy  uptime  reliability  tcp  tc  qdisc  ops 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
On Ruby
The horrors of monkey-patching:
I call out the Honeybadger gem specifically because was the most recent time I'd been bit by a seemingly good thing promoted in the community: monkey patching third party code. Now I don't fault Honeybadger for making their product this way. It provides their customers with direct business value: "just require 'honeybadger' and you're done!" I don't agree with this sort of practice. [....]

I distrust everything [in Ruby] but a small set of libraries I've personally vetted or are authored by people I respect. Why is this important? Without a certain level of scrutiny you will introduce odd and hard to reproduce bugs. This is especially important because Ruby offers you absolutely zero guarantee whatever the state your program is when a given method is dispatched. Constants are not constants. Methods can be redefined at run time. Someone could have written a time sensitive monkey patch to randomly undefined methods from anything in ObjectSpace because they can. This example is so horribly bad that no one should every do, but the programming language allows this. Much worse, this code be arbitrarily inject by some transitive dependency (do you even know what yours are?).
ruby  monkey-patching  coding  reliability  bugs  dependencies  libraries  honeybadger  sinatra 
april 2015 by jm
Reliable Cron across the Planet - ACM Queue
How Google (hi Niall!) built their internal "distributed cron" service, using a Paxos-driven master election process at its core. I've been looking for a distributed cron for donkey's years, I wish someone would write a decent open source one....
distributed-systems  cron  acm  paxos  distributed-cron  master-election  distcomp  reliability 
march 2015 by jm
demonstration of the importance of server-side request timeouts
from MongoDB, but similar issues often apply in many other TCP/HTTP-based systems
tcp  http  requests  timeout  mongodb  reliability  safety 
march 2015 by jm
Exponential Backoff And Jitter
Great go-to explainer blog post for this key distributed-systems reliability concept, from the always-solid Marc Brooker
marc-brooker  distsys  networking  backoff  exponential  jitter  retrying  retries  reliability  occ 
march 2015 by jm
Services Engineering Reading List
good list of papers/articles for fans of scalability etc.
architecture  papers  reading  reliability  scalability  articles  to-read 
march 2015 by jm
Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth | USENIX
Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data-replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction the number of I/Os consumed during reconstructions (a 5 reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
erasure-coding  reed-solomon  compression  reliability  reconstruction  replication  fault-tolerance  storage  bandwidth  usenix  papers 
february 2015 by jm
Two recent systemd crashes
Hey look, PID 1 segfaulting! I haven't seen that happen since we managed to corrupt /bin/sh on Ultrix in 1992. Nice work Fedora
fedora  reliability  unix  linux  systemd  ops  bugs 
december 2014 by jm
If Eventual Consistency Seems Hard, Wait Till You Try MVCC
ex-Percona MySQL wizard Baron Schwartz, noting that MVCC as implemented in common SQL databases is not all that simple or reliable compared to big bad NoSQL Eventual Consistency:
Since I am not ready to assert that there’s a distributed system I know to be better and simpler than eventually consistent datastores, and since I certainly know that InnoDB’s MVCC implementation is full of complexities, for right now I am probably in the same position most of my readers are: the two viable choices seem to be single-node MVCC and multi-node eventual consistency. And I don’t think MVCC is the simpler paradigm of the two.
nosql  concurrency  databases  mysql  riak  voldemort  eventual-consistency  reliability  storage  baron-schwartz  mvcc  innodb  postgresql 
december 2014 by jm
Exactly-Once Delivery May Not Be What You Want
An extremely good explanation from Marc Brooker that exactly-once delivery in a distributed system is very hard.
And so on. There's always a place to slot in one more turtle. The bad news is that I'm not aware of a nice solution to the general problem for all side effects, and I suspect that no such solution exists. On the bright side, there are some very nice solutions that work really well in practice. The simplest is idempotence. This is a very simple idea: we make the tasks have the same effect no matter how many times they are executed.
architecture  messaging  queues  exactly-once-delivery  reliability  fault-tolerance  distcomp  marc-brooker 
november 2014 by jm
DynamoDB Streams
This is pretty awesome. All changes to a DynamoDB table can be streamed to a Kinesis stream, MySQL-replication-style.

The nice bit is that it has a solid way to ensure readers won't get overwhelmed by the stream volume (since ddb tables are IOPS-rate-limited), and Kinesis has a solid way to read missed updates (since it's a Kafka-style windowed persistent stream). With this you have a pretty reliable way to ensure you're not going to suffer data loss.
iops  dynamodb  aws  kinesis  reliability  replication  multi-az  multi-region  failover  streaming  kafka 
november 2014 by jm
Brownout: building more robust cloud applications
Applications can saturate – i.e. become unable to serve users in a timely manner. Some users may experience high latencies, while others may not receive any service at all. The authors argue that it is better to downgrade the user experience and continue serving a larger number of clients with reasonable latency.

"We define a cloud application as brownout compliant if it can gradually downgrade user experience to avoid saturation."

This is actually very reminiscent of circuit breakers, as described in Nygard’s ‘Release It!’ and popularized by Netflix. If you’re already designing with circuit breakers, you’ve probably got all the pieces you need to add brownout support to your application relatively easily.

"Our work borrows from the concept of brownout in electrical grids. Brownouts are an intentional voltage drop often used to prevent blackouts through load reduction in case of emergency. In such a situation, incandescent light bulbs dim, hence originating the term."
"To lower the maintenance effort, brownouts should be automatically triggered. This enables cloud applications to rapidly and robustly avoid saturation due to unexpected environmental changes, lowering the burden on human operators."

This is really similar to the Circuit Breaker pattern -- in fact it feels to me like a variation on that, driven by measured latencies of operations/requests.

See also .
circuit-breaker  patterns  brownout  robustness  reliability  load  latencies  degradation 
october 2014 by jm
Mnesia and CAP
A common “trick” is to claim:

'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'

This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. I tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap  erlang  mnesia  databases  storage  distcomp  reliability  ca  postgres  partitions 
october 2014 by jm
Understanding weak isolation is a serious problem
Peter Bailis complaining about the horrors of modern transactional databases and their unserializability, which noone seems to be paying attention to:

'As you’re probably aware, there’s an ongoing and often lively debate between transactional adherents and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional inherents and the research community as a whole have effectively ignored weak isolation — even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.'

'Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, why isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?)'
acid  consistency  databases  peter-bailis  transactional  corruption  serializability  isolation  reliability 
september 2014 by jm
"Perspectives On The CAP Theorem" [pdf]
"We cannot achieve [CAP theorem] consistency and availability in a partition-prone network."
papers  cap  distcomp  cap-theorem  consistency  availability  partitions  network  reliability 
september 2014 by jm
Call me maybe: Elasticsearch
Wow, these are terrible results. From the sounds of it, ES just cannot deal with realistic outage scenarios and is liable to suffer catastrophic damage in reasonably-common partitions.
If you are an Elasticsearch user (as I am): good luck. Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present. If you can, store your data in a safer database, and feed it into Elasticsearch gradually. Have processes in place that continually traverse the system of record, so you can recover from ES data loss automatically.
elasticsearch  ops  storage  databases  jepsen  partition  network  outages  reliability 
june 2014 by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
june 2014 by jm
Resiliency And Elasticsearch
Blog post from the ES team. They use "evil tests" -- basically unit/system tests, particularly using randomized error-injecting mock infrastructure. Good practices; I've done the same myself quite recently for Swrve's realtime infrastructure
elasticsearch  resiliency  network-partitions  reliability  testing  mocking  error-injection 
april 2014 by jm
Good explanation of exponential backoff
I've often had to explain this key feature verbosely, and it's hard to do without handwaving. Great to have a solid, well-explained URL to point to
exponential-backoff  backoff  retries  reliability  web-services  http  networking  internet  coding  design 
march 2014 by jm
ZooKeeper Resilience at Pinterest
essentially decoupling the client services from ZK using a local daemon on each client host; very similar to Airbnb's Smartstack. This is a bit of an indictment of ZK's usability though
ops  architecture  clustering  network  partitions  cap  reliability  smartstack  airbnb  pinterest  zookeeper 
march 2014 by jm
"Understanding the Robustness of SSDs under Power Fault", FAST '13 [paper]
Horrific. SSDs (including "enterprise-class storage") storing sync'd writes in volatile RAM while claiming they were synced; one device losing 72.6GB, 30% of its data, after 8 injected power faults; and all SSDs tested displayed serious errors including random bit errors, metadata corruption, serialization errors and shorn writes. Don't trust lone unreplicated, unbacked-up SSDs!
pdf  papers  ssd  storage  reliability  safety  hardware  ops  usenix  serialization  shorn-writes  bit-errors  corruption  fsync 
january 2014 by jm
BitCoin exchange CoinBase uses MongoDB as their 'primary datastore'
'Coinbase uses MongoDB for their primary datastore for their web app, api requests, etc.'
coinbase  mongodb  reliability  hn  via:aphyr  ops  banking  bitcoin 
december 2013 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In you're fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Call me maybe: Kafka
Aphyr takes a look at Kafka 0.8's replication with the Jepsen test suite. It doesn't go great. Jay Kreps responds here:
jay-kreps  kafka  replication  distributed-systems  distcomp  networking  reliability  fault-tolerance  jepsen 
september 2013 by jm
_Availability in Globally Distributed Storage Systems_ [pdf]
empirical BigTable and GFS failure numbers from Google are orders of magnitude higher than naïve independent-failure models. (via kragen)
via:kragen  failure  bigtable  gfs  statistics  outages  reliability 
september 2013 by jm
Getting Real About Distributed System Reliability
I have come around to the view that the real core difficulty of [distributed] systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters. Worse yet, you can’t easily buy good operations in the same way you can buy good software—you might be able to hire good people (if you can find them) but this is more than just people; it is practices, monitoring systems, configuration management, etc.
reliability  nosql  distributed-systems  jay-kreps  ops 
september 2013 by jm
'Copysets: Reducing the Frequency of Data Loss in Cloud Storage' [paper]
An improved replica-selection algorithm for replicated storage systems.

We present Copyset Replication, a novel general purpose replication technique that significantly reduces the frequency of data loss events. We implemented and evaluated Copyset Replication on two open source data center storage systems, HDFS and RAMCloud, and show it incurs a low overhead on all operations. Such systems require that each node’s data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near optimal tradeoff between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook’s HDFS cluster, it reduces the probability from 22.8% to 0.78%.
storage  cloud-storage  replication  data  reliability  fault-tolerance  copysets  replicas  data-loss 
july 2013 by jm
Stability Patterns and Antipatterns [slides]
Michael "Release It!" Nygard's slides from a recent O'Reilly event, discussing large-scale service reliability design patterns
michael-nygard  design-patterns  architecture  systems  networking  reliability  soa  slides  pdf 
may 2013 by jm
Excel, untestability, and the reliability of quants
Wow, this is a great software-quality story -- I knew Excel was the most widely used programming environment out there, but this is a factor I'd overlooked:

In his remarks on the final panel, Frank Partnoy mentioned something I missed when it came out a few weeks ago: the role of Microsoft Excel in the “London Whale” trading debacle. [..] To summarize: JPMorgan’s Chief Investment Office needed a new value-at-risk (VaR) model for the synthetic credit portfolio (the one that blew up) and assigned a quantitative whiz [...] to create it. The new model “operated through a series of Excel spreadsheets, which had to be completed manually, by a process of copying and pasting data from one spreadsheet to another.” The internal Model Review Group identified this problem as well as a few others, but approved the model, while saying that it should be automated and another significant flaw should be fixed. After the London Whale trade blew up, the Model Review Group discovered that the model had not been automated and found several other errors. Most spectacularly, “After subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR ...”

I write periodically about the perils of bad software in the business world in general and the financial industry in particular, by which I usually mean back-end enterprise software that is poorly designed, insufficiently tested, and dangerously error-prone. But this is something different. [...] While Excel the program is reasonably robust, the spreadsheets that people create with Excel are incredibly fragile. There is no way to trace where your data come from, there’s no audit trail (so you can overtype numbers and not know it), and there’s no easy way to test spreadsheets, for starters. The biggest problem is that anyone can create Excel spreadsheets -- badly. Because it’s so easy to use, the creation of even important spreadsheets is not restricted to people who understand programming and do it in a methodical, well-documented way.

This is why the JPMorgan VaR model is the rule, not the exception: manual data entry, manual copy-and-paste, and formula errors. This is another important reason why you should pause whenever you hear that banks’ quantitative experts are smarter than Einstein, or that sophisticated risk management technology can protect banks from blowing up. At the end of the day, it’s all software. While all software breaks occasionally, Excel spreadsheets break all the time. But they don’t tell you when they break: they just give you the wrong number.
excel  reliability  software  coding  ides  jpmorgan  value-at-risk  finance  london-whale  quants  spreadsheets  unit-tests  testability  testing 
april 2013 by jm
JPL Institutional Coding Standard for the Java Programming Language
From JPL's Laboratory for Reliable Software (LaRS). Great reference; there's some really useful recommendations here, and good explanations of familiar ones like "prefer composition over inheritance". Many are supported by FindBugs, too.

Here's the full list:

compile with checks turned on;
apply static analysis;
document public elements;
write unit tests;
use the standard naming conventions;
do not override field or class names;
make imports explicit;
do not have cyclic package and class dependencies;
obey the contract for equals();
define both equals() and hashCode();
define equals when adding fields;
define equals with parameter type Object;
do not use finalizers;
do not implement the Cloneable interface;
do not call nonfinal methods in constructors;
select composition over inheritance;
make fields private;
do not use static mutable fields;
declare immutable fields final;
initialize fields before use;
use assertions;
use annotations;
restrict method overloading;
do not assign to parameters;
do not return null arrays or collections;
do not call System.exit;
have one concept per line;
use braces in control structures;
do not have empty blocks;
use breaks in switch statements;
end switch statements with default;
terminate if-else-if with else;
restrict side effects in expressions;
use named constants for non-trivial literals;
make operator precedence explicit;
do not use reference equality;
use only short-circuit logic operators;
do not use octal values;
do not use floating point equality;
use one result type in conditional expressions;
do not use string concatenation operator in loops;
do not drop exceptions;
do not abruptly exit a finally block;
use generics;
use interfaces as types when available;
use primitive types;
do not remove literals from collections;
restrict numeric conversions;
program against data races;
program against deadlocks;
do not rely on the scheduler for synchronization;
wait and notify safely;
reduce code complexity
nasa  java  reference  guidelines  coding-standards  jpl  reliability  software  coding  oo  concurrency  findbugs  bugs 
march 2013 by jm
Advanced PostMortem Fu and Human Error 101 (Velocity 2011)
John Allspaw's previous slides on Etsy's operations culture -- this'll be old hat to Amazon staff of course ;)
etsy  devops  engineering  operations  reliability  mttd  mttr  postmortems 
march 2012 by jm
iPhone 3GS GPS suddenly stops working? here's the fix
via a forum on MacRumors -- blow away the locationd cache. Worked perfectly for me after my GPS crapped out halfway through my holidays :( Requires that the phone be jailbroken first
iphone  gps  software  3gs  reliability  bugs  macrumors  jailbreaking  locationd  from delicious
may 2010 by jm
Post-mortem for February 24th, 2010 outage - Google App Engine
extremely detailed; power outage in the primary DC resulted in a degraded fleet, and on-calls didn't have up-to-date on-call docs to respond correctly
google  gae  appengine  outages  post-mortems  multi-dc  reliability  distcomp  fleets  on-call  from delicious
march 2010 by jm

related tags

3gs  accidents  acid  acm  advice  aerospike  air-canada  air-traffic-control  airbnb  algorithms  allspaw  ap  aphyr  apis  appengine  architecture  articles  aspirations  atc  aurora  availability  aws  az  backoff  bandwidth  banking  banners  baron-schwartz  ben-maurer  bigtable  bit-errors  bitcoin  blast-radius  blockchain  borg  brownout  bugs  bulkhead  ca  cap  cap-theorem  capacity  cassandra  change-management  charts  checklists  circuit-breaker  circuit-breakers  clock-skew  clocks  cloud-storage  cluster-management  clustering  codeascraft  codel  coding  coding-standards  coinbase  compatibilty  compression  concurrency  configuration  consistency  copysets  corruption  counters  cp  cron  cross-region  cubism  culture  data  data-loss  database  databases  deep-health-checks  degradation  dependencies  deployment  derp  design  design-patterns  deviance  devops  disasters  distcomp  distributed  distributed-cron  distributed-systems  distsys  downtime  dynamodb  elasticsearch  engineering  erasure-coding  erlang  error-injection  ethereum  etsy  eventual-consistency  exactly-once-delivery  excel  exception-handling  exponential  exponential-backoff  facebook  fail  fail-fast  failover  failure  fallback  fault-domains  fault-tolerance  fedora  files  filesystems  finance  findbugs  five-whys  fleets  floating-point  fsync  funny  gae  gcp  gfs  google  gps  guidelines  hacks  haproxy  hardware  hbase  hdfs  health  healthz  hn  honeybadger  horizon-charts  http  hyrum-wright  hyrums-law  ides  incident-response  incidents  innodb  interfaces  internet  internet-scale  iops  iphone  isolation  jailbreaking  james-hamilton  java  jay-kreps  jepsen  jitter  jpl  jpmorgan  json  just-culture  kafka  kellabyte  kinesis  languages  latencies  latency  law  laws  learning  libraries  linux  load  locationd  london-whale  macrumors  management  mapreduce  marc-brooker  master-election  messaging  michael-nygard  microservices  mnesia  mocking  money  mongodb  monitorama  monitoring  monkey-patching  mttd  mttr  multi-az  multi-dc  multi-region  mvcc  mysql  nasa  network  network-partitions  networking  normalization-of-deviance  norms  nosql  ntp  occ  on-call  oo  operability  operating-systems  operations  ops  osdi  outages  oversight  packing  papers  partition  partitions  patterns  paxos  pdf  peter-bailis  pinterest  politics  posix  post-mortem  post-mortems  postgres  postgresql  postmortems  processes  programming  psychological-safety  qdisc  quants  queueing  queues  rabbitmq  race-conditions  rails  reading  realtime  reconstruction  redis  redundancy  reed-solomon  reference  regulation  reliability  replicas  replication  requests  resiliency  retries  retrying  riak  robustness  root-cause  ruby  safety  scalability  scale  scaling  search  serializability  serialization  server-side  services  severity  shopify  shorn-writes  sinatra  skew  slides  smartstack  soa  software  spreadsheets  ssd  startup  statistics  status  storage  streaming  systemd  systems  tc  tcp  techops  tellybug  testability  testing  tier-one-support  time  timeout  to-read  transactional  twitter  unit-tests  unix  uptime  usenix  value-at-risk  vegas-rule  via:aphyr  via:conall  via:fanf  via:kragen  vint-cerf  voldemort  wangle  web  web-services  work  workplaces  yelp  zookeeper 

Copy this bookmark: