jm + distcomp   83

A Decade of Dynamo: Powering the next wave of high-performance, internet-scale applications - All Things Distributed
A deep dive on how we were using our existing databases revealed that they were frequently not used for their relational capabilities. About 70 percent of operations were of the key-value kind, where only a primary key was used and a single row would be returned. About 20 percent would return a set of rows, but still operate on only a single table.

With these requirements in mind, and a willingness to question the status quo, a small group of distributed systems experts came together and designed a horizontally scalable distributed database that would scale out for both reads and writes to meet the long-term needs of our business. This was the genesis of the Amazon Dynamo database.

The success of our early results with the Dynamo database encouraged us to write Amazon's Dynamo whitepaper and share it at the 2007 ACM Symposium on Operating Systems Principles (SOSP conference), so that others in the industry could benefit. The Dynamo paper was well-received and served as a catalyst to create the category of distributed database technologies commonly known today as "NoSQL."

That's not an exaggeration. Nice one Werner et al!
dynamo  history  nosql  storage  databases  distcomp  amazon  papers  acm  data-stores 
14 days ago by jm
Exactly-once Support in Apache Kafka – Jay Kreps
If you’re one of the people who think [exactly-once support is impossible], I’d ask you to take an actual look at what we actually know to be possible and impossible, and what has been built in Kafka, and hopefully come to a more informed opinion. So let’s address this in two parts. First, is exactly-once a theoretical impossibility? Second, how does Kafka support it.
exactly-once-delivery  distributed  kafka  distcomp  jay-kreps  coding  broadcast 
july 2017 by jm
Exactly-once Semantics is Possible: Here's How Apache Kafka Does it
How does this feature work? Under the covers it works in a way similar to TCP; each batch of messages sent to Kafka will contain a sequence number which the broker will use to dedupe any duplicate send. Unlike TCP, though—which provides guarantees only within a transient in-memory connection—this sequence number is persisted to the replicated log, so even if the leader fails, any broker that takes over will also know if a resend is a duplicate. The overhead of this mechanism is quite low: it’s just a few extra numeric fields with each batch of messages. As you will see later in this article, this feature add negligible performance overhead over the non-idempotent producer.
kafka  sequence-numbers  dedupe  deduplication  unique  architecture  distcomp  streaming  idempotence 
july 2017 by jm
Cadence: Microservice architecture beyond request/reply – @Scale
Uber’s request/reply handling middleware — based on the SWF API, it seems
swf  apis  microservices  uber  cadence  asynchronous  request-reply  distcomp  queueing  middleware  go 
june 2017 by jm
A Brief History of the UUID · Segment Blog
This is great, by Rick Branson. I didn't realise UUIDs came from Apollo
history  distributed  distcomp  uuids  ids  coding  apollo  unix 
june 2017 by jm
Hadoop Internals
This is the best documentation on the topic I've seen in a while
hadoop  map-reduce  architecture  coding  java  distcomp 
february 2017 by jm
Service discovery at Stripe
Writeup of their Consul-based service discovery system, a bit similar to smartstack. Good description of the production problems that they saw with Consul too, and also they figured out that strong consistency isn't actually what you want in a service discovery system ;)

HN comments are good too:
consul  api  microservices  service-discovery  dns  load-balancing  l7  tcp  distcomp  smartstack  stripe  cap-theorem  scalability 
november 2016 by jm
Doorman is a solution for Global Distributed Client Side Rate Limiting. Clients that talk to a shared resource (such as a database, a gRPC service, a RESTful API, or whatever) can use Doorman to voluntarily limit their use (usually in requests per second) of the resource. Doorman is written in Go and uses gRPC as its communication protocol. For some high-availability features it needs a distributed lock manager. We currently support etcd, but it should be relatively simple to make it use Zookeeper instead.

From google -- very interesting to see they're releasing this as open source, and it doesn't rely on G-internal services
distributed  distcomp  locking  youtube  golang  doorman  rate-limiting  rate-limits  limits  grpc  etcd 
july 2016 by jm
Collecting my thoughts about Torus
Worryingly-optimistic communications about CoreOS' recently-announced distributed storage system. I had similar thoughts, but Jeff Darcy is actually an expert on this stuff so he's way more worth listening to on the topic ;)
jeff-darcy  distcomp  filesystems  coreos  torus  storage 
june 2016 by jm
Lamport timestamps
'The algorithm of Lamport timestamps is a simple algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. They are named after their creator, Leslie Lamport.'

See also vector clocks (which I think would be generally preferable nowadays).
vector-clocks  distributed  programming  algorithm  clocks  time  leslie-lamport  coding  distcomp 
may 2016 by jm
TIL: clock skew exists
good roundup of real-world clock skew links
clocks  clock-skew  ntp  realtime  time  bugs  distcomp  reliability  skew 
february 2016 by jm
How to do distributed locking
A critique of the "Redlock" locking algorithm from Redis by Martin Kleppman. antirez responds here: ; summary of followups:
distributed  locking  redis  algorithms  coding  distcomp  redlock  martin-kleppman  zookeeper 
february 2016 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
Hybrid Logical Clocks
neat substitute for physical-time clocks in synchronization and ordering in a distributed system, based on Lamport's Logical Clocks and Google's TrueTime.

'HLC captures the causality relationship like LC, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of PT clocks since it maintains its logical clock to be always close to the PT clock.'
hlc  clocks  logical-clocks  time  synchronization  ordering  events  logs  papers  algorithms  truetime  distcomp 
june 2015 by jm
Please stop calling databases CP or AP
In his excellent blog post [...] Jeff Hodges recommends that you use the CAP theorem to critique systems. A lot of people have taken that advice to heart, describing their systems as “CP” (consistent but not available under network partitions), “AP” (available but not consistent under network partitions), or sometimes “CA” (meaning “I still haven’t read Coda’s post from almost 5 years ago”).

I agree with all of Jeff’s other points, but with regard to the CAP theorem, I must disagree. The CAP theorem is too simplistic and too widely misunderstood to be of much use for characterizing systems. Therefore I ask that we retire all references to the CAP theorem, stop talking about the CAP theorem, and put the poor thing to rest. Instead, we should use more precise terminology to reason about our trade-offs.
cap  databases  storage  distcomp  ca  ap  cp  zookeeper  consistency  reliability  networking 
may 2015 by jm
You Cannot Have Exactly-Once Delivery
Cut out and keep:
Within the context of a distributed system, you cannot have exactly-once message delivery. Web browser and server? Distributed. Server and database? Distributed. Server and message queue? Distributed. You cannot have exactly-once delivery semantics in any of these situations.
distributed  distcomp  exactly-once-delivery  networking  outages  network-partitions  byzantine-generals  reference 
march 2015 by jm
Combining static model checking with dynamic enforcement using the Statecall Policy Language
This looks quite nice -- a model-checker "for regular programmers". Example model for ping(1):

<pre>01 automaton ping (int max_count, int count, bool can_timeout) {
02 Initialize;
03 during {
04 count = 0;
05 do {
06 Transmit_Ping;
07 either {
08 Receive_Ping;
09 } or (can_timeout) {
10 Timeout_Ping;
11 };
12 count = count + 1;
13 } until (count &gt;= max_count);
14 } handle {
16 Print_Summary;
17 };</pre>
ping  model-checking  models  formal-methods  verification  static  dynamic  coding  debugging  testing  distcomp  papers 
march 2015 by jm
Reliable Cron across the Planet - ACM Queue
How Google (hi Niall!) built their internal "distributed cron" service, using a Paxos-driven master election process at its core. I've been looking for a distributed cron for donkey's years, I wish someone would write a decent open source one....
distributed-systems  cron  acm  paxos  distributed-cron  master-election  distcomp  reliability 
march 2015 by jm
Stellar/Ripple suffer a failure of their consensus system, resulting in a split-brain failure
Prof. Mazières’s research indicated some risk that consensus could fail, though we were nor certain if the required circumstances for such a failure were realistic. This week, we discovered the first instance of a consensus failure. On Tuesday night, the nodes on the network began to disagree and caused a fork of the ledger. The majority of the network was on ledger chain A. At some point, the network decided to switch to ledger chain B. This caused the roll back of a few hours of transactions that had only been recorded on chain A. We were able to replay most of these rolled back transactions on chain B to minimize the impact. However, in cases where an account had already sent a transaction on chain B the replay wasn’t possible.
consensus  distcomp  stellar  ripple  split-brain  postmortems  outages  ledger-fork  payment 
december 2014 by jm
AWS re:Invent 2014 | (SPOT302) Under the Covers of AWS: Its Core Distributed Systems - YouTube
This is a really solid talk -- not surprising, alv@ is one of the speakers!
"AWS and operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.

In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives -- DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS. "

scale  scaling  aws  amazon  dht  logging  data-structures  distcomp  via:marc-brooker  dynamodb  s3 
november 2014 by jm
Exactly-Once Delivery May Not Be What You Want
An extremely good explanation from Marc Brooker that exactly-once delivery in a distributed system is very hard.
And so on. There's always a place to slot in one more turtle. The bad news is that I'm not aware of a nice solution to the general problem for all side effects, and I suspect that no such solution exists. On the bright side, there are some very nice solutions that work really well in practice. The simplest is idempotence. This is a very simple idea: we make the tasks have the same effect no matter how many times they are executed.
architecture  messaging  queues  exactly-once-delivery  reliability  fault-tolerance  distcomp  marc-brooker 
november 2014 by jm
Great quote from Voldemort author Jay Kreps
"Reading papers: essential. Slavishly implementing ideas you read: not necessarily a good idea. Trust me, I wrote an Amazon Dynamo clone."

Later in the discussion, on complex conflict resolution logic (as used in Dynamo, Voldemort, and Riak):

"I reviewed 200 Voldemort stores, 190 used default lww conflict resolution. 10 had custom logic, all 10 of which had bugs." --

(although IMO I'd prefer complex resolution to non-availability, when AP is required)
voldemort  jay-kreps  dynamo  cap-theorem  ap  riak  papers  lww  conflict-resolution  distcomp 
november 2014 by jm
Carlos Baquero presents several operation, state-based CRDTs for use in AP systems like Voldemort and Riak
ap  cap-theorem  crdts  ricon  carlos-baquero  data-structures  distcomp 
october 2014 by jm
Mnesia and CAP
A common “trick” is to claim:

'We assume network partitions can’t happen. Therefore, our system is CA according to the CAP theorem.'

This is a nice little twist. By asserting network partitions cannot happen, you just made your system into one which is not distributed. Hence the CAP theorem doesn’t even apply to your case and anything can happen. Your system may be linearizable. Your system might have good availability. But the CAP theorem doesn’t apply. [...]
In fact, any well-behaved system will be “CA” as long as there are no partitions. This makes the statement of a system being “CA” very weak, because it doesn’t put honesty first. I tries to avoid the hard question, which is how the system operates under failure. By assuming no network partitions, you assume perfect information knowledge in a distributed system. This isn’t the physical reality.
cap  erlang  mnesia  databases  storage  distcomp  reliability  ca  postgres  partitions 
october 2014 by jm
The Saga pattern
'a distribution of long-living [distributed] transactions where steps may interleave, each with associated compensating transactions providing a compensation path across databases in the occurrence of a fault that may or may not compensate the entire chain back to the originator.'
distributed  messaging  saga  patterns  architecture  transactions  distributed-transactions  distcomp 
october 2014 by jm
mcrouter: A memcached protocol router for scaling memcached deployments
New from Facebook engineering:
Last year, at the Data@Scale event and at the USENIX Networked Systems Design and Implementation conference , we spoke about turning caches into distributed systems using software we developed called mcrouter (pronounced “mick-router”). Mcrouter is a memcached protocol router that is used at Facebook to handle all traffic to, from, and between thousands of cache servers across dozens of clusters distributed in our data centers around the world. It is proven at massive scale — at peak, mcrouter handles close to 5 billion requests per second. Mcrouter was also proven to work as a standalone binary in an Amazon Web Services setup when Instagram used it last year before fully transitioning to Facebook's infrastructure.

Today, we are excited to announce that we are releasing mcrouter’s code under an open-source BSD license. We believe it will help many sites scale more easily by leveraging Facebook’s knowledge about large-scale systems in an easy-to-understand and easy-to-deploy package.

This is pretty crazy -- basically turns a memcached cluster into a much more usable clustered-storage system, with features like shadowing production traffic, cold cache warmup, online reconfiguration, automatic failover, prefix-based routing, replicated pools, etc. Lots of good features.
facebook  scaling  cache  proxy  memcache  open-source  clustering  distcomp  storage 
september 2014 by jm
Aerospike's CA boast gets a thumbs-down from @aphyr
Specifically, @aerospikedb cannot offer cursor stability, repeatable read, snapshot isolation, or any flavor of serializability.
@nasav @aerospikedb At *best* you can offer Read Committed, which is not, I assert, what most people would expect from an "ACID" database.
aphyr  aerospike  availability  consistency  acid  transactions  distcomp  databases  storage 
september 2014 by jm
"Perspectives On The CAP Theorem" [pdf]
"We cannot achieve [CAP theorem] consistency and availability in a partition-prone network."
papers  cap  distcomp  cap-theorem  consistency  availability  partitions  network  reliability 
september 2014 by jm
Microservices - Not a free lunch! - High Scalability
Some good reasons not to adopt microservices blindly. Testability and distributed-systems complexity are my biggest fears
microservices  soa  devops  architecture  testing  distcomp 
august 2014 by jm
"In Search of an Understandable Consensus Algorithm"
Diego Ongaro and John Ousterhout, USENIX ATC 2014 -- won best paper for this paper on the Raft algorithm. (via Eoin Brazil)
raft  consensus  algorithms  distcomp  john-ousterhout  via:eoinbrazil  usenix  atc  papers  paxos 
july 2014 by jm
"The Tail at Scale"
by Jeffrey Dean and Luiz Andre Barroso, Google. A selection of Google's architectural mechanisms used to defeat 99th-percentile latency spikes: hedged requests, tied requests, micro-partitioning, selective replication, latency-induced probation, canary requests.
google  architecture  distcomp  soa  http  partitioning  replication  latency  99th-percentile  canary-requests  hedged-requests 
july 2014 by jm
Call me maybe: RabbitMQ
We used Knossos and Jepsen to prove the obvious: RabbitMQ is not a lock service. That investigation led to a discovery hinted at by the documentation: in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor. This is not a new result, but it may be surprising if you haven’t read the docs closely–especially if you interpreted the phrase “chooses Consistency and Partition Tolerance” to mean, well, either of those things.
rabbitmq  network  partitions  failure  cap-theorem  consistency  ops  reliability  distcomp  jepsen 
june 2014 by jm
Use of Formal Methods at Amazon Web Services
Chris Newcombe, Marc Brooker, et al. writing about their experience using formal specification and model-checking languages (TLA+) in production in AWS:

The success with DynamoDB gave us enough evidence to present TLA+ to the broader engineering community at Amazon. This raised a challenge; how to convey the purpose and benefits of formal methods to an audience of software engineers? Engineers think in terms of debugging rather than ‘verification’, so we called the presentation “Debugging Designs”.

Continuing that metaphor, we have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it 'Exhaustively-testable pseudo-code'.

We initially avoid the words ‘formal’, ‘verification’, and ‘proof’, due to the widespread view that formal methods are impractical. We also initially avoid mentioning what the acronym ‘TLA’ stands for, as doing so would give an incorrect impression of complexity.

More slides at ; proggit discussion at
formal-methods  model-checking  tla  tla+  programming  distsys  distcomp  ebs  s3  dynamodb  aws  ec2  marc-brooker  chris-newcombe 
june 2014 by jm
"Replicated abstract data types: Building blocks for collaborative applications"
cited at as 'one of my favorite papers on CRDTs and provides practical pseudocode for learning how to implement CRDTs yourself', in a discussion on cemerick's "Distributed Systems and the End of the API":
distcomp  networking  distributed  crdts  algorithms  text  data-structures  cap 
may 2014 by jm
Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper
This is the written version of a presentation [Camille Fournier] made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

good advice from one of the ZK committers.
zookeeper  service-discovery  architecture  distcomp  camille-fournier  availability  wan  network 
may 2014 by jm
'Pickles & Spores: Improving Support for Distributed Programming in Scala
'Spores are "small units of possibly mobile functional behavior". They're a closure-like abstraction meant for use in distributed or concurrent environments. Spores provide a guarantee that the environment is effectively immutable, and safe to ship over the wire. Spores aim to give library authors some confidence in exposing functions (or, rather, spores) in public APIs for safe consumption in a distributed or concurrent environment.

The first part of the talk covers a simpler variant of spores as they are proposed for inclusion in Scala 2.11. The second part of the talk briefly introduces a current research project ongoing at EPFL which leverages Scala's type system to provide type constraints that give authors finer-grained control over spore capturing semantics. What's more, these type constraints can be composed during spore composition, so library authors are effectively able to propagate expert knowledge via these composable constraints.

The last part of the talk briefly covers Scala/Pickling, a fast new, open serialization framework.'
pickling  scala  presentations  spores  closures  fp  immutability  coding  distributed  distcomp  serialization  formats  network 
april 2014 by jm
Nice-looking new tool from Hashicorp; service discovery and configuration service, built on Raft for leader election, Serf for gossip-based messaging, and Go. Some features:

* Gossip is performed over both TCP and UDP;

* gossip messages are encrypted symmetrically and therefore secure from eavesdropping, tampering, spoofing and packet corruption (like the incident which brought down S3 for days: );

* exposes both a HTTP interface and (even better) DNS;

* includes explicit support for long-distance WAN operation as well as on LANs.

It all looks very practical and usable. MPL-licensed.

The only potential risk I can see is that expecting to receive config updates from a blocking poll of the HTTP interface needs some good "best practice" docs, to ensure that people don't mishandle the scenario where there is a network partition between your calling code and the Consul server/agent. Without any heartbeating protocol behind the scenes, HTTP is vulnerable to "hung connections" which would result in a config change being silently missed by the client until the connection eventually is timed out, either by the calling code or the client-side kernel. This could potentially take minutes to occur, which in some usage scenarios could be a big, unforeseen problem.
configuration  service-discovery  distcomp  raft  consensus-algorithms  go  mpl  open-source  dns  http  gossip-protocol  hashicorp 
april 2014 by jm
Shuffle Sharding
Colm MacCarthaigh writes about a simple sharding/load-balancing algorithm which uses randomized instance selection and optional additional compartmentalization. See also: continuous hashing, and
hashing  load-balancing  sharding  partitions  dist-sys  distcomp  architecture  coding 
april 2014 by jm
Scalable Atomic Visibility with RAMP Transactions
Great new distcomp protocol work from Peter Bailis et al:
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...]

How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.

Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.

The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
scale  synchronization  databases  distcomp  distributed  ramp  transactions  scalability  peter-bailis  protocols  sharding  concurrency  atomic  partitions 
april 2014 by jm
'Testing applications under slow or flaky network conditions can be difficult and time consuming. Blockade aims to make that easier. A config file defines a number of docker containers and a command line tool makes introducing controlled network problems simple.'

Open-source release from Dell's Cloud Manager team (ex-Enstratius), inspired by aphyr's Jepsen. Simulates packet loss using "tc netem", so no ability to e.g. drop packets on certain flows or certain ports. Still, looks very usable -- great stuff.
testing  docker  networking  distributed  distcomp  enstratius  jepsen  network  outages  partitions  cap  via:lusis 
february 2014 by jm
Mathematical Purity in Distributed Systems: CRDTs Without Fear
Via Tony Finch. Funnily enough, the example describes Swrve: mobile game analytics, backed by a CRDT-based eventually consistent data store ;)
storage  crdts  semilattice  idempotency  commutativity  data-structures  distcomp  eventual-consistency 
january 2014 by jm
Replicant: Replicated State Machines Made Easy
The next time you reach for ZooKeeper, ask yourself whether it provides the primitive you really need. If ZooKeeper's filesystem and znode abstractions truly meet your needs, great. But the odds are, you'll be better off writing your application as a replicated state machine.
zookeeper  paxos  replicant  replication  consensus  state-machines  distcomp 
december 2013 by jm
Kelly "kellabyte" Sommers on Redis' "relaxed CP" approach to the CAP theorem

Similar to ACID properties, if you partially provide properties it means the user has to _still_ consider in their application that the property doesn't exist, because sometimes it doesn't. In you're fsync example, if fsync is relaxed and there are no replicas, you cannot consider the database durable, just like you can't consider Redis a CP system. It can't be counted on for guarantees to be delivered. This is why I say these systems are hard for users to reason about. Systems that partially offer guarantees require in-depth knowledge of the nuances to properly use the tool. Systems that explicitly make the trade-offs in the designs are easier to reason about because it is more obvious and _predictable_.
kellabyte  redis  cp  ap  cap-theorem  consistency  outages  reliability  ops  database  storage  distcomp 
december 2013 by jm
Jeff Dean - Taming Latency Variability and Scaling Deep Learning [talk]
'what Jeff Dean and team have been up to at Google'. Reducing request latency in a network SOA architecture using backup requests, etc., via Ilya Grigorik
youtube  talks  google  low-latency  soa  architecture  distcomp  jeff-dean  networking 
november 2013 by jm
A Brief Tour of FLP Impossibility
One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper ‘Impossibility of Distributed Consensus with One Faulty Process’, which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.

This particular result, known as the ‘FLP result’, settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus – that is, getting a distributed network of processors to agree on a common value – was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.

This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it’s not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.
distributed-systems  flp  consensus-algorithms  algorithms  distcomp  papers  proofs 
november 2013 by jm
Storm at - London Storm Meetup 2013-06-18
Not just a Storm success story. Interesting slides indicating where a startup *stopped* using Storm as realtime wasn't useful to their customers
storm  realtime  hadoop  cascading  python  cep  anti-spam  events  architecture  distcomp  low-latency  slides  rabbitmq 
october 2013 by jm
The trouble with timestamps
Timestamps, as implemented in Riak, Cassandra, et al, are fundamentally unsafe ordering constructs. In order to guarantee consistency you, the user, must ensure locally monotonic and, to some extent, globally monotonic clocks. This is a hard problem, and NTP does not solve it for you. When wall clocks are not properly coupled to the operations in the system, causal constraints can be violated. To ensure safety properties hold all the time, rather than probabilistically, you need logical clocks.
clocks  time  distributed  databases  distcomp  ntp  via:fanf  aphyr  vector-clocks  last-write-wins  lww  cassandra  riak 
october 2013 by jm
Non-blocking transactional atomicity
interesting new distributed atomic transaction algorithm from Peter Bailis
algorithms  database  distributed  scalability  storage  peter-bailis  distcomp 
october 2013 by jm
Rapid read protection in Cassandra 2.0.2
Nifty new feature -- if a request takes over the 99th percentile for requests to that server, it'll be repeated against another replica. Unnecessary for Voldemort, of course, which queries all replicas anyway!
cassandra  nosql  replication  distcomp  latency  storage 
october 2013 by jm
Model checking for highly concurrent code
Applied formal methods in order to test distributed systems -- specifically GlusterFS:

I'll use an example from my own recent experience. I'm developing a new kind of replication for GlusterFS. To make sure the protocol behaves correctly even across multiple failures, I developed a Murphi model for it. [...]

I added a third failure [to the simulated model]. I didn't expect a three-node system to continue working if more than one of those were concurrent (the model allows the failures to be any mix of sequential and concurrent), but I expected it to fail cleanly without reaching an invalid state. Surprise! It managed to produce a case where a reader can observe values that go back in time. This might not make much sense without knowing the protocol involved, but it might give some idea of the crazy conditions a model checker will find that you couldn't possibly have considered. [...]

So now I have a bug to fix, and that's a good thing. Clearly, it involves a very specific set of ill-timed reads, writes, and failures. Could I have found it by inspection or ad-hoc analysis? Hell, no. Could I have found it by testing on live systems? Maybe, eventually, but it probably would have taken months for this particular combination to occur on its own. Forcing it to occur would require a lot of extra code, plus an exerciser that would amount to a model checker running 100x slower across machines than Murphi does. With enough real deployments over enough time it would have happened, but the only feasible way to prevent that was with model checking. These are exactly the kinds of bugs that are hardest to fix in the field, and that make users distrust distributed systems, so those of us who build such systems should use every tool at our disposal to avoid them.
model-checking  formal-methods  modelling  murphi  distcomp  distributed-systems  glusterfs  testing  protocols 
september 2013 by jm
Call me maybe: Kafka
Aphyr takes a look at Kafka 0.8's replication with the Jepsen test suite. It doesn't go great. Jay Kreps responds here:
jay-kreps  kafka  replication  distributed-systems  distcomp  networking  reliability  fault-tolerance  jepsen 
september 2013 by jm
Non-blocking transactional atomicity
Peter Bailis with an interesting distributed-storage atomicity algorithm for performing multi-record transactional updates
algorithms  nbta  transactions  databases  storage  distcomp  distributed  atomic  coding  eventual-consistency  crdts 
september 2013 by jm
"Scalable Eventually Consistent Counters over Unreliable Networks" [paper, pdf]

Counters are an important abstraction in distributed computing, and
play a central role in large scale geo-replicated systems, counting events such as web page impressions or social network "likes". Classic distributed counters, strongly consistent, cannot be made both available and partition-tolerant, due to the CAP Theorem, being unsuitable to large scale scenarios.

This paper defi nes Eventually Consistent Distributed Counters (ECDC) and presents an implementation of the concept, Hando ff Counters, that is scalable and works over unreliable networks. By giving up the sequencer aspect of classic distributed counters, ECDC implementations can be made AP in the CAP design space, while retaining the essence of counting. Handoff Counters are the first CRDT (Conflict-free Replicated Data Type) based mechanism that overcomes the identity explosion problem in naive CRDTs, such as G-Counters (where state size is linear in the number of independent actors that ever incremented the counter), by managing identities towards avoiding global propagation, and garbage collecting temporary entries. The approach used in Hando ff Counters is not restricted to counters, being more generally applicable to other data types with associative and commutative operations.
pdf  papers  eventual-consistency  counters  distributed-systems  distcomp  cap-theorem  ecdc  handoff-counters  crdts  data-structures  g-counters 
august 2013 by jm
Beating the CAP Theorem Checklist
'Your ( ) tweet ( ) blog post ( ) marketing material ( ) online comment
advocates a way to beat the CAP theorem. Your idea will not work. Here is why
it won't work:'

lovely stuff, via Bill De hOra
via:dehora  funny  cap  cs  distributed-systems  distcomp  networking  partitions  state  checklists 
august 2013 by jm
_In Search of an Understandable Consensus Algorithm_, Diego Ongaro and John Ousterhout, Stanford

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety. Results from a user study demonstrate that Raft is easier for students to learn than Paxos.
distributed  algorithms  paxos  raft  consensus-algorithms  distcomp  leader-election  replication  clustering 
august 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.

There's still a potential race condition where the "GC scout" can be achieved quickly, then a GC starts just before the "real" request is issued. But the incidence of GC-blocking-request is probably massively reduced.

It also helps against packet loss on the rack or server host, since packet loss will cause the drop of one of the TCP packets, and the TCP retransmit timeout will certainly be higher than 1ms, causing the deadline to be missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's network vicinity, it will be vital to still attempt to send the request to the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.

For service APIs with real low-latency requirements, this is a great idea.
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Riak, CAP, and eventual consistency
Good (albeit draft) write-up of the implications of CAP, allow_mult, and last_write_wins conflict-resolution policies in Riak:
As Brewer's CAP theorem established, distributed systems have to make hard choices. Network partition is inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from "stop accepting any writes until the outage is resolved" (thus maintaining absolute consistency) to "allow any writes and worry about consistency later" (to maximize availability). Riak leans toward the availability end of the spectrum, but allows the operator and even the developer to tune read and write requests to better meet the business needs for any given set of data.
riak  cap  eventual-consistency  distcomp  distributed-systems  partition  last-write-wins  voldemort  allow_mult 
april 2013 by jm
Eventual Consistency Today: Limitations, Extensions, and Beyond - ACM Queue
Good overview of the current state of eventually-consistent data store research, covering CALM and CRDTs, from Peter Bailis and Ali Ghodsi
eventual-consistency  data  storage  horizontal-scaling  research  distcomp  distributed-systems  via:martin-thompson  crdts  calm  acid  cap 
april 2013 by jm
CRDTs - Commutative Replicated Data Types [pdf]

Shared read-only data is easy to scale by using well-understood replication techniques. However, sharing mutable data at a large scale is a dicult problem, because of the CAP impossibility result [5]. Two approaches dominate in practice. One ensures scalability by giving up consistency guarantees, for instance using the Last-Writer-Wins (LWW) approach [7]. The alternative guarantees consistency by serialising all updates, which does not scale beyond a small cluster [12]. Optimistic replication allows replicas to diverge, eventually resolving conflicts either by LWW-like methods or by serialisation [11].

In some (limited) cases, a radical simpli cation is possible. If concurrent updates to some datum commute, and all of its replicas execute all updates in causal order, then the replicas converge.1 We call this a Commutative Replicated Data Type (CRDT). The CRDT approach ensures that there are no conflicts, hence, no need for consensus-based concurrency control. CRDTs are not a universal solution, but, perhaps surprisingly, we were able to design highly useful CRDTs. This new research direction is promising as it ensures consistency in the large scale at a low cost, at least for some applications.
consistency  algorithms  concurrency  crdts  distcomp  data 
april 2013 by jm
Introducing Chronos: A Replacement for Cron
A distributed, fault-tolerant "cron" is something which comes up frequently -- it makes for a great fault-tolerance building block. This one sounds like it's too closely tied into Mesos, though (IMO).
Chronos is our replacement for cron. It is a distributed and fault-tolerant scheduler which runs on top of Mesos. It's a framework and supports custom mesos executors as well as the default command executor. Thus by default, Chronos executes SH (on most systems BASH) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transfering files and executing them on a remote machine in the background and using asynchroneous callbacks to notify Chronos of job completion or failures.
cron  scheduling  mesos  stacks  design  airbnb  chronos  fault-tolerance  distcomp  distributed-computing  scripts  jobs 
march 2013 by jm
Tactical Chat: How the U.S. Military Uses IRC to Wage War
Excellent stuff. Lessons to be learned from this: IRC has some key features that mean it can be useful in this case.
1. simple text, everything supports it, no fancy UI clients are necessary;
2. resilient against lossy/transient/low-bandwidth/high-latency networks;
3. standards-compliant and "battle-hardened" (so to speak);
4. open-source/non-proprietary.
Despite the U.S. military’s massive spending each year on advanced communications technology, the use of simple text chat or tactical chat has outpaced other systems to become one of the most popular paths for communicating practical information on the battlefield.  Though the use of text chat by the U.S. military first began in the early 1990s, in recent years tactical chat has evolved into a “primary ‘comms’ path, having supplanted voice communications as the primary means of common operational picture (COP) updating in support of situational awareness.”  An article from January 2012 in the Air Land Sea Bulletin describes the value of tactical chat as an effective and immediate communications method that is highly effective in distributed, intermittent, low bandwidth environments which is particularly important with “large numbers of distributed warfighters” who must “frequently jump onto and off of a network” and coordinate with other coalition partners.  Text chat also provides “persistency in situational understanding between those leaving and those assuming command watch duties” enabling a persistent record of tactical decision making.
A 2006 thesis from the Naval Postgraduate School states that internet relay chat (IRC) is one of the most widely used chat protocols for military command and control (C2).  Software such as mIRC, a Windows-based chat client, or integrated systems in C2 equipment are used primarily in tactical conditions though efforts are underway to upgrade systems to newer protocols. 

(via JK)
via:jk  war  irc  chat  mirc  us-military  tactical-chat  distcomp  networking 
march 2013 by jm
Netflix Curator
a high-level API that greatly simplifies using ZooKeeper. It adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster and retrying operations. Some of the features are:

Automatic connection management: There are potential error cases that require ZooKeeper clients to recreate a connection and/or retry operations. Curator automatically and transparently (mostly) handles these cases.

Cleaner API: simplifies the raw ZooKeeper methods, events, etc.; provides a modern, fluent interface

Recipe implementations (see Recipes): Leader election, Shared lock, Path cache and watcher, Distributed Queue, Distributed Priority Queue
zookeeper  java  netflix  distcomp  libraries  oss  open-source  distributed 
march 2013 by jm
Marc Brooker's "two-randoms" load balancing approach
Marc Brooker on this interesting load-balancing algorithm, including simulation results:
Using stale data for load balancing leads to a herd behavior, where requests will herd toward a previously quiet host for much longer than it takes to make that host very busy indeed. The next refresh of the cached load data will put the server high up the load list, and it will become quiet again. Then busy again as the next herd sees that it's quiet. Busy. Quiet. Busy. Quiet. And so on. One possible solution would be to give up on load balancing entirely, and just pick a host at random. Depending on the load factor, that can be a good approach. With many typical loads, though, picking a random host degrades latency and reduces throughput by wasting resources on servers which end up unlucky and quiet.

The approach taken by the studies surveyed by Mitzenmacher is to try two hosts, and pick the one with the least load. This can be done directly (by querying the hosts) but also works surprisingly well on cached load data. [...] Best of 2 is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.

Having seen what Marc has worked on, and written, inside Amazon, I'd take this very seriously... cool to see he is blogging externally too.
algorithm  load-balancing  distcomp  distributed  two-randoms  marc-brooker  least-conns 
february 2013 by jm
Timelike 2: everything fails all the time
Fantastic post on large-scale distributed load balancing strategies from @aphyr. Random and least-conns routing comes out on top in his simulation (although he hasn't yet tried Marc Brooker's two-randoms routing strategy)
via:hn  routing  distributed  least-conns  load-balancing  round-robin  distcomp  networking  scaling 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
Heroku finds out that distributed queueing is hard
Stage 3 of the Rap Genius/Heroku blog drama. Summary (as far as I can tell): Heroku gave up on a fully-synchronised load-balancing setup ("intelligent routing"), since it didn't scale, in favour of randomised queue selection; they didn't sufficiently inform their customers, and metrics and docs were not updated to make this change public; the pessimal case became pretty damn pessimal; a customer eventually noticed and complained publicly, creating a public shit-storm.

Comments: 1. this is why you monitor real HTTP request latency (scroll down for crazy graphs!). 2. include 90/99 percentiles to catch the "tail" of poorly-performing requests. 3. Load balancers are hard. has more info on the intricacies of distributed load balancing -- worth a read.
heroku  rap-genius  via:hn  networking  distcomp  distributed  load-balancing  ip  queueing  percentiles  monitoring 
february 2013 by jm
Clairvoyant Squirrel: Large Scale Malicious Domain Classification
Storm-based service to detect malicious DNS domain usage from streaming pcap data in near-real-time. Uses string features in the DNS domain, along with randomness metrics using Markov analysis, combined with a Random Forest classifier, to achieve 98% precision at 10,000 matches/sec
storm  distributed  distcomp  random-forest  classifiers  machine-learning  anti-spam  slides 
february 2013 by jm
Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm
Storm demo with a reasonably complex topology.
'how to implement a distributed, real-time trending topics algorithm in Storm. It uses the latest features available in Storm 0.8 (namely tick tuples) and should be a good starting point for anyone trying to implement such an algorithm for their own application. The new code is now available in the official storm-starter repository, so feel free to take a deeper look.'
storm  distcomp  distributed  tick-tuples  demo 
january 2013 by jm
Notes on Distributed Systems for Young Bloods
'Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.' This is a pretty nice list, a little over-stated, but that's the format. I particularly like the following: 'Exploit data-locality'; 'Learn to estimate your capacity'; 'Metrics are the only way to get your job done'; 'Use percentiles, not averages'; 'Extract services'.
systems  distributed  distcomp  cap  metrics  coding  guidelines  architecture  backpressure  design  twitter 
january 2013 by jm
Trident: a high-level abstraction for realtime computation
built on Storm:

Trident is a new high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.0. It allows you to seamlessly mix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar - Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.
distributed  realtime  twitter  storm  trident  distcomp  stream-processing  low-latency  nathan-marz 
october 2012 by jm
Spanner: Google's Globally-Distributed Database [PDF]

Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

To appear in:
OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.
database  distributed  google  papers  toread  pdf  scalability  distcomp  transactions  cap  consistency 
september 2012 by jm
MapReduce Patterns, Algorithms, and Use Cases
'I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found in the web or scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop’s MapReduce model with Mappers, Reduces, Combiners, Partitioners, and sorting.'
algorithms  hadoop  java  mapreduce  patterns  distcomp 
february 2012 by jm
Hacker News thread on Storm
lots of good questions and answers in here
twitter  storm  distcomp  distributed 
september 2011 by jm
Storm: distributed and fault-tolerant realtime computation
intro slideshow to this really nifty-looking distcomp platform
distcomp  distributed  realtime  storm  slides  twitter 
september 2011 by jm
« earlier      
per page:    204080120160

related tags

99th-percentile  acid  acm  aerospike  aggregation  airbnb  algorithm  algorithms  allow_mult  amazon  anti-spam  ap  aphyr  api  apis  apollo  appengine  architecture  async  asynchronous  atc  atomic  automation  availability  aws  backpressure  batch  big-data  broadcast  bugs  byzantine-generals  ca  cache  cadence  calm  camille-fournier  canary-requests  cap  cap-theorem  carlos-baquero  cascading  cassandra  cdt  cep  chat  checklists  chris-newcombe  chronos  classifiers  clock-skew  clocks  closures  clustering  coding  command-line  commutativity  concurrency  configuration  conflict-resolution  consensus  consensus-algorithms  consistency  consul  coreos  counters  cp  crdt  crdts  cron  cs  data  data-stores  data-structures  database  databases  debugging  dedupe  deduplication  demo  design  devops  dht  dist-sys  distcomp  distributed  distributed-computing  distributed-cron  distributed-systems  distributed-transactions  distsys  dns  docker  doorman  dynamic  dynamo  dynamodb  ebs  ec2  ecdc  enstratius  erlang  estimation  etcd  events  eventual-consistency  exactly-once-delivery  facebook  failover  failure  fault-tolerance  filemap  files  filesystems  flash-crash  fleets  flp  formal-methods  formats  fp  funny  g-counters  gae  gc  gc-scout  glusterfs  go  golang  google  gossip-protocol  grid  grpc  guidelines  hadoop  handoff-counters  hashicorp  hashing  hedged-requests  heroku  high-frequency-trading  histogram  history  hlc  horizontal-scaling  http  idempotence  idempotency  ids  immutability  infrastructure  ip  irc  java  jay-kreps  jeff-darcy  jeff-dean  jepsen  jobs  john-ousterhout  kafka  kellabyte  l7  last-write-wins  latency  leader-election  least-conns  ledger-fork  leslie-lamport  libraries  limits  linkedin  live  load-balancing  locking  logging  logical-clocks  logs  low-latency  lww  machine-learning  map-reduce  mapreduce  marc-brooker  martin-kleppman  master-election  memcache  mesos  messaging  metrics  microservices  middleware  mirc  mnesia  model-checking  modelling  models  monitoring  mpl  multi-dc  murphi  nathan-marz  nbta  netflix  netty  network  network-partitions  networking  nosql  ntp  nyse  on-call  open-source  opensource  ops  ordering  oss  outages  packet-loss  papers  parallel  partition  partitioning  partitions  patterns  paxos  payment  pdf  percentiles  peter-bailis  pickling  ping  post-mortems  postgres  postmortems  presentation  presentations  production  programming  proofs  protocols  proxy  python  qcon  queueing  queues  rabbitmq  raft  ramp  random-forest  rap-genius  rate-limiting  rate-limits  rdd  realtime  redis  redlock  reference  reliability  replicant  replication  request-reply  research  rest  riak  ricon  ripple  round-robin  routing  s3  saga  scala  scalability  scale  scaling  scheduling  scripts  sec  semilattice  sequence-numbers  serialization  service-discovery  sharding  skew  slas  slides  smartstack  soa  spark  sparrow  split-brain  spores  stack  stacks  state  state-machines  static  statistics  stellar  stock-market  storage  storm  stream-processing  streaming  streams  stripe  swf  synchronization  systems  tactical-chat  talks  tcp  testing  text  threadpools  tick-tuples  time  timeouts  tla  tla+  toread  torus  tracer-requests  tracing  trading  transactions  trident  truetime  twitter  two-randoms  uber  udp  unique  unix  us-military  usenix  uuids  vector-clocks  verification  via:dehora  via:eoinbrazil  via:fanf  via:hn  via:jk  via:lusis  via:marc-brooker  via:martin-thompson  via:nelson  visualization  voldemort  wan  war  waves  youtube  zipkin  zookeeper 

Copy this bookmark: