jm + distributed   47

Exactly-once Support in Apache Kafka – Jay Kreps
If you’re one of the people who think [exactly-once support is impossible], I’d ask you to take an actual look at what we actually know to be possible and impossible, and what has been built in Kafka, and hopefully come to a more informed opinion. So let’s address this in two parts. First, is exactly-once a theoretical impossibility? Second, how does Kafka support it.
exactly-once-delivery  distributed  kafka  distcomp  jay-kreps  coding  broadcast 
july 2017 by jm
A Brief History of the UUID · Segment Blog
This is great, by Rick Branson. I didn't realise UUIDs came from Apollo
history  distributed  distcomp  uuids  ids  coding  apollo  unix 
june 2017 by jm
youtube/doorman
Doorman is a solution for Global Distributed Client Side Rate Limiting. Clients that talk to a shared resource (such as a database, a gRPC service, a RESTful API, or whatever) can use Doorman to voluntarily limit their use (usually in requests per second) of the resource. Doorman is written in Go and uses gRPC as its communication protocol. For some high-availability features it needs a distributed lock manager. We currently support etcd, but it should be relatively simple to make it use Zookeeper instead.


From google -- very interesting to see they're releasing this as open source, and it doesn't rely on G-internal services
distributed  distcomp  locking  youtube  golang  doorman  rate-limiting  rate-limits  limits  grpc  etcd 
july 2016 by jm
Lamport timestamps
'The algorithm of Lamport timestamps is a simple algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. They are named after their creator, Leslie Lamport.'

See also vector clocks (which I think would be generally preferable nowadays).
vector-clocks  distributed  programming  algorithm  clocks  time  leslie-lamport  coding  distcomp 
may 2016 by jm
How to do distributed locking
A critique of the "Redlock" locking algorithm from Redis by Martin Kleppman. antirez responds here: http://antirez.com/news/101 ; summary of followups: https://storify.com/martinkl/redlock-discussion
distributed  locking  redis  algorithms  coding  distcomp  redlock  martin-kleppman  zookeeper 
february 2016 by jm
muxy
a proxy that mucks with your system and application context, operating at Layers 4 and 7, allowing you to simulate common failure scenarios from the perspective of an application under test; such as an API or a web application. If you are building a distributed system, Muxy can help you test your resilience and fault tolerance patterns.
proxy  distributed  testing  web  http  fault-tolerance  failure  injection  tcp  delay  resilience  error-handling 
september 2015 by jm
GitTorrent
'A Decentralized GitHub'. nifty
distributed  git  github  bittorrent  bitcoin  gittorrent  dvcs 
may 2015 by jm
"Trash Day: Coordinating Garbage Collection in Distributed Systems"
Another GC-coordination strategy, similar to Blade (qv), with some real-world examples using Cassandra
blade  via:adriancolyer  papers  gc  distsys  algorithms  distributed  java  jvm  latency  spark  cassandra 
may 2015 by jm
You Cannot Have Exactly-Once Delivery
Cut out and keep:
Within the context of a distributed system, you cannot have exactly-once message delivery. Web browser and server? Distributed. Server and database? Distributed. Server and message queue? Distributed. You cannot have exactly-once delivery semantics in any of these situations.
distributed  distcomp  exactly-once-delivery  networking  outages  network-partitions  byzantine-generals  reference 
march 2015 by jm
"Incremental Stream Processing using Computational Conflict-free Replicated Data Types" [paper]
'Unlike existing alternatives, such as stream processing, that favor the execution of arbitrary application code, we want to capture much of the processing logic as a set of known operations over specialized Computational CRDTs, with particular semantics and invariants, such as min/max/average/median registers, accumulators, top-N sets, sorted sets/maps, and so on. Keeping state also allows the system to decrease the amount of propagated information. Preliminary results obtained in a single example show that Titan has an higher throughput when compared with state of the art stream processing systems.'
crdt  distributed  stream-processing  replication  titan  papers 
january 2015 by jm
Consul case study from Hootsuite
Hootsuite used Consul for distributed configuration, specifically dark-launch feature flags, with great results:

'Trying out bleeding edge software can be a risky proposition, but in the case of Consul, we’ve found it to be a solid system that works basically as described and was easy to get up and running. We managed to go from initial investigations to production within a month. The value was immediately obvious after looking into the key-value store combined with the events system and it’s DNS features and each of these has worked how we expected. Overall it has been fun to work with and has worked well and based on the initial work we have done with the Dark Launching system we’re feeling confident in Consul’s operation and are looking forward to expanding the scope of it’s use.'
consul  dark-launches  feature-flags  configuration  distributed  hootsuite  notification 
november 2014 by jm
The Saga pattern
'a distribution of long-living [distributed] transactions where steps may interleave, each with associated compensating transactions providing a compensation path across databases in the occurrence of a fault that may or may not compensate the entire chain back to the originator.'
distributed  messaging  saga  patterns  architecture  transactions  distributed-transactions  distcomp 
october 2014 by jm
"Ark: A Real-World Consensus Implementation" [paper]
"an implementation of a consensus algorithm similar to Paxos and Raft, designed as an improvement over the existing consensus algorithm used by MongoDB and TokuMX."

It'll be interesting to see how this gets on in review from the distributed-systems community. The phrase "similar to Paxos and Raft" is both worrying and promising ;)
paxos  raft  consensus  algorithms  distsys  distributed  leader-election  mongodb  tokumx 
july 2014 by jm
"Replicated abstract data types: Building blocks for collaborative applications"
cited at https://news.ycombinator.com/item?id=7737423 as 'one of my favorite papers on CRDTs and provides practical pseudocode for learning how to implement CRDTs yourself', in a discussion on cemerick's "Distributed Systems and the End of the API": http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/
distcomp  networking  distributed  crdts  algorithms  text  data-structures  cap 
may 2014 by jm
'Pickles & Spores: Improving Support for Distributed Programming in Scala
'Spores are "small units of possibly mobile functional behavior". They're a closure-like abstraction meant for use in distributed or concurrent environments. Spores provide a guarantee that the environment is effectively immutable, and safe to ship over the wire. Spores aim to give library authors some confidence in exposing functions (or, rather, spores) in public APIs for safe consumption in a distributed or concurrent environment.

The first part of the talk covers a simpler variant of spores as they are proposed for inclusion in Scala 2.11. The second part of the talk briefly introduces a current research project ongoing at EPFL which leverages Scala's type system to provide type constraints that give authors finer-grained control over spore capturing semantics. What's more, these type constraints can be composed during spore composition, so library authors are effectively able to propagate expert knowledge via these composable constraints.

The last part of the talk briefly covers Scala/Pickling, a fast new, open serialization framework.'
pickling  scala  presentations  spores  closures  fp  immutability  coding  distributed  distcomp  serialization  formats  network 
april 2014 by jm
Scalable Atomic Visibility with RAMP Transactions
Great new distcomp protocol work from Peter Bailis et al:
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...]

How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.

Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.

The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
scale  synchronization  databases  distcomp  distributed  ramp  transactions  scalability  peter-bailis  protocols  sharding  concurrency  atomic  partitions 
april 2014 by jm
Blockade
'Testing applications under slow or flaky network conditions can be difficult and time consuming. Blockade aims to make that easier. A config file defines a number of docker containers and a command line tool makes introducing controlled network problems simple.'

Open-source release from Dell's Cloud Manager team (ex-Enstratius), inspired by aphyr's Jepsen. Simulates packet loss using "tc netem", so no ability to e.g. drop packets on certain flows or certain ports. Still, looks very usable -- great stuff.
testing  docker  networking  distributed  distcomp  enstratius  jepsen  network  outages  partitions  cap  via:lusis 
february 2014 by jm
The trouble with timestamps
Timestamps, as implemented in Riak, Cassandra, et al, are fundamentally unsafe ordering constructs. In order to guarantee consistency you, the user, must ensure locally monotonic and, to some extent, globally monotonic clocks. This is a hard problem, and NTP does not solve it for you. When wall clocks are not properly coupled to the operations in the system, causal constraints can be violated. To ensure safety properties hold all the time, rather than probabilistically, you need logical clocks.
clocks  time  distributed  databases  distcomp  ntp  via:fanf  aphyr  vector-clocks  last-write-wins  lww  cassandra  riak 
october 2013 by jm
Non-blocking transactional atomicity
interesting new distributed atomic transaction algorithm from Peter Bailis
algorithms  database  distributed  scalability  storage  peter-bailis  distcomp 
october 2013 by jm
Non-blocking transactional atomicity
Peter Bailis with an interesting distributed-storage atomicity algorithm for performing multi-record transactional updates
algorithms  nbta  transactions  databases  storage  distcomp  distributed  atomic  coding  eventual-consistency  crdts 
september 2013 by jm
_In Search of an Understandable Consensus Algorithm_, Diego Ongaro and John Ousterhout, Stanford

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety. Results from a user study demonstrate that Raft is easier for students to learn than Paxos.
distributed  algorithms  paxos  raft  consensus-algorithms  distcomp  leader-election  replication  clustering 
august 2013 by jm
etcd
A highly-available key value store for shared configuration and service discovery. etcd is inspired by zookeeper and doozer, with a focus on:

Simple: curl'able user facing API (HTTP+JSON);
Secure: optional SSL client cert authentication;
Fast: benchmarked 1000s of writes/s per instance;
Reliable: Properly distributed using Raft;

Etcd is written in go and uses the raft consensus algorithm to manage a highly availably replicated log.

One of the core components of CoreOS -- http://coreos.com/ .
configuration  distributed  raft  ha  doozer  zookeeper  go  replication  consensus-algorithm  etcd  coreos 
august 2013 by jm
The CAP FAQ by henryr
No subject appears to be more controversial to distributed systems engineers than the oft-quoted, oft-misunderstood CAP theorem. The purpose of this FAQ is to explain what is known about CAP, so as to help those new to the theorem get up to speed quickly, and to settle some common misconceptions or points of disagreement.
database  distributed  nosql  cap  consistency  cap-theorem  faqs 
june 2013 by jm
Call me maybe: Carly Rae Jepsen and the perils of network partitions
Kyle "aphyr" Kingsbury expands on his slides demonstrating the real-world failure scenarios that arise during some kinds of partitions (specifically, the TCP-hang, no clear routing failure, network partition scenario). Great set of blog posts clarifying CAP
distributed  network  databases  cap  nosql  redis  mongodb  postgresql  riak  crdt  aphyr 
may 2013 by jm
Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node
an excellent writeup on Kafka 0.8's use and operation, including details of the new replication features
kafka  replication  queueing  distributed  ops 
april 2013 by jm
CouchDB: not drinking the kool-aid
Jonathan Ellis on some CouchDB negatives:
Here are some reasons you should think twice and do careful testing before using CouchDB in a non-toy project:
Writes are serialized.  Not serialized as in the isolation level, serialized as in there can only be one write active at a time.  Want to spread writes across multiple disks?  Sorry.
CouchDB uses a MVCC model, which means that updates and deletes need to be compacted for the space to be made available to new writes.  Just like PostgreSQL, only without the man-years of effort to make vacuum hurt less.
CouchDB is simple.  Gloriously simple.  Why is that a negative?  It's competing with systems (in the popular imagination, if not in its author's mind) that have been maturing for years.  The reason PostgreSQL et al have those features is because people want them.  And if you don't, you should at least ask a DBA with a few years of non-MySQL experience what you'll be missing.  The majority of CouchDB fans don't appear to really understand what a good relational database gives them, just as a lot of PHP programmers don't get what the big deal is with namespaces.
A special case of simplicity deserves mention: nontrivial queries must be created as a view with mapreduce.  MapReduce is a great approach to trivially parallelizing certain classes of problem.  The problem is, it's tedious and error-prone to write raw MapReduce code.  This is why Google and Yahoo have both created high-level languages on top of it (Sawzall and Pig, respectively).  Poor SQL; even with DSLs being the new hotness, people forget that SQL is one of the original domain-specific languages.  It's a little verbose, and you might be bored with it, but it's much better than writing low-level mapreduce code.
cassandra  couch  nosql  storage  distributed  databases  consistency 
april 2013 by jm
Netflix Curator
a high-level API that greatly simplifies using ZooKeeper. It adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster and retrying operations. Some of the features are:

Automatic connection management: There are potential error cases that require ZooKeeper clients to recreate a connection and/or retry operations. Curator automatically and transparently (mostly) handles these cases.

Cleaner API: simplifies the raw ZooKeeper methods, events, etc.; provides a modern, fluent interface

Recipe implementations (see Recipes): Leader election, Shared lock, Path cache and watcher, Distributed Queue, Distributed Priority Queue
zookeeper  java  netflix  distcomp  libraries  oss  open-source  distributed 
march 2013 by jm
Marc Brooker's "two-randoms" load balancing approach
Marc Brooker on this interesting load-balancing algorithm, including simulation results:
Using stale data for load balancing leads to a herd behavior, where requests will herd toward a previously quiet host for much longer than it takes to make that host very busy indeed. The next refresh of the cached load data will put the server high up the load list, and it will become quiet again. Then busy again as the next herd sees that it's quiet. Busy. Quiet. Busy. Quiet. And so on. One possible solution would be to give up on load balancing entirely, and just pick a host at random. Depending on the load factor, that can be a good approach. With many typical loads, though, picking a random host degrades latency and reduces throughput by wasting resources on servers which end up unlucky and quiet.

The approach taken by the studies surveyed by Mitzenmacher is to try two hosts, and pick the one with the least load. This can be done directly (by querying the hosts) but also works surprisingly well on cached load data. [...] Best of 2 is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.


Having seen what Marc has worked on, and written, inside Amazon, I'd take this very seriously... cool to see he is blogging externally too.
algorithm  load-balancing  distcomp  distributed  two-randoms  marc-brooker  least-conns 
february 2013 by jm
Timelike 2: everything fails all the time
Fantastic post on large-scale distributed load balancing strategies from @aphyr. Random and least-conns routing comes out on top in his simulation (although he hasn't yet tried Marc Brooker's two-randoms routing strategy)
via:hn  routing  distributed  least-conns  load-balancing  round-robin  distcomp  networking  scaling 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
Heroku finds out that distributed queueing is hard
Stage 3 of the Rap Genius/Heroku blog drama. Summary (as far as I can tell): Heroku gave up on a fully-synchronised load-balancing setup ("intelligent routing"), since it didn't scale, in favour of randomised queue selection; they didn't sufficiently inform their customers, and metrics and docs were not updated to make this change public; the pessimal case became pretty damn pessimal; a customer eventually noticed and complained publicly, creating a public shit-storm.

Comments: 1. this is why you monitor real HTTP request latency (scroll down for crazy graphs!). 2. include 90/99 percentiles to catch the "tail" of poorly-performing requests. 3. Load balancers are hard.

http://aphyr.com/posts/277-timelike-a-network-simulator has more info on the intricacies of distributed load balancing -- worth a read.
heroku  rap-genius  via:hn  networking  distcomp  distributed  load-balancing  ip  queueing  percentiles  monitoring 
february 2013 by jm
Clairvoyant Squirrel: Large Scale Malicious Domain Classification
Storm-based service to detect malicious DNS domain usage from streaming pcap data in near-real-time. Uses string features in the DNS domain, along with randomness metrics using Markov analysis, combined with a Random Forest classifier, to achieve 98% precision at 10,000 matches/sec
storm  distributed  distcomp  random-forest  classifiers  machine-learning  anti-spam  slides 
february 2013 by jm
Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm
Storm demo with a reasonably complex topology.
'how to implement a distributed, real-time trending topics algorithm in Storm. It uses the latest features available in Storm 0.8 (namely tick tuples) and should be a good starting point for anyone trying to implement such an algorithm for their own application. The new code is now available in the official storm-starter repository, so feel free to take a deeper look.'
storm  distcomp  distributed  tick-tuples  demo 
january 2013 by jm
Notes on Distributed Systems for Young Bloods
'Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.' This is a pretty nice list, a little over-stated, but that's the format. I particularly like the following: 'Exploit data-locality'; 'Learn to estimate your capacity'; 'Metrics are the only way to get your job done'; 'Use percentiles, not averages'; 'Extract services'.
systems  distributed  distcomp  cap  metrics  coding  guidelines  architecture  backpressure  design  twitter 
january 2013 by jm
Trident: a high-level abstraction for realtime computation
built on Storm:

Trident is a new high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.0. It allows you to seamlessly mix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar - Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.
distributed  realtime  twitter  storm  trident  distcomp  stream-processing  low-latency  nathan-marz 
october 2012 by jm
Spanner: Google's Globally-Distributed Database [PDF]

Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

To appear in:
OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.
database  distributed  google  papers  toread  pdf  scalability  distcomp  transactions  cap  consistency 
september 2012 by jm
Fault Tolerance in a High Volume, Distributed System
Netflix's "DependencyCommand", a resiliency system for SOA inter-service network calls, offering builtin support for threadpools, timeouts, retries and graceful failover. Very nice
netflix  architecture  concurrency  distributed  failover  ha  resiliency  fail-fast  failsafe  soa  fault-tolerance 
march 2012 by jm
Apache Kafka
'Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.' neat
kafka  linkedin  apache  distributed  messaging  pubsub  queue  incubator  scaling 
february 2012 by jm
Hacker News thread on Storm
lots of good questions and answers in here
twitter  storm  distcomp  distributed 
september 2011 by jm
Storm: distributed and fault-tolerant realtime computation
intro slideshow to this really nifty-looking distcomp platform
distcomp  distributed  realtime  storm  slides  twitter 
september 2011 by jm
Introduction to parallel & distributed algorithms
really interesting parallel algorithm concepts. I'd seen parallel merge sort before from the map-reduce world, but some others are new to me and worth thinking about (via Hacker News)
via:hackernews  algorithms  distributed  parallel  map-reduce  merge-sort  sorting  from delicious
august 2010 by jm
ElasticSearch
nifty; Apache-licensed distributed, RESTful, JSON-over-HTTP, schemaless search server with multi-tenancy
search  distributed  rest  json  apache  elasticsearch  http  from delicious
february 2010 by jm
SD, a distributed bug tracker
now available. sadly, no support for Bugzilla, which is what we use in SpamAssassin (srsly), so I won't be trying it out just yet, but still -- cool
bugs  bug-tracking  trac  prophet  distributed  coding  tools  web  sd 
august 2009 by jm

related tags

aggregation  algorithm  algorithms  anti-spam  apache  aphyr  apollo  architecture  atomic  backpressure  bitcoin  bittorrent  blade  broadcast  bug-tracking  bugs  byzantine-generals  cap  cap-theorem  cassandra  classifiers  clocks  closures  clustering  coding  concurrency  configuration  consensus  consensus-algorithm  consensus-algorithms  consistency  consul  coreos  couch  crdt  crdts  curator  dark-launches  data-structures  database  databases  delay  demo  design  devops  distcomp  distributed  distributed-transactions  distsys  docker  doorman  doozer  dvcs  elasticsearch  enstratius  erlang  error-handling  estimation  etcd  eventual-consistency  exactly-once-delivery  fail-fast  failover  failsafe  failure  faqs  fault-tolerance  feature-flags  formats  fp  gc  git  github  gittorrent  go  golang  google  grpc  guidelines  ha  heroku  histogram  history  hootsuite  http  ids  immutability  incubator  injection  ip  java  jay-kreps  jepsen  json  jvm  kafka  last-write-wins  latency  leader-election  least-conns  leslie-lamport  libraries  limits  linkedin  load-balancing  locking  low-latency  lww  machine-learning  map-reduce  marc-brooker  martin-kleppman  merge-sort  messaging  metrics  mongodb  monitoring  nathan-marz  nbta  netflix  network  network-partitions  networking  nosql  notification  ntp  open-source  ops  oss  outages  papers  parallel  partitions  patterns  paxos  pdf  percentiles  peter-bailis  pickling  postgresql  presentations  programming  prophet  protocols  proxy  pubsub  queue  queueing  raft  ramp  random-forest  rap-genius  rate-limiting  rate-limits  realtime  redis  redlock  reference  reliability  replication  resilience  resiliency  rest  riak  round-robin  routing  saga  scala  scalability  scale  scaling  sd  search  serialization  server-side  sharding  slides  soa  sorting  spark  spores  statistics  storage  storm  stream-processing  streaming  streams  synchronization  systems  tcp  testing  text  tick-tuples  time  titan  tokumx  tools  toread  trac  transactions  trident  twitter  two-randoms  unix  uuids  vector-clocks  via:adriancolyer  via:fanf  via:hackernews  via:hn  via:lusis  visualization  voldemort  waves  web  youtube  zookeeper 

Copy this bookmark:



description:


tags: