jm + marc-brooker   8

Jepsen: Hazelcast 3.8.3
Not a very good review of Hazelcast's CAP behaviour from Aphyr. see also for more musings from Marc Brooker on the topic ("PA/EC is a confusing and dangerous behaviour for many cases")
jepsen  aphyr  testing  hazelcast  cap-theorem  reliability  partitions  network  pacelc  marc-brooker 
7 days ago by jm
Exponential Backoff And Jitter
Great go-to explainer blog post for this key distributed-systems reliability concept, from the always-solid Marc Brooker
marc-brooker  distsys  networking  backoff  exponential  jitter  retrying  retries  reliability  occ 
march 2015 by jm
A Quiet Defense of Patterns
Marc Brooker: 'When it comes to building working software in the long term, the emotional pursuit of craft is not as important as the human pursuit of teamwork, or the intellectual pursuit of correctness. Patterns is one of the most powerful ideas we have. The critics may be right that it devalues the craft, but we would all do well to remember that the craft of software is a means, not an end.'
marc-brooker  design-patterns  coding  software  teamwork 
february 2015 by jm
Exactly-Once Delivery May Not Be What You Want
An extremely good explanation from Marc Brooker that exactly-once delivery in a distributed system is very hard.
And so on. There's always a place to slot in one more turtle. The bad news is that I'm not aware of a nice solution to the general problem for all side effects, and I suspect that no such solution exists. On the bright side, there are some very nice solutions that work really well in practice. The simplest is idempotence. This is a very simple idea: we make the tasks have the same effect no matter how many times they are executed.
architecture  messaging  queues  exactly-once-delivery  reliability  fault-tolerance  distcomp  marc-brooker 
november 2014 by jm
Use of Formal Methods at Amazon Web Services
Chris Newcombe, Marc Brooker, et al. writing about their experience using formal specification and model-checking languages (TLA+) in production in AWS:

The success with DynamoDB gave us enough evidence to present TLA+ to the broader engineering community at Amazon. This raised a challenge; how to convey the purpose and benefits of formal methods to an audience of software engineers? Engineers think in terms of debugging rather than ‘verification’, so we called the presentation “Debugging Designs”.

Continuing that metaphor, we have found that software engineers more readily grasp the concept and practical value of TLA+ if we dub it 'Exhaustively-testable pseudo-code'.

We initially avoid the words ‘formal’, ‘verification’, and ‘proof’, due to the widespread view that formal methods are impractical. We also initially avoid mentioning what the acronym ‘TLA’ stands for, as doing so would give an incorrect impression of complexity.

More slides at ; proggit discussion at
formal-methods  model-checking  tla  tla+  programming  distsys  distcomp  ebs  s3  dynamodb  aws  ec2  marc-brooker  chris-newcombe 
june 2014 by jm
Are volatile reads really free?
Marc Brooker with some good test data:
It appears as though reads to volatile variables are not free in Java on x86, or at least on the tested setup. It's true that the difference isn't so huge (especially for the read-only case) that it'll make a difference in any but the more performance sensitive case, but that's a different statement from free.
volatile  concurrency  jvm  performance  java  marc-brooker 
february 2013 by jm
Marc Brooker's "two-randoms" load balancing approach
Marc Brooker on this interesting load-balancing algorithm, including simulation results:
Using stale data for load balancing leads to a herd behavior, where requests will herd toward a previously quiet host for much longer than it takes to make that host very busy indeed. The next refresh of the cached load data will put the server high up the load list, and it will become quiet again. Then busy again as the next herd sees that it's quiet. Busy. Quiet. Busy. Quiet. And so on. One possible solution would be to give up on load balancing entirely, and just pick a host at random. Depending on the load factor, that can be a good approach. With many typical loads, though, picking a random host degrades latency and reduces throughput by wasting resources on servers which end up unlucky and quiet.

The approach taken by the studies surveyed by Mitzenmacher is to try two hosts, and pick the one with the least load. This can be done directly (by querying the hosts) but also works surprisingly well on cached load data. [...] Best of 2 is good because it combines the best of both worlds: it uses real information about load to pick a host (unlike random), but rejects herd behavior much more strongly than the other two approaches.

Having seen what Marc has worked on, and written, inside Amazon, I'd take this very seriously... cool to see he is blogging externally too.
algorithm  load-balancing  distcomp  distributed  two-randoms  marc-brooker  least-conns 
february 2013 by jm

Copy this bookmark: