jm + service-discovery   12

newrelic/sidecar: Gossip-based service discovery. Docker native, but supports static discovery, too.
An AP gossip-based service-discovery sidecar process.
Services communicate to each other through an HAproxy instance on each host that is itself managed and configured by Sidecar. It is inspired by Airbnb's SmartStack. But, we believe it has a few advantages over SmartStack:

Native support for Docker (works without Docker, too!);
No dependence on Zookeeper or other centralized services;
Peer-to-peer, so it works on your laptop or on a large cluster;
Static binary means it's easy to deploy, and there is no interpreter needed;
Tiny memory usage (under 20MB) and few execution threads means its very light weight
clustering  docker  go  service-discovery  ap  sidecar  haproxy  discovery  architecture 
17 days ago by jm
Service discovery at Stripe
Writeup of their Consul-based service discovery system, a bit similar to smartstack. Good description of the production problems that they saw with Consul too, and also they figured out that strong consistency isn't actually what you want in a service discovery system ;)

HN comments are good too:
consul  api  microservices  service-discovery  dns  load-balancing  l7  tcp  distcomp  smartstack  stripe  cap-theorem  scalability 
november 2016 by jm
Librato's service discovery library using Zookeeper (so strongly consistent, but with the ZK downside that an AZ outage can stall service discovery updates region-wide)
zookeeper  service-discovery  librato  java  open-source  load-balancing 
october 2015 by jm
Baker Street
client-side 'service discovery and routing system for microservices' -- another Smartstack, then
python  router  smartstack  baker-street  microservices  service-discovery  routing  load-balancing  http 
october 2015 by jm
Why You Shouldn’t Use ZooKeeper for Service Discovery
In CAP terms, ZooKeeper is CP, meaning that it’s consistent in the face of partitions, not available. For many things that ZooKeeper does, this is a necessary trade-off. Since ZooKeeper is first and foremost a coordination service, having an eventually consistent design (being AP) would be a horrible design decision. Its core consensus algorithm, Zab, is therefore all about consistency. For coordination, that’s great. But for service discovery it’s better to have information that may contain falsehoods than to have no information at all. It is much better to know what servers were available for a given service five minutes ago than to have no idea what things looked like due to a transient network partition. The guarantees that ZooKeeper makes for coordination are the wrong ones for service discovery, and it hurts you to have them.

Yes! I've been saying this for months -- good to see others concurring.
architecture  zookeeper  eureka  outages  network-partitions  service-discovery  cap  partitions 
december 2014 by jm
Zookeeper: not so great as a highly-available service registry
Turns out ZK isn't a good choice as a service discovery system, if you want to be able to use that service discovery system while partitioned from the rest of the ZK cluster:
I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping.

In other words, if you want a highly-available service discovery system in the fact of network partitions, you want an AP service discovery system, rather than a CP one -- and ZK is a CP system.

Another risk, noted on the Netflix Eureka mailing list at :

ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.

I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I'm not entirely clear what happens to the ZK cluster itself in this scenario though.

Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at ; sounds like this was mainly around load and the "thundering herd" overload problem. Their workaround was to decouple ZK availability from their services' availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
zookeeper  service-discovery  ops  ha  cap  ap  cp  service-registry  availability  ec2  aws  network  partitions  eureka  smartstack  pinterest 
november 2014 by jm
Building a Smarter Application Stack - DevOps Ireland
This sounds like a very interesting Dublin meetup -- Engine Yard on thursday night:
This month, we'll have Tomas Doran from Yelp talking about Docker, service discovery, and deployments. 'There are many advantages to a container based, microservices architecture - however, as always, there is no silver bullet. Any serious deployment will involve multiple host machines, and will have a pressing need to migrate containers between hosts at some point. In such a dynamic world hard coding IP addresses, or even host names is not a viable solution. This talk will take a journey through how Yelp has solved the discovery problems using Airbnb’s SmartStack to dynamically discover service dependencies, and how this is helping unify our architecture, from traditional metal to EC2 ‘immutable’ SOA images, to Docker containers.'
meetups  talks  dublin  deployment  smartstack  ec2  docker  yelp  service-discovery 
june 2014 by jm
Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper
This is the written version of a presentation [Camille Fournier] made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

good advice from one of the ZK committers.
zookeeper  service-discovery  architecture  distcomp  camille-fournier  availability  wan  network 
may 2014 by jm
Nice-looking new tool from Hashicorp; service discovery and configuration service, built on Raft for leader election, Serf for gossip-based messaging, and Go. Some features:

* Gossip is performed over both TCP and UDP;

* gossip messages are encrypted symmetrically and therefore secure from eavesdropping, tampering, spoofing and packet corruption (like the incident which brought down S3 for days: );

* exposes both a HTTP interface and (even better) DNS;

* includes explicit support for long-distance WAN operation as well as on LANs.

It all looks very practical and usable. MPL-licensed.

The only potential risk I can see is that expecting to receive config updates from a blocking poll of the HTTP interface needs some good "best practice" docs, to ensure that people don't mishandle the scenario where there is a network partition between your calling code and the Consul server/agent. Without any heartbeating protocol behind the scenes, HTTP is vulnerable to "hung connections" which would result in a config change being silently missed by the client until the connection eventually is timed out, either by the calling code or the client-side kernel. This could potentially take minutes to occur, which in some usage scenarios could be a big, unforeseen problem.
configuration  service-discovery  distcomp  raft  consensus-algorithms  go  mpl  open-source  dns  http  gossip-protocol  hashicorp 
april 2014 by jm
Amazon Route 53 Infima
Colm McCarthaigh has open sourced Infima, 'a library for managing service-level fault isolation using Amazon Route 53'.
Infima provides a Lattice container framework that allows you to categorize each endpoint along one or more fault-isolation dimensions such as availability-zone, software implementation, underlying datastore or any other common point of dependency endpoints may share.

Infima also introduces a new ShuffleShard sharding type that can exponentially increase the endpoint-level isolation between customer/object access patterns or any other identifier you choose to shard on.

Both Infima Lattices and ShuffleShards can also be automatically expressed in Route 53 DNS failover configurations using AnswerSet and RubberTree.
infima  colmmacc  dns  route-53  fault-tolerance  failover  multi-az  sharding  service-discovery 
november 2013 by jm
'a service discovery and orchestration tool that is decentralized, highly available, and fault tolerant. Serf runs on every major platform: Linux, Mac OS X, and Windows. It is extremely lightweight: it uses 5 to 10 MB of resident memory and primarily communicates using infrequent UDP messages [and an] efficient gossip protocol.'
clustering  service-discovery  ops  linux  gossip  broadcast  clusters 
november 2013 by jm
Airbnb's Smartstack
Service discovery a la Airbnb -- Nerve and Synapse: two external daemons that run on each node, Nerve to manage registration in Zookeeper, and Synapse to generate a haproxy configuration file from that, running on each host, allowing connections to all other hosts.
haproxy  services  ops  load-balancing  service-discovery  nerve  synapse  airbnb 
october 2013 by jm

Copy this bookmark: