jm + soa   24

The Death of Microservice Madness in 2018
Quite a good set of potential gotchas, which I've run into myself, including:

'Real world systems often have poorly defined boundaries'
'The complexities of state are often ignored'
'The complexitities of communication are often ignored'
'Versioning can be hard'
'Microservices can be monoliths in disguise'
architecture  devops  microservices  services  soa  coding  monoliths  state  systems 
6 days ago by jm
Should create a separate Hystrix Thread pool for each remote call?
Excellent advice on capacity planning and queueing theory, in the context of Hystrix. Should I use a single thread pool for all dependency callouts, or independent thread pools for each one?
threadpools  pooling  hystrix  capacity  queue-theory  queueing  queues  failure  resilience  soa  microservices 
may 2016 by jm
How Facebook avoids failures
Great paper from Ben Maurer of Facebook in ACM Queue.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves.

This is full of interesting techniques.

* Rapidly deployed configuration changes: Make everybody use a common configuration system; Statically validate configuration changes; Run a canary; Hold on to good configurations; Make it easy to revert.

* Hard dependencies on core services: Cache data from core services. Provide hardened APIs. Run fire drills.

* Increased latency and resource exhaustion: Controlled Delay (based on the anti-bufferbloat CoDel algorithm -- this is really cool); Adaptive LIFO (last-in, first-out) for queue busting; Concurrency Control (essentially a form of circuit breaker).

* Tools that Help Diagnose Failures: High-Density Dashboards with Cubism (horizon charts); What just changed?

* Learning from Failure: the DERP (!) methodology,
ben-maurer  facebook  reliability  algorithms  codel  circuit-breakers  derp  failure  ops  cubism  horizon-charts  charts  dependencies  soa  microservices  uptime  deployment  configuration  change-management 
november 2015 by jm
Diffy: Testing services without writing tests
Play requests against 2 versions of a service. A fair bit more complex than simply replaying logged requests, which took 10 lines of a shell script last time I did it
http  testing  thrift  automation  twitter  diffy  diff  soa  tests 
september 2015 by jm
toxy is a fully programmatic and hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions. It was mainly designed for fuzzing/evil testing purposes, when toxy becomes particularly useful to cover fault tolerance and resiliency capabilities of a system, especially in service-oriented architectures, where toxy may act as intermediate proxy among services.

toxy allows you to plug in poisons, optionally filtered by rules, which essentially can intercept and alter the HTTP flow as you need, performing multiple evil actions in the middle of that process, such as limiting the bandwidth, delaying TCP packets, injecting network jitter latency or replying with a custom error or status code.
toxy  proxies  proxy  http  mitm  node.js  soa  network  failures  latency  slowdown  jitter  bandwidth  tcp 
august 2015 by jm
'Microservice AntiPatterns'
presentation from last week's Craft Conference in Budapest; Tammer Saleh of Pivotal with a few antipatterns observed in dealing with microservices.
microservices  soa  architecture  design  coding  software  presentations  slides  tammer-saleh  pivotal  craft 
april 2015 by jm
Most page loads will experience the 99th percentile response latency
MOST of the page view attempts will experience the 99%'lie server response time in modern web applications. You didn't read that wrong.
latency  metrics  percentiles  p99  web  http  soa 
october 2014 by jm
Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I’ve seen in ages to automate a build. It allows to pre build and reuse shared dependencies, ensuring they’re up to date and reducing your build time. It avoids you to either pollute your Jenkins environment or boot a costly and slow Virtualbox virtual machine using Vagrant. But I don’t feel like it’s production ready in a complex environment, because it adds too much complexity. And I’m not even sure that’s what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
october 2014 by jm
The "sidecar" pattern
Ha, great name. We use this (in the form of Smartstack).
For what it is worth, we faced a similar challenge in earlier services (mostly due to existing C/C++ applications) and we created what was called a "sidecar".  By sidecar, what I mean is a second process on each node/instance that did Cloud Service Fabric operations on behalf of the main process (the side-managed process).  Unfortunately those sidecars all went off and created one-offs for their particular service.  In this post, I'll describe a more general sidecar that doesn't force users to have these one-offs.

Sidenote:  For those not familiar with sidecars, think of the motorcycle sidecar below.  Snoopy would be the main process with Woodstock being the sidecar process.  The main work on the instance would be the motorcycle (say serving your users' REST requests).  The operational control is the sidecar (say serving health checks and management plane requests of the operational platform).
netflix  sidecars  architecture  patterns  smartstack  netflixoss  microservices  soa 
august 2014 by jm
Richard Clayton - Failing at Microservices
Solid warts-and-all confessional blogpost about a team failing to implement a microservices architecture. I'd put most of the blame on insufficient infrastructure to support them (at a code level), inter-personal team problems, and inexperience with large-scale complex multi-service production deployment and the work it was going to require
microservices  devops  collaboration  architecture  fail  team  deployment  soa 
august 2014 by jm
Microservices - Not a free lunch! - High Scalability
Some good reasons not to adopt microservices blindly. Testability and distributed-systems complexity are my biggest fears
microservices  soa  devops  architecture  testing  distcomp 
august 2014 by jm
"The Tail at Scale"
by Jeffrey Dean and Luiz Andre Barroso, Google. A selection of Google's architectural mechanisms used to defeat 99th-percentile latency spikes: hedged requests, tied requests, micro-partitioning, selective replication, latency-induced probation, canary requests.
google  architecture  distcomp  soa  http  partitioning  replication  latency  99th-percentile  canary-requests  hedged-requests 
july 2014 by jm
Microservices and nanoservices
A great reaction to Martin Fowler's "microservices" coinage, from Arnon Rotem-Gal-Oz:

'I guess it is easier to use a new name (Microservices) rather than say that this is what SOA actually meant'; 'these are the very principles of SOA before vendors pushed the [ESB] in the middle.'

Others have also chosen to define microservices slightly differently, as a service written in 10-100 LOC. Arnon's reaction:

“Nanoservice is an antipattern where a service is too fine-grained. A nanoservice is a service whose overhead (communications, maintenance, and so on) outweighs its utility.”

Having dealt with maintaining an over-fine-grained SOA stack in Amazon, I can only agree with this definition; it's easy to make things too fine-grained and create a raft of distributed-computing bugs and deployment/management complexity where there is no need to do so.
architecture  antipatterns  nanoservices  microservices  soa  services  design  esb 
march 2014 by jm
The Microservice Declaration of Independence
"Microservices" seems to be yet another term for SOA; small, decoupled, independently-deployed services, with well-defined public HTTP APIs. Pretty much all the services I've worked on over the past few years have been built in this style. Still, let's keep an eye on this concept anyway.

Another definition seems to be a more FP-style one: -- where the "microservice" does one narrowly-defined thing, and that alone.
microservices  soa  architecture  handwaving  http  services  web  deployment 
march 2014 by jm
"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [PDF]
Google paper describing the infrastructure they've built for cross-service request tracing (ie. "tracer requests"). Features: low code changes required (since they've built it into the internal protobuf libs), low performance impact, sampling, deployment across the ~entire production fleet, output visibility in minutes, and has been live in production for over 2 years. Excellent read
dapper  tracing  http  services  soa  google  papers  request-tracing  tracers  protobuf  devops 
march 2014 by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing dynamic on-the-fly reconfiguration by their ops guys of the circuit breakers and threadpools in production. Mad stuff
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
january 2014 by jm
Jeff Dean - Taming Latency Variability and Scaling Deep Learning [talk]
'what Jeff Dean and team have been up to at Google'. Reducing request latency in a network SOA architecture using backup requests, etc., via Ilya Grigorik
youtube  talks  google  low-latency  soa  architecture  distcomp  jeff-dean  networking 
november 2013 by jm
'a Java port of Twitter's Snowflake thrift service presented as an HTTP-based Dropwizard service'.
an HTTP-based service for generating unique ID numbers at high scale with some simple guarantees. supports returning ID numbers as: JSON and JSONP; Google's Protocol Buffers; Plain text.

At GE, we were more interested in the uncoordinated aspects of Snowflake than its throughput requirements, so HTTP was fine for our needs. We also exposed the core of Snowflake as an embeddable module so it can be directly integrated into our applications. We don't have the guarantees that the Snowflake-Zookeeper integration was providing, but that was also acceptable to us. In places where we really needed high throughput, we leveraged the snowizard-core embeddable module directly.

Odd OSS license, though -- BSDish?
java  open-source  ids  soa  services  snowflake  http 
august 2013 by jm
New Tweets per second record, and how | Twitter Blog
How Twitter scaled up massively in 3 years -- replacing Ruby with the JVM, adopting SOA and custom sharding. Good summary post, looking forward to more techie details soon
twitter  performance  scalability  jvm  ruby  soa  scaling 
august 2013 by jm
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean
average precision in finding root causes compared to baseline and current state-of-the-art methods.

This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
Stability Patterns and Antipatterns [slides]
Michael "Release It!" Nygard's slides from a recent O'Reilly event, discussing large-scale service reliability design patterns
michael-nygard  design-patterns  architecture  systems  networking  reliability  soa  slides  pdf 
may 2013 by jm
Fault Tolerance in a High Volume, Distributed System
Netflix's "DependencyCommand", a resiliency system for SOA inter-service network calls, offering builtin support for threadpools, timeouts, retries and graceful failover. Very nice
netflix  architecture  concurrency  distributed  failover  ha  resiliency  fail-fast  failsafe  soa  fault-tolerance 
march 2012 by jm

related tags

99th-percentile  alarming  algorithms  antipatterns  api  archaius  architecture  automation  bandwidth  ben-maurer  canary-requests  capacity  change-management  charts  circuit-breakers  codel  coding  collaboration  complexity  concurrency  configuration  correlation  craft  cubism  dapper  dependencies  deployment  derp  design  design-patterns  devops  diff  diffy  distcomp  distributed  distributed-systems  docker  dynamic-configuration  esb  facebook  fail  fail-fast  failover  failsafe  failure  failures  fault-tolerance  google  graphs  ha  handwaving  hedged-requests  horizon-charts  http  hystrix  ids  interop  java  jeff-dean  jitter  json  jvm  latency  linkedin  logging  low-latency  machine-learning  marshalling  metrics  michael-nygard  microservices  mitm  monitoring  monoliths  nanoservices  netflix  netflixoss  network  networking  node.js  open-source  ops  p99  papers  partitioning  patterns  pdf  percentiles  performance  pivotal  pooling  presentations  production  protobuf  protocols  provisioning  proxies  proxy  queue-theory  queueing  queues  reliability  replication  request-tracing  resilience  resiliency  root-cause  ruby  scalability  scaling  service-metrics  services  sidecars  slides  slowdown  smartstack  snowflake  soa  software  state  systems  talks  tammer-saleh  tcp  team  testing  tests  threadpools  thrift  toxy  tracers  tracing  twitter  uptime  via:marc  web  web-services  yammer  youtube 

Copy this bookmark: