jm + stacks   4

Observability at Twitter: technical overview, part II
Interesting to me mainly for this tidbit, which matches my own prejudices:
“Pull” vs “push” in metrics collection: At the time of our previous blog post, all our metrics were collected by “pulling” from our collection agents. We discovered two main issues:

* There is no easy way to differentiate service failures from collection agent failures. Service response time out and missed collection request are both manifested as empty time series.
* There is a lack of service quality insulation in our collection pipeline. It is very difficult to set an optimal collection time out for various services. A long collection time from one single service can cause a delay for other services that share the same collection agent.

In light of these issues, we switched our collection model from “pull” to “push” and increased our service isolation. Our collection agent on each host only collects metrics from services running on that specific host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the metrics emitted by the services.

We have seen a significant improvement in collection reliability with these changes. However, as we moved to self service push model, it becomes harder to project the request growth. In order to solve this problem, we plan to implement service quota to address unpredictable/unbounded growth.
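A minimal sketch of the per-host "push" agent described above, assuming each local service exposes a callable that returns its current metrics. The names here (`collect_local`, `push`, the `collection_ok` metric) are hypothetical and for illustration only; they are not Twitter's actual implementation. The point is the separate status metric: a value of 0 means the service failed to answer, while a missing status metric means the agent itself is down.

```python
from typing import Callable, Dict, List, Tuple

Metric = Tuple[str, float]
# Hypothetical reader type: a callable returning {metric_name: value} for one service.
Reader = Callable[[], Dict[str, float]]


def collect_local(services: Dict[str, Reader], host: str) -> List[Metric]:
    """Collect metrics only from services on this host.

    Alongside the service metrics, emit one collection-status metric per
    service, so a failed service is distinguishable from a failed agent.
    """
    batch: List[Metric] = []
    for name, read in services.items():
        try:
            values = read()
            ok = 1.0
        except Exception:
            # The service did not answer; record that rather than staying silent.
            values = {}
            ok = 0.0
        for key, value in values.items():
            batch.append((f"{host}.{name}.{key}", value))
        batch.append((f"{host}.agent.collection_ok.{name}", ok))
    return batch


def push(batch: List[Metric], send: Callable[[List[Metric]], None]) -> None:
    """Push one batch to the backend; `send` is an assumed transport callable."""
    send(batch)
```

Because each agent only talks to services on its own host, a slow service delays at most that host's batch rather than every service sharing a central collector.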
pull  push  metrics  tcp  stacks  monitoring  agents  twitter  fault-tolerance 
march 2016 by jm
Charity Majors - AWS networking, VPC, environments and you
'VPC is the future and it is awesome, and unless you have some VERY SPECIFIC AND CONVINCING reasons to do otherwise, you should be spinning up a VPC per environment with orchestration and prob doing it from CI on every code commit, almost like it’s just like, you know, code.'
networking  ops  vpc  aws  environments  stacks  terraform 
march 2016 by jm
Introducing Chronos: A Replacement for Cron
A distributed, fault-tolerant "cron" is something which comes up frequently -- it makes for a great fault-tolerance building block. This one sounds like it's too closely tied into Mesos, though (IMO).
Chronos is our replacement for cron. It is a distributed and fault-tolerant scheduler which runs on top of Mesos. It's a framework and supports custom mesos executors as well as the default command executor. Thus by default, Chronos executes SH (on most systems BASH) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the mesos slaves on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files and executing them on a remote machine in the background and using asynchronous callbacks to notify Chronos of job completion or failures.
cron  scheduling  mesos  stacks  design  airbnb  chronos  fault-tolerance  distcomp  distributed-computing  scripts  jobs 
march 2013 by jm
Cloudsmith Stack Hammer
something Chris Horn sent on -- using Puppet to build stacks and deploy to AWS using a simple point-and-click interface. looks cool
github  ec2  aws  puppet  stacks  cloudsmith  stack-hammer  via:chorn 
february 2012 by jm
