jm + aws   145

Global Continuous Delivery with Spinnaker
Netflix' CD platform, post-Atlas. looks interesting
continuous-delivery  aws  netflix  cd  devops  ops  atlas  spinnaker 
10 days ago by jm
Awesome new mock DynamoDB implementation:
An implementation of Amazon's DynamoDB, focussed on correctness and performance, and built on LevelDB (well, @rvagg's awesome LevelUP to be precise). This project aims to match the live DynamoDB instances as closely as possible (and is tested against them in various regions), including all limits and error messages.

Why not Amazon's DynamoDB Local? Because it's too buggy! And it differs too much from the live instances in a number of key areas.

We use DynamoDBLocal in our tests -- the availability of that tool is one of the key reasons we have adopted Dynamo so heavily, since we can safely test our code properly with it. This looks even better.
dynamodb  testing  unit-tests  integration-testing  tests  ops  dynalite  aws  leveldb 
16 days ago by jm
Valid MFA token does not work during first 1am hour before daylight savings ends and second 1am hour starts · Issue #1611 · aws/aws-cli
Add another one to the "yay for DST" pile. (also yay for AWS using PST/PDT as default internal timezone instead of UTC...)
utc  timezones  fail  bugs  aws  aws-cli  dst  daylight-savings  time 
26 days ago by jm
It's an Emulator, Not a Petting Zoo: Emu and Lambda
a Lambda emulator in Python, suitable for unit testing lambdas
lambda  aws  coding  unit-tests  dev 
27 days ago by jm
Amazon ECS CLI Tutorial - Amazon EC2 Container Service
super-basic ECS tutorial, using a docker-compose.yml to create a new ECS-managed service fleet
ecs  cli  linux  aws  ec2  hosting  docker  tutorials 
4 weeks ago by jm
Hologram exposes an imitation of the EC2 instance metadata service on developer workstations that supports the [IAM Roles] temporary credentials workflow. It is accessible via the same HTTP endpoint to calling SDKs, so your code can use the same process in both development and production. The keys that Hologram provisions are temporary, so EC2 access can be centrally controlled without direct administrative access to developer workstations.
iam  roles  ec2  authorization  aws  adroll  open-source  cli  osx  coding  dev 
5 weeks ago by jm
(ARC308) The Serverless Company: Using AWS Lambda
Describing PlayOn! Sports' Lambda setup. Sounds pretty productionizable
ops  lambda  aws  reinvent  slides  architecture 
5 weeks ago by jm
AWS re:Invent 2015 Video & Slide Presentation Links with Easy Index
Andrew Spyker's roundup:
my quick index of all re:Invent sessions.  Please wait for a few days and I'll keep running the tool to fill in the index.  It usually takes Amazon a few weeks to fully upload all the videos and slideshares.

Pretty definitive, full text descriptions of all sessions (and there are an awful lot of 'em).
aws  reinvent  andrew-spyker  scraping  slides  presentations  ec2  video 
5 weeks ago by jm
AWS re:Invent 2015 | (CMP406) Amazon ECS at Coursera - YouTube
Coursera are running user-submitted code in ECS! interesting stuff about how they use Docker security/resource-limiting features, forking the ecs-agent code, to run user-submitted code. :O
coursera  user-submitted-code  sandboxing  docker  security  ecs  aws  resource-limits  ops 
6 weeks ago by jm
The Totally Managed Analytics Pipeline: Segment, Lambda, and Dynamo
notable mainly for the details of Terraform support for Lambda: that's a significant improvement to Lambda's production-readiness
aws  pipelines  data  streaming  lambda  dynamodb  analytics  terraform  ops 
7 weeks ago by jm
Rebuilding Our Infrastructure with Docker, ECS, and Terraform
Good writeup of current best practices for a production AWS architecture
aws  ops  docker  ecs  ec2  prod  terraform  segment  via:marc 
7 weeks ago by jm
ECJ ruling on Irish privacy case has huge significance
The only current way to comply with EU law, the judgment indicates, is to keep EU data within the EU. Whether those data can be safely managed within facilities run by US companies will not be determined until the US rules on an ongoing Microsoft case.
Microsoft stands in contempt of court right now for refusing to hand over to US authorities, emails held in its Irish data centre. This case will surely go to the Supreme Court and will be an extremely important determination for the cloud business, and any company or individual using data centre storage. If Microsoft loses, US multinationals will be left scrambling to somehow, legally firewall off their EU-based data centres from US government reach.

(cough, Amazon)
aws  hosting  eu  privacy  surveillance  gchq  nsa  microsoft  ireland 
7 weeks ago by jm
EC2 Spot Blocks for Defined-Duration Workloads
you can now launch Spot instances that will run continuously for a finite duration (1 to 6 hours). Pricing is based on the requested duration and the available capacity, and is typically 30% to 45% less than On-Demand.
ec2  aws  spot-instances  spot  pricing  time 
7 weeks ago by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
8 weeks ago by jm
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region
Painful to read, but: tl;dr: monitoring oversight, followed by a transient network glitch triggering IPC timeouts, which increased load due to lack of circuit breakers, creating a cascading failure
aws  postmortem  outages  dynamodb  ec2  post-mortems  circuit-breakers  monitoring 
9 weeks ago by jm
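On the missing circuit breakers in the postmortem above: a minimal, generic sketch of the pattern, not anything from AWS's own code. The class name, thresholds and injected clock are all made up for illustration. After enough consecutive failures the breaker "opens" and rejects calls outright for a cooling-off period, instead of piling more load onto a struggling dependency.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cooling-off period is over: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each remote call in `breaker.call(...)` and treat `CircuitOpenError` as "back off and serve a fallback" rather than retrying immediately.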
How We Use AWS Lambda for Rapidly Intensifying Workloads · CloudSploit
impressive -- pretty much the entire workload is run from Lambda here
lambda  aws  ec2  autoscaling  cloudsploit 
10 weeks ago by jm
Evolution of Babbel’s data pipeline on AWS: from SQS to Kinesis
Good "here's how we found it" blog post:

Our new data pipeline with Kinesis in place allows us to plug new consumers without causing any damage to the current system, so it’s possible to rewrite all Queue Workers one by one and replace them with Kinesis Workers. In general, the transition to Kinesis was smooth and there were not so tricky parts.
Another outcome was significantly reduced costs – handling almost the same amount of data as SQS, Kinesis appeared to be many times cheaper than SQS.
aws  kinesis  kafka  streaming  data-pipelines  streams  sqs  queues  architecture  kcl 
11 weeks ago by jm
Spot Bid Advisor
analyzes Spot price history to help you determine a bid price that suits your needs.
ec2  aws  spot  spot-instances  history 
12 weeks ago by jm
a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack. S3QL effectively provides a hard disk of dynamic, infinite capacity that can be accessed from any computer with internet access running Linux, FreeBSD or OS-X.
S3QL is a standard conforming, full featured UNIX file system that is conceptually indistinguishable from any local file system. Furthermore, S3QL has additional features like compression, encryption, data de-duplication, immutable trees and snapshotting which make it especially suitable for online backup and archival.
S3QL is designed to favor simplicity and elegance over performance and feature-creep. Care has been taken to make the source code as readable and serviceable as possible. Solid error detection and error handling have been included from the very first line, and S3QL comes with extensive automated test cases for all its components.
filesystems  aws  s3  storage  unix  google-storage  openstack 
12 weeks ago by jm
Amazon EC2 2015 Benchmark: Testing Speeds Between AWS EC2 and S3 Regions
Here we are again, a year later, and still no bloody percentiles! Just amateurish averaging. This is not how you measure anything, ffs. Still, better than nothing I suppose
fail  latency  measurement  aws  ec2  percentiles  s3 
august 2015 by jm
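To illustrate the percentiles gripe in the entry above, here is a small sketch using made-up latency figures (not data from the benchmark). It uses the nearest-rank percentile definition; the sample values and function name are assumptions for illustration only.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with >= pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)  # dragged up by the outliers, matches no real request
p50_ms = percentile(latencies_ms, 50)    # what a typical request sees
p99_ms = percentile(latencies_ms, 99)    # what the unlucky tail sees
```

The mean lands somewhere between the two populations and describes neither, which is why a median plus a high percentile says far more than an average.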
Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library - AWS Big Data Blog
Good advice on production-quality, decent-scale usage of Kinesis in Java with the official library: batching, retries, partial failures, backoff, and monitoring. (Also, jaysus, the AWS Cloudwatch API is awful, looking at this!)
kpl  aws  kinesis  tips  java  batching  streaming  production  cloudwatch  monitoring  coding 
august 2015 by jm
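The retry advice in the KPL entry above boils down to "resend only what failed, and back off between attempts". A rough sketch of that idea follows; it is not the KPL's API. `send_batch` is a hypothetical callable that returns the records that failed, and the delay parameters are arbitrary.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield exponential backoff delays in seconds, with full jitter."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def send_with_retries(records, send_batch, max_retries=5, sleep=time.sleep):
    """Send records, retrying only the ones reported as failed.

    send_batch is a hypothetical callable: it takes a list of records and
    returns the subset that failed. Up to max_retries retries are made after
    the first attempt. Returns whatever still failed at the end.
    """
    pending = send_batch(list(records))
    for delay in backoff_delays(max_retries):
        if not pending:
            break
        sleep(delay)
        pending = send_batch(pending)
    return pending
```

Anything returned at the end is a genuine failure that should be logged or counted in monitoring rather than silently dropped.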
Amazon S3 Introduces New Usability Enhancements
bucket limit increase, and read-after-write consistency in US Standard. About time too! ;)
aws  s3  storage  consistency 
august 2015 by jm
danilop/runjop · GitHub
RunJOP (Run Just Once Please) is a distributed execution framework to run a command (i.e. a job) only once in a group of servers [built using AWS DynamoDB and S3].

nifty! Distributed cron is pretty easy when you've got Dynamo doing the heavy lifting.
dynamodb  cron  distributed-cron  scheduling  runjop  danilop  hacks  aws  ops 
july 2015 by jm
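Why Dynamo makes this easy: a conditional "put if absent" write lets exactly one node claim a job. The sketch below shows that idea with an in-memory stand-in for the table. It is an assumption about the general technique, not RunJOP's actual code, and every name in it is hypothetical.

```python
class ConditionalStore:
    """In-memory stand-in for a table that supports put-if-absent writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        """Store value under key only if the key is new; report whether we won."""
        if key in self._items:
            return False
        self._items[key] = value
        return True


def run_once(store, job_id, node, command):
    """Run command only if this node is the first to claim job_id."""
    if store.put_if_absent(job_id, node):
        command()
        return True
    return False
```

With a real DynamoDB table the claim would be a conditional write that fails if the item already exists, so the database rather than the nodes decides the winner.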
danilop/yas3fs · GitHub
YAS3FS (Yet Another S3-backed File System) is a Filesystem in Userspace (FUSE) interface to Amazon S3. It was inspired by s3fs but rewritten from scratch to implement a distributed cache synchronized by Amazon SNS notifications. A web console is provided to easily monitor the nodes of a cluster.
aws  s3  s3fs  yas3fs  filesystems  fuse  sns 
july 2015 by jm
Revised and much faster, run your own high-end cloud gaming service on EC2!
a g2.2xlarge provides decent Windows GPU performance over the internet, at about $0.53 per hour
gaming  games  ec2  amazon  aws  cloud  windows  hacks 
july 2015 by jm
VPC Flow Logs
we are introducing Flow Logs for the Amazon Virtual Private Cloud.  Once enabled for a particular VPC, VPC subnet, or Elastic Network Interface (ENI), relevant network traffic will be logged to CloudWatch Logs for storage and analysis by your own applications or third-party tools.

You can create alarms that will fire if certain types of traffic are detected; you can also create metrics to help you to identify trends and patterns. The information captured includes information about allowed and denied traffic (based on security group and network ACL rules). It also includes source and destination IP addresses, ports, the IANA protocol number, packet and byte counts, a time interval during which the flow was observed, and an action (ACCEPT or REJECT).
ec2  aws  vpc  logging  tracing  ops  flow-logs  network  tcpdump  packets  packet-capture 
june 2015 by jm
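For the Flow Logs entry above, a small parser sketch. It assumes the default space-separated record layout (version, account, interface, addresses, ports, protocol, packet and byte counts, start/end, action, log status); that field order is not stated in the announcement text quoted here, and the sample record used to exercise it is made up.

```python
from collections import namedtuple

# Assumed default field order for a flow log record.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
INT_FIELDS = {"version", "srcport", "dstport", "protocol", "packets", "bytes", "start", "end"}

FlowRecord = namedtuple("FlowRecord", FIELDS)


def parse_flow_record(line):
    """Split one flow log line into named fields, converting numeric ones to int."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    values = [
        int(value) if name in INT_FIELDS and value != "-" else value  # "-" marks no data
        for name, value in zip(FIELDS, parts)
    ]
    return FlowRecord(*values)
```

From there, filtering for `action == "REJECT"` on a given `dstport` is a one-liner, which covers a fair amount of what people reach for tcpdump to find out.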
Leveraging AWS to Build a Scalable Data Pipeline
Nice detailed description of an auto-scaled SQS worker pool
sqs  aws  ec2  auto-scaling  asg  worker-pools  architecture  scalability 
june 2015 by jm
etcd Clustering in AWS
'a fully-automated solution to build auto-scaling etcd clusters in AWS'
aws  cluster  docker  etcd  asg  autoscaling  ops 
june 2015 by jm
1172401 – Add Amazon root certificates
Well, well -- looks like AWS is about to disrupt PKI, and about time too. If they come up with a Plex-style "provision a cert" API, it'll be revolutionary
pki  ssl  tls  amazon  aws  apis  web-services  ops 
june 2015 by jm
Schedule Recurring AWS Lambda Invocations With The Unreliable Town Clock (UTC)
The Unreliable Town Clock (UTC) is a new, free, public SNS Topic (Amazon Simple Notification Service) that broadcasts a “chime” message every quarter hour to all subscribers. It can send the chimes to AWS Lambda functions, SQS queues, and email addresses.

You can use the chime attributes to run your code every fifteen minutes, or only run your code once an hour (e.g., when minute == "00") or once a day (e.g., when hour == "00" and minute == "00") or any other series of intervals. You can even subscribe a function you want to run only once at a specific time in the future: have the function ignore all invocations until it’s after the time it wants. When it is time, it can perform its job, then unsubscribe itself from the SNS Topic.
alestic  aws  lambda  cron  time  clock  periodic-tasks  recurrence  hacks 
may 2015 by jm
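The chime filtering described in the Unreliable Town Clock entry above could look something like this. It is a sketch only: it assumes the chime has already been decoded into a dict with zero-padded string "hour" and "minute" attributes (matching the `minute == "00"` examples in the quote), and the function name is made up.

```python
def should_run(chime, hours=None, minutes=("00",)):
    """Return True when the chime's hour/minute attributes match the schedule.

    hours=None means "any hour"; minutes defaults to the top of the hour.
    """
    if chime.get("minute") not in minutes:
        return False
    return hours is None or chime.get("hour") in hours


# Hourly: should_run(chime)
# Daily at midnight: should_run(chime, hours=("00",))
# Every chime (quarter-hourly): should_run(chime, minutes=("00", "15", "30", "45"))
```

A subscribed function would call this first and return immediately when it is not its turn.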
Lambda: Bees with Frickin' Laser Beams
an HTTP testing tool in AWS Lambda. nice enough, but still a toy...
lambda  aws  node  javascript  hacks  http  load-testing 
may 2015 by jm
Load data into Redshift from S3 buckets using a pre-canned Lambda function. Looks like it may be a good example of production-quality Lambda
lambda  aws  ec2  redshift  s3  loaders  etl  pipeline 
may 2015 by jm
"certificate verification failed" errors due to crappy Verisign certs and overzealous curl policies
Seth Vargo is correct. It's not the bit length of the key which is at issue, it's the signature algorithm. The entire keychain for the key is signed with SHA1withRSA:

At issue is that the root Verisign key has been marked as weak because of SHA1 and taken out of the curl bundle, which is widely popular, and this will continue to cause more and more issues going forwards as that bundle makes its way into shipping OS distributions and AWS certificate verification breaks.

'This is still happening and curl is now failing on my machine causing all sorts of fun issues (including breaking CocoaPods that are using S3 for storage).' -- @jmhodges

This may be a contributory factor to the issue @nelson saw:

Curl's ca-certs bundle is also used by Node: and doubtless many other apps and packages.

Here's a mailing list thread discussing the issue: -- looks like the curl team aren't too bothered about it.
curl  s3  amazon  aws  ssl  tls  certs  sha1  rsa  key-length  security  cacerts 
april 2015 by jm
'a command line tool that (hopefully) makes it easier to deploy, update, and test functions for AWS Lambda.' much needed IMO -- Lambda is too closed
aws  lambda  mitch-garnaat  coding  testing  cli  kappa 
april 2015 by jm
Cluster-Based Architectures Using Docker and Amazon EC2 Container Service
In this post, we’re going to take a deeper dive into the architectural concepts underlying cluster computing using container management frameworks such as ECS. We will show how these frameworks effectively abstract the low-level resources such as CPU, memory, and storage, allowing for highly efficient usage of the nodes in a compute cluster. Building on some of the concepts detailed in the earlier posts, we will discover why containers are such a good fit for this type of abstraction, and how the Amazon EC2 Container Service fits into the larger ecosystem of cluster management frameworks.
docker  aws  ecs  ec2  ops  hosting  containers  mesos  clusters 
april 2015 by jm
Amazon EC2 Container Service team AmA
a few answers here. Mostly people pointing out shortcomings and the team asking them to start a thread on their forum though :(
ec2  ecs  docker  aws  ops  ama  reddit 
april 2015 by jm
'CredStash is a very simple, easy to use credential management and distribution system that uses AWS Key Management System (KMS) for key wrapping and master-key storage, and DynamoDB for credential storage and sharing.'
aws  credstash  python  security  keys  key-management  secrets  kms 
april 2015 by jm
Run your own high-end cloud gaming service on EC2
Using Steam streaming and EC2 g2.2xlarge spot instances -- 'comes out to around $0.52/hr'. That's pretty compelling IMO
aws  ec2  gaming  games  graphics  spot-instances  hacks  windows  steam 
april 2015 by jm
Microservices and elastic resource pools with Amazon EC2 Container Service
interesting approach to working around ECS' shortcomings -- bit specific to Hailo's microservices arch and IPC mechanism though.

aside: I like their version numbering scheme: ISO-8601, YYYYMMDDHHMMSS. keep it simple!
versioning  microservices  hailo  aws  ec2  ecs  docker  containers  scheduling  allocation  deployment  provisioning  qos 
april 2015 by jm
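The timestamp version scheme mentioned in the Hailo entry above is a one-liner in most languages. A sketch, assuming UTC is used (the post as quoted here does not say which timezone) and with a made-up function name:

```python
from datetime import datetime, timezone


def build_version(now=None):
    """Return a YYYYMMDDHHMMSS version string; UTC is an assumption."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H%M%S")
```

Because every field is fixed-width and ordered from most to least significant, plain string comparison sorts versions chronologically.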
Subscribing AWS Lambda Function To SNS Topic With aws-cli
how to use the AWS command line tools to do this
aws  aws-cli  cli  lambda  sns  hacks 
april 2015 by jm
AWS Lambda Event-Driven Architecture With Amazon SNS
Any message posted to an SNS topic can trigger the execution of custom code you have written, but you don’t have to maintain any infrastructure to keep that code available to listen for those events and you don’t have to pay for any infrastructure when the code is not being run. This is, in my opinion, the first time that Amazon can truly say that AWS Lambda is event-driven, as we now have a central, independent, event management system (SNS) where any authorized entity can trigger the event (post a message to a topic) and any authorized AWS Lambda function can listen for the event, and neither has to know about the other.
aws  ec2  lambda  sns  events  cep  event-processing  coding  cloud  hacks  eric-hammond 
april 2015 by jm
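For the SNS-triggered Lambda setup in the entry above, a minimal handler sketch. It relies on the SNS event shape Lambda delivers (a "Records" list whose items carry an "Sns" object with "TopicArn", "Subject" and "Message"); the handler name and the returned dict layout are arbitrary choices for illustration.

```python
def handler(event, context):
    """Collect the topic, subject and body of every SNS record in the event."""
    messages = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        messages.append({
            "topic": sns.get("TopicArn"),
            "subject": sns.get("Subject"),
            "message": sns.get("Message"),
        })
    return messages
```

The function knows nothing about who published the message, which is the decoupling the post is enthusiastic about.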
Amazon Machine Learning
Upsides of this new AWS service:

* great UI and visualisations.

* solid choice of metric to evaluate the results. Maybe things moved on since I was working on it, but the use of AUC, false positives and false negatives was pretty new when I was working on it. (er, 10 years ago!)


Downsides:

* it could do with more support for unsupervised learning algorithms. Supervised learning means you need to provide training data, which in itself can be hard work. My experience with logistic regression in the past is that it requires very accurate training data, too -- its tolerance for misclassified training examples is poor.

* Also, in my experience, 80% of the hard work of using ML algorithms is writing good tokenisation and feature extraction algorithms. I don't see any help for that here unfortunately. (probably not that surprising as it requires really detailed knowledge of the input data to know what classes can be abbreviated into a single class, etc.)
amazon  aws  ml  machine-learning  auc  data-science 
april 2015 by jm
(SEC307) Building a DDoS-Resilient Architecture with AWS
good slides on a "web application firewall" proxy service, deployable as an auto-scaling EC2 unit
ec2  aws  ddos  security  resilience  slides  reinvent  firewalls  http  elb 
april 2015 by jm
S3's "" endpoint
public documentation of how to work around the legacy S3 multi-region replication behaviour in North America
aws  s3  eventual-consistency  consistency  us-east  replication  workarounds  legacy 
april 2015 by jm
When S3's eventual consistency is REALLY eventual
a consistency outage in S3 last year, resulting in about 40 objects failing read-after-write consistency for a duration of about 23 hours
s3  eventual-consistency  aws  consistency  read-after-writes  bugs  outages  stackdriver 
april 2015 by jm
Can Spark Streaming survive Chaos Monkey?
good empirical results on Spark's resilience to network/host outages in EC2
ec2  aws  emr  spark  resilience  ha  fault-tolerance  chaos-monkey  netflix 
march 2015 by jm
500 Mbps upload to S3
the following guidelines maximize bandwidth usage:
Optimizing the sizes of the file parts, whether they are part of a large file or an entire small file; Optimizing the number of parts transferred concurrently.
Tuning these two parameters achieves the best possible transfer speeds to [S3].
s3  uploads  dataman  aws  ec2  performance 
march 2015 by jm
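The two knobs named in the S3 upload entry above (part size and number of concurrent parts) can be sketched without any S3 client at all. Here `upload_part` is a hypothetical callable taking an offset and a length; everything else is plain standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_parts(total_size, part_size):
    """Split total_size bytes into (offset, length) parts of at most part_size."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (offset, min(part_size, total_size - offset))
        for offset in range(0, total_size, part_size)
    ]


def upload_in_parts(total_size, part_size, concurrency, upload_part):
    """Upload all parts with at most `concurrency` in flight; results keep part order."""
    parts = plan_parts(total_size, part_size)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda part: upload_part(*part), parts))
```

Tuning then means measuring throughput while varying `part_size` and `concurrency`, since the best values depend on the link and the instance.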
Alibaba's cloud service launches in US, wants to rain all over Amazon
server-hosting only for now. Interesting!
Alibaba’s cloud platform already competes with the likes of AWS in China. Aliyun’s Chinese data centers are in Beijing, Hangzhou, Qingdao, Hong Kong, and Shenzhen. “For the time being, we are just testing the water,” Yu said today. That means Aliyun will focus first on Chinese companies doing business in the US. “We know well what Chinese clients need, and now it’s time for us to learn what US clients need,” he added.
alibaba  china  aws  aliyun  hosting 
march 2015 by jm
What Color Is Your Xen?
What a mess.
What's faster: PV, HVM, HVM with PV drivers, PVHVM, or PVH? Cloud computing providers using Xen can offer different virtualization "modes", based on paravirtualization (PV), hardware virtual machine (HVM), or a hybrid of them. As a customer, you may be required to choose one of these. So, which one?
ec2  linux  performance  aws  ops  pv  hvm  xen  virtualization 
february 2015 by jm
Azul Zing on Ubuntu on AWS Marketplace
hmmm, very interesting -- the super-low-latency Zing JVM is available as a commercial EC2 instance type, at a cost lower than the EC2 instance price itself
zing  azul  latency  performance  ec2  aws 
february 2015 by jm
0x74696d | Falling In And Out Of Love with DynamoDB, Part II
Good DynamoDB real-world experience post, via Mitch Garnaat. We should write up ours, although it's pretty scary-stuff-free by comparison
aws  dynamodb  storage  databases  architecture  ops 
february 2015 by jm
Comparing Message Queue Architectures on AWS
A good overview -- I like the summary table. tl;dr:
If you are light on DevOps and not latency sensitive use SQS for job management and Kinesis for event stream processing. If latency is an issue, use ELB or 2 RabbitMQs (or 2 beanstalkds) for job management and Redis for event stream processing.
amazon  architecture  aws  messaging  queueing  elb  rabbitmq  beanstalk  kinesis  sqs  redis  kafka 
february 2015 by jm
AWS Tips I Wish I'd Known Before I Started
Some good advice and guidelines (although some are just silly).
aws  ops  tips  advice  ec2  s3 
january 2015 by jm
AWS Key Management Service Cryptographic Details
"AWS Key Management Service (AWS KMS) provides cryptographic keys and operations scaled for the cloud. AWS KMS keys and functionality are used by other AWS cloud services, and you can use them to protect user data in your applications that use AWS. This white paper provides details on the cryptographic operations that are executed within AWS when you use AWS KMS."
white-papers  aws  amazon  kms  key-management  crypto  pdf 
december 2014 by jm
Aurora for MySQL is coming
'Anurag@AWS posts a quite interesting comment on Aurora failover: We asynchronously write to 6 copies and ack the write when we see four completions. So, traditional 4/6 quorums with synchrony as you surmised. Now, each log record can end up with a independent quorum from any other log record, which helps with jitter, but introduces some sophistication in recovery protocols. We peer to peer to fill in holes. We also will repair bad segments in the background, and downgrade to a 3/4 quorum if unable to place in an AZ for any extended period. You need a pretty bad failure to get a write outage.' (via High Scalability)
via:highscalability  mysql  aurora  failover  fault-tolerance  aws  replication  quorum 
december 2014 by jm
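The quorum arithmetic in the Aurora comment above, as a tiny sketch. The numbers (4 of 6, degrading to 3 of 4) come from the quote; the function names are made up and this says nothing about how Aurora actually implements it.

```python
def write_acked(completions, quorum=4, copies=6):
    """A write is acknowledged once `quorum` of the `copies` have completed it."""
    if not 0 <= completions <= copies:
        raise ValueError("completions must be between 0 and copies")
    return completions >= quorum


def tolerated_failures(quorum=4, copies=6):
    """How many copies can be unavailable while writes still get acknowledged."""
    return copies - quorum
```

So the normal 4/6 mode keeps accepting writes with two copies down, and the degraded 3/4 mode with one down, which is the "pretty bad failure" needed for a write outage.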
(SDD416) Amazon EBS Deep Dive | AWS re:Invent 2014
Excellent data on current EBS performance characteristics
ebs  ops  aws  reinvent  slides 
november 2014 by jm
AWS re:Invent 2014 Video & Slide Presentation Links
Nice work by Andrew Spyker -- this should be an official feature of the re:Invent website, really
reinvent  aws  conferences  talks  slides  ec2  s3  ops  presentations 
november 2014 by jm
AWS re:Invent 2014 | (SPOT302) Under the Covers of AWS: Its Core Distributed Systems - YouTube
This is a really solid talk -- not surprising, alv@ is one of the speakers!
"AWS and operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.

In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives -- DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS. "

scale  scaling  aws  amazon  dht  logging  data-structures  distcomp  via:marc-brooker  dynamodb  s3 
november 2014 by jm
an [XPath-style] query language for JSON. You can extract and transform elements from a JSON document.

Supported by the "aws" CLI tool, and in boto.
aws  boto  jmespath  json  xpath  querying  languages  documents 
november 2014 by jm
DynamoDB Streams
This is pretty awesome. All changes to a DynamoDB table can be streamed to a Kinesis stream, MySQL-replication-style.

The nice bit is that it has a solid way to ensure readers won't get overwhelmed by the stream volume (since ddb tables are IOPS-rate-limited), and Kinesis has a solid way to read missed updates (since it's a Kafka-style windowed persistent stream). With this you have a pretty reliable way to ensure you're not going to suffer data loss.
iops  dynamodb  aws  kinesis  reliability  replication  multi-az  multi-region  failover  streaming  kafka 
november 2014 by jm
Doing Constant Work to Avoid Failures
A good example of a design pattern -- by performing a relatively constant amount of work regardless of the input, we can predict scalability and reduce the risk of overload when something unexpected changes in that input
scalability  scaling  architecture  aws  route53  via:brianscanlan  overload  constant-load  loading 
november 2014 by jm
Why We Didn’t Use Kafka for a Very Kafka-Shaped Problem
A good story of when Kafka _didn't_ fit the use case:
We came up with a complicated process of app-level replication for our messages into two separate Kafka clusters. We would then do end-to-end checking of the two clusters, detecting dropped messages in each cluster based on messages that weren’t in both.

It was ugly. It was clearly going to be fragile and error-prone. It was going to be a lot of app-level replication and horrible heuristics to see when we were losing messages and at least alert us, even if we couldn’t fix every failure case.

Despite us building a Kafka prototype for our ETL — having an existing investment in it — it just wasn’t going to do what we wanted. And that meant we needed to leave it behind, rewriting the ETL prototype.
cassandra  java  kafka  scala  network-partitions  availability  multi-region  multi-az  aws  replication  onlive 
november 2014 by jm
Zookeeper: not so great as a highly-available service registry
Turns out ZK isn't a good choice as a service discovery system, if you want to be able to use that service discovery system while partitioned from the rest of the ZK cluster:
I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances.  This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones.  What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%).  More interestingly the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%.  This caused the first server to stop serving requests to clients (not only writes, but also reads).

So: within that offline AZ, service discovery *reads* (as well as writes) stopped working due to a lack of ZK quorum. This is quite a feasible outage scenario for EC2, by the way, since (at least when I was working there) the network links between AZs, and the links with the external internet, were not 100% overlapping.

In other words, if you want a highly-available service discovery system in the face of network partitions, you want an AP service discovery system, rather than a CP one -- and ZK is a CP system.

Another risk, noted on the Netflix Eureka mailing list at :

ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.

I guess this means that a long partition can trigger SESSION_EXPIRED state, resulting in ZK client libraries requiring a restart/reconnect to fix. I'm not entirely clear what happens to the ZK cluster itself in this scenario though.

Finally, Pinterest ran into other issues relying on ZK for service discovery and registration, described at ; sounds like this was mainly around load and the "thundering herd" overload problem. Their workaround was to decouple ZK availability from their services' availability, by building a Smartstack-style sidecar daemon on each host which tracked/cached ZK data.
zookeeper  service-discovery  ops  ha  cap  ap  cp  service-registry  availability  ec2  aws  network  partitions  eureka  smartstack  pinterest 
november 2014 by jm
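The partition experiment in the ZooKeeper entry above comes down to majority arithmetic. A sketch, with a made-up function name, where `visible_nodes` counts the node itself plus every ensemble member it can still reach:

```python
def has_quorum(visible_nodes, ensemble_size):
    """True when this side of a partition still sees a strict majority of the ensemble."""
    return visible_nodes > ensemble_size // 2


# Three-node ensemble split 2 / 1, as in the iptables test:
majority_side = has_quorum(2, 3)  # the two connected instances keep serving
isolated_side = has_quorum(1, 3)  # the cut-off instance stops serving, reads included
```

An AP-style registry would skip this check for reads and serve whatever it last knew, trading consistency for availability inside the isolated zone.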
Elastic MapReduce vs S3
Turns out there are a few bugs in EMR's S3 support, believe it or not.

1. 'Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.'

2. Upgrade to AMI 3.1.0 or later, otherwise retries of S3 ops don't work.
s3  emr  hadoop  aws  bugs  speculative-execution  ops 
october 2014 by jm
Load testing Apache Kafka on AWS
This is a very solid benchmarking post, examining Kafka in good detail. Nicely done. Bottom line:
I basically spend 2/3 of my work time torture testing and operationalizing distributed systems in production. There's some that I'm not so pleased with (posts pending in draft forever) and some that have attributes that I really love. Kafka is one of those systems that I pretty much enjoy every bit of, and the fact that it performs predictably well is only a symptom of the reason and not the reason itself: the authors really know what they're doing. Nothing about this software is an accident. Performance, everything in this post, is only a fraction of what's important to me and what matters when you run these systems for real. Kafka represents everything I think good distributed systems are about: that thorough and explicit design decisions win.
testing  aws  kafka  ec2  load-testing  benchmarks  performance 
october 2014 by jm
'a set of command line tools for managing Route53 DNS for an AWS infrastructure. It intelligently uses tags and other metadata to automatically create the associated DNS records.'
zonify  aws  dns  ec2  route53  ops 
october 2014 by jm