jm + architecture   157

BinaryAlert: Serverless, Real-time & Retroactive Malware Detection
This is a serverless stack built on AWS, deployed with Terraform. Not sure what to think about this -- it still makes me shudder a little
aws  serverless  lambda  airbnb  malware  yara  binaryalert  architecture 
17 days ago by jm
Exactly-once Semantics is Possible: Here's How Apache Kafka Does it
How does this feature work? Under the covers it works in a way similar to TCP; each batch of messages sent to Kafka will contain a sequence number which the broker will use to dedupe any duplicate send. Unlike TCP, though—which provides guarantees only within a transient in-memory connection—this sequence number is persisted to the replicated log, so even if the leader fails, any broker that takes over will also know if a resend is a duplicate. The overhead of this mechanism is quite low: it’s just a few extra numeric fields with each batch of messages. As you will see later in this article, this feature adds negligible performance overhead over the non-idempotent producer.
kafka  sequence-numbers  dedupe  deduplication  unique  architecture  distcomp  streaming  idempotence 
6 weeks ago by jm
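The sequence-number scheme in the quote above is easy to sketch. This toy broker is purely illustrative (the names and structure are invented, not Kafka's actual code): it dedupes a resent batch by comparing per-producer sequence numbers against the highest one already appended.

```python
# Toy broker-side idempotent-producer dedupe, in the spirit of the article.
class BrokerLog:
    def __init__(self):
        self.log = []        # stands in for the replicated log
        self.last_seq = {}   # producer_id -> highest sequence appended

    def append(self, producer_id, seq, batch):
        if seq <= self.last_seq.get(producer_id, -1):
            return False                  # duplicate resend: dedupe
        self.log.append((producer_id, seq, batch))
        self.last_seq[producer_id] = seq  # persisted alongside the log, so a
        return True                       # new leader can also spot duplicates

broker = BrokerLog()
broker.append("p1", 0, ["a"])
broker.append("p1", 1, ["b"])
broker.append("p1", 1, ["b"])   # retry after a lost ack: dropped
print(len(broker.log))          # -> 2
```

The key point from the article is that `last_seq` lives in the replicated log itself, not in connection state, which is what makes the dedupe survive a leader failover.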
Delivering Billions of Messages Exactly Once · Segment Blog
holy crap, this is exactly the wrong way to build a massive-scale deduplication system -- with a monster random-access "is this random UUID in the db" lookup
deduping  architecture  horror  segment  messaging  kafka 
7 weeks ago by jm
Open Guide to Amazon Web Services
'A lot of information on AWS is already written. Most people learn AWS by reading a blog or a “getting started guide” and referring to the standard AWS references. Nonetheless, trustworthy and practical information and recommendations aren’t easy to come by. AWS’s own documentation is a great but sprawling resource few have time to read fully, and it doesn’t include anything but official facts, so omits experiences of engineers. The information in blogs or Stack Overflow is also not consistently up to date. This guide is by and for engineers who use AWS. It aims to be a useful, living reference that consolidates links, tips, gotchas, and best practices. It arose from discussion and editing over beers by several engineers who have used AWS extensively.'
amazon  aws  guides  documentation  ops  architecture 
10 weeks ago by jm
Enough with the microservices
Good post!
Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.
architecture  design  microservices  coding  devops  ops  monolith 
12 weeks ago by jm
_Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases_
'Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multi-tenant scale-out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share the lessons we have learnt from our customers on what modern cloud applications expect from databases.'
via:rbranson  aurora  aws  amazon  databases  storage  papers  architecture 
may 2017 by jm
Instead of containerization, give me strong config & deployment primitives
Reasonable list of things Docker does badly at the moment, and a call to fix them. I still think Docker/rkt are a solid approach, even if they're not 100% there yet
docker  containers  complaining  whinge  networking  swarm  deployment  architecture  build  packaging 
april 2017 by jm
on Martin Fowler
mcfunley: 'I think at least 50% of my career has been either contributing to or unwinding one [Martin] Fowler-inspired disaster or another.'

See also: continuous deployment, polyglot programming, microservices

Relevant meme: https://twitter.com/mcfunley/status/857641303521206272/photo/1
funny  quotes  architecture  architecture-astronauts  martin-fowler  cargo-cults  coding  design-patterns  enterprise  continuous-deployment  cd  polyglot-programming  microservices  experts 
april 2017 by jm
AWS Greengrass
AWS Greengrass is software that lets you run local compute, messaging & data caching for connected devices in a secure way. With AWS Greengrass, connected devices can run AWS Lambda functions, keep device data in sync, and communicate with other devices securely – even when not connected to the Internet. Using AWS Lambda, Greengrass ensures your IoT devices can respond quickly to local events, operate with intermittent connections, and minimize the cost of transmitting IoT data to the cloud.

AWS Greengrass seamlessly extends AWS to devices so they can act locally on the data they generate, while still using the cloud for management, analytics, and durable storage. With Greengrass, you can use familiar languages and programming models to create and test your device software in the cloud, and then deploy it to your devices. AWS Greengrass can be programmed to filter device data and only transmit necessary information back to the cloud. AWS Greengrass authenticates and encrypts device data at all points of connection using AWS IoT’s security and access management capabilities. This way, data is never exchanged between devices and the cloud without proven identity.
aws  cloud  iot  lambda  devices  offline  synchronization  architecture 
april 2017 by jm
Lessons Learned in Lambda
In case you were thinking Lambda was potentially usable yet
lambda  aws  shitshow  architecture  serverless 
april 2017 by jm
Spotify’s Love/Hate Relationship with DNS
omg somebody at Spotify really really loves DNS. They even store a DHT hash ring in it. whyyyyyyyyyyy
spotify  networking  architecture  dht  insane  scary  dns  unbound  ops 
april 2017 by jm
Deep Dive on Amazon EBS Elastic Volumes
'March 2017 AWS Online Tech Talks' -- lots about the new volume types
aws  ebs  storage  architecture  ops  slides 
march 2017 by jm
Learn redis the hard way (in production) · trivago techblog
oh god this is pretty awful. this just reads like "don't try to use Redis at scale" to me
redis  scalability  ops  architecture  horror  trivago  php 
march 2017 by jm
Segment.com on cost savings using DynamoDB, autoscaling and ECS
great post.

1. DynamoDB hot shards were a big problem -- and it is terrible that diagnosing this requires a ticket to AWS support! This heat map should be a built-in feature.

2. ECS auto-scaling gets a solid thumbs-up.

3. Switching from ELB to ALB lets them set ports dynamically for individual ECS Docker containers, and then pack as many containers as will fit on a giant EC2 instance.

4. Terraform modules to automate setup and maintenance of ECS, autoscaling groups, and ALBs.
terraform  segment  architecture  aws  dynamodb  alb  elb  asg  ecs  docker 
march 2017 by jm
Martin Fowler's First Law of Distributed Object Design: Don’t
lol. I hadn't seen this one, but it's a good beatdown on distributed objects from back in 2003
distributed-objects  dcom  corba  history  martin-fowler  laws  rules  architecture  2003 
march 2017 by jm
The Occasional Chaos of AWS Lambda Runtime Performance
If our code has modest resource requirements, and can tolerate large changes in performance, then it makes sense to start with the least amount of memory necessary. On the other hand, if consistency is important, the best way to achieve that is by cranking the memory setting all the way up to 1536MB.
It’s also worth noting here that CPU-bound Lambdas may be cheaper to run over time with a higher memory setting, as Jim Conning describes in his article, “AWS Lambda: Faster is Cheaper”. In our tests, we haven’t seen conclusive evidence of that behavior, but much more data is required to draw any strong conclusions.
The other lesson learned is that Lambda benchmarks should be gathered over the course of days, not hours or minutes, in order to provide actionable information. Otherwise, it’s possible to see very impressive performance from a Lambda that might later dramatically change for the worse, and any decisions made based on that information will be rendered useless.
aws  lambda  amazon  performance  architecture  ops  benchmarks 
march 2017 by jm
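The "faster is cheaper" question above comes down to simple arithmetic: Lambda bills GB-seconds rounded up to the next 100 ms, and for a CPU-bound function whose runtime shrinks roughly in proportion to its memory setting, GB-seconds stay nearly flat. The rate and durations below are illustrative assumptions, not current AWS prices.

```python
RATE_PER_GB_S = 0.00001667   # assumed price per GB-second, for illustration

def cost(memory_mb, duration_ms, rate=RATE_PER_GB_S):
    billed_ms = -(-duration_ms // 100) * 100   # round up to the 100 ms boundary
    return (memory_mb / 1024) * (billed_ms / 1000) * rate

cheap_slow = cost(128, 2100)   # small memory, ~2.1 s per call
big_fast = cost(1536, 180)     # 12x the memory, ~12x faster CPU
print(big_fast / cheap_slow)   # ~1.14: 12x the memory for ~14% more per call
```

Which is why "crank it to 1536 MB for consistency" is a cheaper trade than it first looks.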
Manage DynamoDB Items Using Time to Live (TTL)
good call.
Many DynamoDB users store data that has a limited useful life or is accessed less frequently over time. Some of them track recent logins, trial subscriptions, or application metrics. Others store data that is subject to regulatory or contractual limitations on how long it can be stored. Until now, these customers implemented their own time-based data management. At scale, this sometimes meant that they ran a couple of Amazon Elastic Compute Cloud (EC2) instances that did nothing more than scan DynamoDB items, check date attributes, and issue delete requests for items that were no longer needed. This added cost and complexity to their application. In order to streamline this popular and important use case, we are launching a new Time to Live (TTL) feature today. You can enable this feature on a table-by-table basis, specifying an item attribute that contains the expiration time for the item.
dynamodb  ttl  storage  aws  architecture  expiry 
february 2017 by jm
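On the client side, using the TTL feature is just a matter of writing an epoch-seconds expiry attribute alongside each item; this hypothetical helper assumes an attribute named `expires_at` (the name is our choice, DynamoDB only cares that it's the one registered for TTL on the table).

```python
import time

def with_ttl(item, ttl_seconds, now=None):
    """Return a copy of the item carrying an epoch-seconds expiry attribute."""
    now = int(time.time()) if now is None else now
    item = dict(item)
    item["expires_at"] = now + ttl_seconds   # DynamoDB deletes it after this
    return item

# e.g. a trial subscription record that should vanish after 30 days
session = with_ttl({"user": "jm", "trial": True}, ttl_seconds=30 * 24 * 3600)
print(session["expires_at"] > time.time())   # -> True
```

Server-side it's a one-off UpdateTimeToLive call per table naming that attribute, replacing the scan-and-delete EC2 fleets the post describes.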
Fault Domains and the Vegas Rule | Expedia Engineering Blog
I like this concept -- analogous to AWS' AZs -- limit blast radius of an outage by explicitly defining dependency scopes
aws  az  fault-domains  vegas-rule  blast-radius  outages  reliability  architecture 
february 2017 by jm
Hadoop Internals
This is the best documentation on the topic I've seen in a while
hadoop  map-reduce  architecture  coding  java  distcomp 
february 2017 by jm
Mail-Order Kit Houses
could be ordered by mail and built by a single carpenter. Pretty cool
architecture  history  housing  us  kit-houses  mail-order  houses 
february 2017 by jm
sparkey
Spotify's read-only k/v store
spotify  sparkey  read-only  key-value  storage  ops  architecture 
february 2017 by jm
Beringei: A high-performance time series storage engine | Engineering Blog | Facebook Code
Beringei is different from other in-memory systems, such as memcache, because it has been optimized for storing time series data used specifically for health and performance monitoring. We designed Beringei to have a very high write rate and a low read latency, while being as efficient as possible in using RAM to store the time series data. In the end, we created a system that can store all the performance and monitoring data generated at Facebook for the most recent 24 hours, allowing for extremely fast exploration and debugging of systems and services as we encounter issues in production.

Data compression was necessary to help reduce storage overhead. We considered several existing compression schemes and rejected the techniques that applied only to integer data, used approximation techniques, or needed to operate on the entire dataset. Beringei uses a lossless streaming compression algorithm to compress points within a time series with no additional compression used across time series. Each data point is a pair of 64-bit values representing the timestamp and value of the counter at that time. Timestamps and values are compressed separately using information about previous values. Timestamp compression uses a delta-of-delta encoding, so regular time series use very little memory to store timestamps.

From analyzing the data stored in our performance monitoring system, we discovered that the value in most time series does not change significantly when compared to its neighboring data points. Further, many data sources only store integers (despite the system supporting floating point values). Knowing this, we were able to tune previous academic work to be easier to compute by comparing the current value with the previous value using XOR, and storing the changed bits. Ultimately, this algorithm resulted in compressing the entire data set by at least 90 percent.
beringei  compression  facebook  monitoring  tsd  time-series  storage  architecture 
february 2017 by jm
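The two compression tricks in the quote translate to very little code. A toy illustration (the real Beringei/Gorilla encoder then bit-packs these results, which is where the 90 percent saving comes from):

```python
import struct

def delta_of_delta(timestamps):
    """Second differences of a timestamp sequence; 0 for a regular series."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_bits(prev, cur):
    """XOR of the raw 64-bit representations of two float values."""
    p = struct.unpack("<Q", struct.pack("<d", prev))[0]
    c = struct.unpack("<Q", struct.pack("<d", cur))[0]
    return p ^ c

# A regular 60 s series: every delta-of-delta is 0, i.e. nearly free to store.
print(delta_of_delta([0, 60, 120, 180, 240]))   # -> [0, 0, 0]
# Neighbouring samples that don't change XOR to 0; near-identical ones XOR
# to a value that is mostly zero bits, so only the changed bits are stored.
print(xor_bits(38.0, 38.0))                     # -> 0
```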
Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go - Uber Engineering Blog

a competing-consumer messaging queue that is durable, fault-tolerant, highly available and scalable. We achieve durability and fault-tolerance by replicating messages across storage hosts, and high availability by leveraging the append-only property of messaging queues and choosing eventual consistency as our basic model. Cherami is also scalable, as the design does not have a single bottleneck. [...]
Cherami is completely written in Go, a language that makes building highly performant and concurrent system software a lot of fun. Additionally, Cherami uses several libraries that Uber has already open sourced: TChannel for RPC and Ringpop for health checking and group membership. Cherami depends on several third-party open source technologies: Cassandra for metadata storage, RocksDB for message storage, and many other third-party Go packages that are available on GitHub. We plan to open source Cherami in the near future.
cherami  uber  queueing  tasks  queues  architecture  scalability  go  cassandra  rocksdb 
december 2016 by jm
Slicer: Auto-sharding for datacenter applications
Paper from Google describing one of their internal building block services:
A general purpose sharding service. I normally think of sharding as something that happens within a (typically data) service, not as a general purpose infrastructure service. What exactly is Slicer then? It has two key components: a data plane that acts as an affinity-aware load balancer, with affinity managed based on application-specified keys; and a control plane that monitors load and instructs application processes as to which keys they should be serving at any one point in time. In this way, the decisions regarding how to balance keys across application instances can be outsourced to the Slicer service rather than building this logic over and over again for each individual back-end service. Slicer is focused exclusively on the problem of balancing load across a given set of backend tasks; other systems are responsible for adding and removing tasks.


interesting.
google  sharding  slicer  architecture  papers 
december 2016 by jm
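A minimal sketch of the affinity idea, assuming a plain hash ring (Slicer's real assignment logic is load-aware and driven by the control plane, so this only captures the data-plane half):

```python
import bisect
import hashlib

def _h(s):
    """Stable hash of a string onto a large integer ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, tasks):
        self.points = sorted((_h(t), t) for t in tasks)

    def task_for(self, key):
        """Same application key -> same backend task, until reassignment."""
        ring_keys = [p for p, _ in self.points]
        i = bisect.bisect(ring_keys, _h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["task-0", "task-1", "task-2"])
print(ring.task_for("user:42") == ring.task_for("user:42"))   # -> True
```

In Slicer proper, the control plane would rewrite these assignments as it observes load, rather than leaving them to a static hash.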
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
AWS Step Functions – Build Distributed Applications Using Visual Workflows
Looks like one of the few new AWS features that's actually usable and useful -- replacement for SWF
swf  lambda  workflows  aws  architecture 
december 2016 by jm
Simple Queue Service FIFO Queues with Exactly-Once Processing & Deduplication
great, I've looked for this so many times. Only tricky limit I can spot is the 300 tps limit, and it's US-East/US-West only for now
fifo  queues  queueing  architecture  aws  sqs 
november 2016 by jm
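The dedupe contract is roughly "a message whose deduplication ID was seen in the last 5 minutes is accepted but not enqueued again"; a toy model of that window (an illustration of the semantics, not the SQS implementation):

```python
WINDOW_S = 300   # the 5-minute deduplication interval

class FifoQueue:
    def __init__(self):
        self.messages = []
        self.seen = {}   # dedup_id -> time first seen

    def send(self, dedup_id, body, now):
        first = self.seen.get(dedup_id)
        if first is not None and now - first < WINDOW_S:
            return False              # duplicate within the window: dropped
        self.seen[dedup_id] = now
        self.messages.append(body)
        return True

q = FifoQueue()
q.send("order-1", "charge card", now=0)
q.send("order-1", "charge card", now=10)    # client retry: deduplicated
q.send("order-1", "charge card", now=600)   # outside the window: delivered
print(len(q.messages))                      # -> 2
```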
The Square Root Staffing Law
The square root staffing law is a rule of thumb derived from queueing theory, useful for getting an estimate of the capacity you might need to serve an increased amount of traffic.
ops  capacity  planning  rules-of-thumb  qed-regime  efficiency  architecture 
november 2016 by jm
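The rule itself in a few lines: to serve an arrival rate of lambda requests/sec with servers that each handle mu requests/sec, provision roughly s = lambda/mu + c * sqrt(lambda/mu), where the first term covers the average load and the sqrt term absorbs bursts (c around 1-2 for a modest queueing-delay target).

```python
import math

def sqrt_staffing(arrival_rate, service_rate, c=2.0):
    """Square root staffing estimate: base load plus sqrt-sized headroom."""
    load = arrival_rate / service_rate   # "offered load", in servers
    return math.ceil(load + c * math.sqrt(load))

# 900 req/s, each server handles 100 req/s:
# 9 servers of base load + 2*sqrt(9) = 6 of headroom.
print(sqrt_staffing(900, 100))   # -> 15
```

Note the headroom fraction shrinks as you scale: at 100x the load you need only 10x the burst headroom, which is the economy-of-scale argument behind the rule.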
ArquitecturB
amazing architectural-oddities Tumblr (via Present and Correct)
tumblr  art  photography  architecture  weird  odd 
october 2016 by jm
Best practices with Airflow
interesting presentation describing how to architect Airflow ETL setups; see also https://gtoonstra.github.io/etl-with-airflow/principles.html
etl  airflow  batch  architecture  systems  ops 
october 2016 by jm
Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
Osso
"A modern standard for event-oriented data". Avro schema, events have time and type, schema is external and not part of the Avro stream.

'a modern standard for representing event-oriented data in high-throughput operational systems. It uses existing open standards for schema definition and serialization, but adds semantic meaning and definition to make integration between systems easy, while still being size- and processing-efficient.

An Osso event is largely use case agnostic, and can represent a log message, stack trace, metric sample, user action taken, ad display or click, generic HTTP event, or otherwise. Every event has a set of common fields as well as optional key/value attributes that are typically event type-specific.'
osso  events  schema  data  interchange  formats  cep  event-processing  architecture 
september 2016 by jm
Lessons Learned from Using Regexes At Scale
great post from Loggly on production usage of regular expressions on shared, multitenant architecture, where a /.*/ can really screw things up. "NFA isn't a golden ticket" paragraph included
loggly  regexp  regexes  java  dfa  nfa  architecture 
september 2016 by jm
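A small demonstration of why careless patterns hurt on shared infrastructure: nested quantifiers backtrack exponentially on a near-miss input in a backtracking engine like Python's `re` (kept tiny here so it still terminates quickly; each extra character roughly doubles the work).

```python
import re

# Pathological: the nested quantifier (a+)+ lets the engine try exponentially
# many ways to split the run of a's before concluding there's no match.
pathological = re.compile(r"(a+)+$")
print(pathological.match("a" * 15 + "b"))   # -> None, after ~2^15 backtracks

# Equivalent without the nesting: one linear pass to the same answer.
safe = re.compile(r"a+$")
print(safe.match("a" * 15 + "b"))           # -> None, in one pass
```

This is the class of input a multitenant log-parsing service has to defend against, and (per the post's "NFA isn't a golden ticket" point) switching regex engines only moves the trade-offs around.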
AWS Case Study: mytaxi
ECS, Docker, ELB, SQS, SNS, RDS, VPC, and spot instances. Pretty canonical setup these days...
The mytaxi app is also now able to predict daily and weekly spikes. In addition, it has gained the elasticity required to meet demand during special events. Herzberg describes a typical situation on New Year's Eve: “Shortly before midnight everyone needs a taxi to get to parties, and after midnight people want to go home. In past years we couldn't keep up with the demand this generated, which was around three and a half times as high as normal. In November 2015 we moved our Docker container architecture to Amazon ECS, and for the first time ever in December we were able to celebrate a new year in which our system could handle the huge number of requests without any crashes or interruptions—an accomplishment that we were extremely proud of. We had faced the biggest night on the calendar without any downtime.”
mytaxi  aws  ecs  docker  elb  sqs  sns  rds  vpc  spot-instances  ops  architecture 
august 2016 by jm
Why Uber Engineering Switched from Postgres to MySQL
Uber bringing the smackdown for the HN postgres fanclub, with some juicy technical details of issues that caused them pain. FWIW, I was bitten by crappy postgres behaviour in the past (specifically around vacuuming and pgbouncer), so I've long been a MySQL fan ;)
database  mysql  postgres  postgresql  uber  architecture  storage  sql 
july 2016 by jm
QA Instability Implies Production Instability
Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is both unreliable in that it is frequently not available for testing, and unreliable in the sense that the system’s behavior in QA is not a good predictor of its behavior in production.
qa  testing  architecture  patterns  systems  production 
july 2016 by jm
Camille Fournier's excellent rant on microservices
I haven’t even gotten into the fact that your microservices are an inter-dependent environment, as much as you may wish otherwise, and one service acting up can cause operational problems for the whole team. Maybe if you have Netflix-scale operational hardening that’s not a problem. Do you? Really? Is that the best place to spend your focus and money right now, all so teams can throw shit against the wall to see if it sticks?
Don’t sell people fantasies. This is not the reality for a mid-sized tech team working in microservices. There are enough valuable components to building out such a system without the fantastical claims of self-organizing teams who build cool hack projects in 2 week sprints that change the business. Microservices don’t make organizational problems disappear due to self-organization. They allow for some additional degrees of team and process independence and force very explicit decoupling, in exchange, there is overall system complexity and overall system coordination overhead. I personally think that’s enough value, especially when you are coming from a monolith that is failing to scale, but this model is not a panacea.
microservices  rants  camille-fournier  architecture  decoupling  dependencies 
july 2016 by jm
Open Sourcing Twitter Heron
Twitter are open sourcing their Storm replacement, and moving it to an independent open source foundation
open-source  twitter  heron  storm  streaming  architecture  lambda-architecture 
may 2016 by jm
Neutrino Software Load Balancer
eBay's software LB, supporting URL matching, comparable to haproxy, built using Netty and Scala. Used in their QA infrastructure it seems
netty  scala  ebay  load-balancing  load-balancers  url  http  architecture 
february 2016 by jm
"What the hell have you built"
cut out and keep PNG for many occasions! "Why is Redis talking to MongoDB?"
mongodb  redis  funny  architecture  gifs  png  reactiongifs 
january 2016 by jm
5 subtle ways you're using MySQL as a queue, and why it'll bite you
Excellent post from Percona. I particularly like that they don't just say "don't use MySQL" -- they give good advice on how it can be made work: 1) avoid polling; 2) avoid locking; and 3) avoid storing your queue in the same table as other data.
database  mysql  queueing  queue  messaging  percona  rds  locking  sql  architecture 
january 2016 by jm
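Point 2 ("avoid locking") in miniature: instead of SELECT ... FOR UPDATE, claim a job with an optimistic UPDATE and check the affected-row count. Sketched on sqlite so it's self-contained; the SQL shape is the same on MySQL, and the table/column names are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, claimed_by TEXT)")
db.execute("INSERT INTO jobs (claimed_by) VALUES (NULL), (NULL)")

def claim(worker):
    """Claim one unclaimed job without holding a lock across the read."""
    row = db.execute(
        "SELECT id FROM jobs WHERE claimed_by IS NULL LIMIT 1").fetchone()
    if row is None:
        return None
    cur = db.execute(
        "UPDATE jobs SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
        (worker, row[0]))
    # rowcount == 0 means another worker won the race; just try again.
    return row[0] if cur.rowcount == 1 else claim(worker)

first, second, none_left = claim("w1"), claim("w2"), claim("w3")
print(first, second, none_left)   # -> 1 2 None
```

Pair this with notifications or backoff rather than tight polling, per point 1, and keep the jobs table away from your other data, per point 3.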
Google Cloud Platform HTTP/HTTPS Load Balancing
GCE's LB product is pretty nice -- HTTP/2 support, and a built-in URL mapping feature (presumably modelled on how Google approaches that problem internally). I'm hoping AWS are taking notes for the next generation of ELB, if that ever happens
elb  gce  google  load-balancing  http  https  spdy  http2  urls  request-routing  ops  architecture  cloud 
october 2015 by jm
Holistic Configuration Management at Facebook
How FB push config changes from Git (where it is code reviewed, version controlled, and history tracked with strong auth) to Zeus (their Zookeeper fork) and from there to live production servers.
facebook  configuration  zookeeper  git  ops  architecture 
october 2015 by jm
How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code
using Spark, Tesseract, HBase, Solr and Leptonica. Actually pretty feasible
spark  tesseract  hbase  solr  leptonica  pdfs  scanning  cloudera  hadoop  architecture 
october 2015 by jm
(ARC308) The Serverless Company: Using AWS Lambda
Describing PlayOn! Sports' Lambda setup. Sounds pretty productionizable
ops  lambda  aws  reinvent  slides  architecture 
october 2015 by jm
Designing the Spotify perimeter
How Spotify use nginx as a frontline for their sites and services
scaling  spotify  nginx  ops  architecture  ssl  tls  http  frontline  security 
october 2015 by jm
SuperChief: From Apache Storm to In-House Distributed Stream Processing
Another sorry tale of Storm issues:
Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian Colyer, including:
Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server), which added additional complexity to development and operations and also led to over-provisioning.
Ability of tasks to move around led to difficult-to-trace performance problems.
Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic.
Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way.
The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere.
This nasty bug STORM-404 coupled with the aforementioned fact that a single exception can take down a worker led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4.
Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline.
scalability  storm  kafka  librato  architecture  heron  ops 
october 2015 by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
september 2015 by jm
Anatomy of a Modern Production Stack
Interesting post, but I think it falls into a common trap for the xoogler or ex-Amazonian -- assuming that all the BigCo mod cons are required to operate, when some are luxuries that can be skipped for a few years to get some real products built
architecture  ops  stack  docker  containerization  deployment  containers  rkt  coreos  prod  monitoring  xooglers 
september 2015 by jm
Evolution of Babbel’s data pipeline on AWS: from SQS to Kinesis
Good "here's how we found it" blog post:

Our new data pipeline with Kinesis in place allows us to plug in new consumers without causing any damage to the current system, so it’s possible to rewrite all Queue Workers one by one and replace them with Kinesis Workers. In general, the transition to Kinesis was smooth and there were no particularly tricky parts.
Another outcome was significantly reduced costs – handling almost the same amount of data as SQS, Kinesis appeared to be many times cheaper than SQS.
aws  kinesis  kafka  streaming  data-pipelines  streams  sqs  queues  architecture  kcl 
september 2015 by jm
You're probably wrong about caching
Excellent cut-out-and-keep guide to why you should think twice before adding a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
architecture  caching  coding  design  caches  ops  production  scalability 
september 2015 by jm
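The "design for the non-cached case" advice as code: a hypothetical read-through wrapper that treats the cache as an optimization which is allowed to fail, so a dead cache degrades to the slower source of truth instead of an outage.

```python
class ReadThrough:
    def __init__(self, cache, source):
        self.cache, self.source = cache, source

    def get(self, key):
        try:
            hit = self.cache.get(key)
            if hit is not None:
                return hit
        except Exception:
            pass                      # cache down: fall through to the source
        value = self.source(key)
        try:
            self.cache[key] = value   # best-effort repopulation
        except Exception:
            pass
        return value

class BrokenCache(dict):
    """Simulates a cache cluster that errors on every read."""
    def get(self, key):
        raise RuntimeError("cache cluster is on fire")

store = ReadThrough(BrokenCache(), source=lambda k: k.upper())
print(store.get("jm"))   # prints JM -- the source of truth still answers
```

The caveat from the comment above still applies: this only saves you if the uncached source can actually carry the full load, which is exactly the property those outages had lost.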
Scaling Analytics at Amplitude
Good blog post on Amplitude's lambda architecture setup, based on S3 and a custom "real-time set database" they wrote themselves.

antirez' comment from a Redis angle on the set database: http://antirez.com/news/92

HN thread: https://news.ycombinator.com/item?id=10118413
lambda-architecture  analytics  via:hn  redis  set-storage  storage  databases  architecture  s3  realtime 
august 2015 by jm
What does it take to make Google work at scale? [slides]
50-slide summary of Google's stack, compared vs Facebook, Yahoo!, and open-source-land, with the odd interesting architectural insight
google  architecture  slides  scalability  bigtable  spanner  facebook  gfs  storage 
august 2015 by jm
jgc on Cloudflare's log pipeline
Cloudflare are running a 40-machine, 50TB Kafka cluster, ingesting at 15 Gbps, for log processing. Also: Go producers/consumers, capnproto as wire format, and CitusDB/Postgres to store rolled-up analytics output. Also using Space Saver (top-k) and HLL (counting) estimation algorithms.
logs  cloudflare  kafka  go  capnproto  architecture  citusdb  postgres  analytics  streaming 
june 2015 by jm
Semian
Hystrix-style Circuit Breakers and Bulkheads for Ruby/Rails, from Shopify
circuit-breaker  bulkhead  patterns  architecture  microservices  shopify  rails  ruby  networking  reliability  fallback  fail-fast 
june 2015 by jm
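The circuit-breaker half of the pattern, sketched in a few lines (an illustration of the idea, not Semian's Ruby API): after a threshold of consecutive failures the circuit opens and further calls fail fast without touching the sick dependency.

```python
class CircuitOpen(Exception):
    pass

class Breaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0   # consecutive failures seen so far

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            raise CircuitOpen("failing fast")   # circuit open: don't even try
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0   # any success closes the circuit again
        return result

breaker = Breaker(threshold=3)

def flaky():
    raise IOError("dependency timeout")

for _ in range(3):
    try:
        breaker.call(flaky)
    except IOError:
        pass   # three real failures trip the breaker

try:
    breaker.call(flaky)
except CircuitOpen:
    print("circuit open, failed fast")
```

Production breakers (Hystrix, Semian) add a half-open state that periodically lets one probe request through to detect recovery; that's omitted here for brevity.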
Leveraging AWS to Build a Scalable Data Pipeline
Nice detailed description of an auto-scaled SQS worker pool
sqs  aws  ec2  auto-scaling  asg  worker-pools  architecture  scalability 
june 2015 by jm
Twitter ditches Storm
in favour of a proprietary ground-up rewrite called Heron. Reading between the lines it sounds like Storm had problems with latency, reliability, data loss, and supporting back pressure.
analytics  architecture  twitter  storm  heron  backpressure  streaming  realtime  queueing 
june 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns using in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
Eric Brewer interview on Kubernetes
What is the relationship between Kubernetes, Borg and Omega (the two internal resource-orchestration systems Google has built)?

I would say, kind of by definition, there’s no shared code but there are shared people.

You can think of Kubernetes — especially some of the elements around pods and labels — as being lessons learned from Borg and Omega that are, frankly, significantly better in Kubernetes. There are things that are going to end up being the same as Borg — like the way we use IP addresses is very similar — but other things, like labels, are actually much better than what we did internally.

I would say that’s a lesson we learned the hard way.
google  architecture  kubernetes  docker  containers  borg  omega  deployment  ops 
may 2015 by jm
'Microservice AntiPatterns'
presentation from last week's Craft Conference in Budapest; Tammer Saleh of Pivotal with a few antipatterns observed in dealing with microservices.
microservices  soa  architecture  design  coding  software  presentations  slides  tammer-saleh  pivotal  craft 
april 2015 by jm
StackShare
'Discover and discuss the best dev tools and cloud infrastructure services' -- fun!
stackshare  architecture  stack  ops  software  ranking  open-source 
april 2015 by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
april 2015 by jm
Making Pinterest — Learn to stop using shiny new things and love MySQL
'The third reason people go for shiny is because older tech isn’t advertised as aggressively as newer tech. The younger companies need to differentiate from the old guard and be bolder, more passionate and promise to fulfill your wildest dreams. But most new tech sales pitches aren’t generally forthright about their many failure modes. In our early days, we fell into this third trap. We had a lot of growing pains as we scaled the architecture. The most vocal and excited database companies kept coming to us saying they’d solve all of our scalability problems. But nobody told us of the virtues of MySQL, probably because MySQL just works, and people know about it.'

It's true! -- I'm still a happy MySQL user for some use cases, particularly read-mostly relational configuration data...
mysql  storage  databases  reliability  pinterest  architecture 
april 2015 by jm
RADStack - an open source Lambda Architecture built on Druid, Kafka and Samza
'In this paper we presented the RADStack, a collection of complementary technologies that can be used together to power interactive analytic applications. The key pieces of the stack are Kafka, Samza, Hadoop, and Druid. Druid is designed for exploratory analytics and is optimized for low latency data exploration, aggregation, and ingestion, and is well suited for OLAP workflows. Samza and Hadoop complement Druid and add data processing functionality, and Kafka enables high throughput event delivery.'
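The core Lambda Architecture trick -- serving a query by merging a complete-but-stale batch view with a fresh-but-partial realtime view -- can be sketched in a few lines. The view contents and key names here are made up for illustration; this is the pattern, not RADStack's APIs:

```python
from collections import Counter

# Batch view: complete counts up to the last Hadoop run (stale).
batch_view = Counter({"page/a": 1000, "page/b": 500})

# Realtime view: counts for events arriving since that run (fresh, partial).
realtime_view = Counter({"page/a": 12, "page/c": 3})

def query(key):
    """Merge batch and realtime views at query time."""
    return batch_view[key] + realtime_view[key]

print(query("page/a"))  # → 1012
print(query("page/c"))  # → 3  (only seen by the realtime layer so far)
```

In the RADStack the batch view lives in Druid segments built via Hadoop, the realtime view in Druid's realtime ingestion fed by Kafka/Samza; the merge-at-query-time step is what lets the batch layer later correct anything the streaming layer got approximately right.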
druid  samza  kafka  streaming  cep  lambda-architecture  architecture  hadoop  big-data  olap 
april 2015 by jm
Kafka best practices
This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.


tl;dr: limit the number of Kafka clusters; use Avro.
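The reason "use Avro" keeps coming up: a schema format lets readers and writers evolve independently, with the reader filling in defaults for fields an older writer never wrote. A toy pure-Python illustration of that resolution rule -- not the Avro library itself, just the idea:

```python
# Reader's current schema: field name → default value (None = required).
reader_schema = {
    "user_id": None,   # required: no default
    "email":   "",     # added in a later schema version, with a default
}

def resolve(record, schema):
    """Fill in reader-schema defaults for fields missing from a record
    written under an older schema; fail loudly on required fields."""
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default
        else:
            raise ValueError("missing required field: " + field)
    return out

old_record = {"user_id": 42}   # written before 'email' existed
print(resolve(old_record, reader_schema))  # → {'user_id': 42, 'email': ''}
```

Real Avro does this via formal schema-resolution rules plus a schema registry, so every consumer of a Kafka topic can decode every producer's output without coordinated deploys.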
architecture  kafka  storage  streaming  event-processing  avro  schema  confluent  best-practices  tips 
march 2015 by jm
The official REST Proxy for Kafka
The REST Proxy is an open source HTTP-based proxy for your Kafka cluster. The API supports many interactions with your cluster, including producing and consuming messages and accessing cluster metadata such as the set of topics and mapping of partitions to brokers. Just as with Kafka, it can work with arbitrary binary data, but also includes first-class support for Avro and integrates well with Confluent’s Schema Registry. And it is scalable, designed to be deployed in clusters and work with a variety of load balancing solutions.

We built the REST Proxy first and foremost to meet the growing demands of many organizations that want to use Kafka, but also want more freedom to select languages beyond those for which stable native clients exist today. However, it also includes functionality beyond traditional clients, making it useful for building tools for managing your Kafka cluster. See the documentation for a more detailed description of the included features.
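Concretely, producing via the REST Proxy is just an HTTP POST of a JSON-wrapped batch of records to a topic endpoint. A hedged Python sketch that only builds the request -- the endpoint path and content type follow the v1 API of the time, and the host is hypothetical:

```python
import json

def build_produce_request(host, topic, values):
    """Build (url, headers, body) for a Kafka REST Proxy produce call.
    Each value is wrapped in a {"value": ...} record envelope."""
    url = "http://%s/topics/%s" % (host, topic)
    headers = {"Content-Type": "application/vnd.kafka.json.v1+json"}
    body = json.dumps({"records": [{"value": v} for v in values]})
    return url, headers, body

url, headers, body = build_produce_request(
    "restproxy.example.com:8082", "clicks", [{"page": "a"}, {"page": "b"}])
print(url)  # → http://restproxy.example.com:8082/topics/clicks
# You'd then send this with urllib.request or any HTTP client against a live proxy.
```

Which is the whole appeal: any language with an HTTP client and a JSON library becomes a Kafka producer, at the cost of an extra network hop through the proxy.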
kafka  rest  proxies  http  confluent  queues  messaging  streams  architecture 
march 2015 by jm
Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture
Backblaze deliver their take on nearline storage:

'Backblaze’s cloud storage Vaults deliver 99.99999% annual durability, horizontal scalability, and 20 Gbps of per-Vault performance, while being operationally efficient and extremely cost effective. Driven from the same mindset that we brought to the storage market with Backblaze Storage Pods, Backblaze Vaults continue our singular focus of building the most cost-efficient cloud storage around.'
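For a sense of where a durability number like that comes from: a Vault erasure-codes each file into 20 shards spread across Storage Pods, any 17 of which suffice to reconstruct it, so data is lost only if more than 3 shards fail before repair. A back-of-envelope Python sketch using a binomial model -- the 17+3 layout matches what Backblaze describe, but the per-shard annual failure probability below is a made-up illustration, and real shard failures aren't fully independent:

```python
from math import comb

def loss_probability(n, parity, p_shard_fail):
    """P(losing data) = P(more than `parity` of `n` shards fail),
    assuming independent shard failures (a simplification)."""
    return sum(comb(n, k) * p_shard_fail**k * (1 - p_shard_fail)**(n - k)
               for k in range(parity + 1, n + 1))

# 20 shards, any 17 of which can reconstruct the file (3 parity shards).
p = loss_probability(n=20, parity=3, p_shard_fail=0.05)
print("annual loss probability ~ %.2e" % p)
```

The striking thing is the leverage: shrinking `p_shard_fail` (by repairing failed drives quickly) drives the tail probability down roughly as its fourth power, which is how 99.99999% durability figures become plausible with only 17.6% storage overhead.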
architecture  backup  storage  backblaze  nearline  offline  reed-solomon  error-correction 
march 2015 by jm
A Journey into Microservices | Hailo Tech Blog
Excellent three-parter from Hailo, describing their RabbitMQ+Go-based microservices architecture. Very impressive!
hailo  go  microservices  rabbitmq  amqp  architecture  blogs 
march 2015 by jm
Services Engineering Reading List
good list of papers/articles for fans of scalability etc.
architecture  papers  reading  reliability  scalability  articles  to-read 
march 2015 by jm