jm + scalability   85

Locking, Little's Law, and the USL
Excellent explanatory mailing list post by Martin Thompson to the mechanical-sympathy group, discussing Little's Law vs the USL:
Little's law can be used to describe a system in steady state from a queuing perspective, i.e. arrival and leaving rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl's law, in that throughput is one over latency.

However this is an inaccurate way to model a system with locks. Amdahl's law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost then it is much lower than in a multi-threaded environment where cache coherence, other OS costs such as scheduling, and lock implementations need to be considered.

Universal Scalability Law (USL) accounts for both the contention and the coherence costs.
http://www.perfdynamics.com/Manifesto/USLscalability.html

When modelling locks it is necessary to consider how contention and coherence costs vary given how they can be implemented. Consider in Java how we have biased locking, thin locks, fat locks, inflation, and revoking biases which can cause safe points that bring all threads in the JVM to a stop with a significant coherence component.
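For reference, a minimal sketch of the two models (parameter values below are made up purely for illustration; sigma is the contention fraction, kappa the coherence penalty):

```java
// Universal Scalability Law: C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)).
// With kappa = 0 this collapses to Amdahl's law; kappa > 0 gives the
// characteristic peak-then-degrade curve caused by coherence costs.
public final class UslSketch {
    static double capacity(double n, double sigma, double kappa) {
        return n / (1.0 + sigma * (n - 1.0) + kappa * n * (n - 1.0));
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            System.out.printf("N=%d  amdahl=%.1f  usl=%.1f%n",
                    n, capacity(n, 0.05, 0.0), capacity(n, 0.05, 0.001));
        }
    }
}
```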
usl  scaling  scalability  performance  locking  locks  java  jvm  amdahls-law  littles-law  system-dynamics  modelling  systems  caching  threads  schedulers  contention 
8 weeks ago by jm
usl4j And You | codahale.com
Coda Hale wrote a handy java library implementing a USL solver
usl  scalability  java  performance  optimization  benchmarking  measurement  ops  coda-hale 
june 2017 by jm
Scaling Amazon Aurora at ticketea
Ticketing is a business in which extreme traffic spikes are the norm, rather than the exception. For Ticketea, this means that our traffic can increase by a factor of 60x in a matter of seconds. This usually happens when big events (which have a fixed, pre-announced 'sale start time') go on sale.
scaling  scalability  ops  aws  aurora  autoscaling  asg 
may 2017 by jm
Learn redis the hard way (in production) · trivago techblog
oh god this is pretty awful. this just reads like "don't try to use Redis at scale" to me
redis  scalability  ops  architecture  horror  trivago  php 
march 2017 by jm
Cherami: Uber Engineering’s Durable and Scalable Task Queue in Go - Uber Engineering Blog

a competing-consumer messaging queue that is durable, fault-tolerant, highly available and scalable. We achieve durability and fault-tolerance by replicating messages across storage hosts, and high availability by leveraging the append-only property of messaging queues and choosing eventual consistency as our basic model. Cherami is also scalable, as the design does not have a single bottleneck. [...]
Cherami is completely written in Go, a language that makes building highly performant and concurrent system software a lot of fun. Additionally, Cherami uses several libraries that Uber has already open sourced: TChannel for RPC and Ringpop for health checking and group membership. Cherami depends on several third-party open source technologies: Cassandra for metadata storage, RocksDB for message storage, and many other third-party Go packages that are available on GitHub. We plan to open source Cherami in the near future.
cherami  uber  queueing  tasks  queues  architecture  scalability  go  cassandra  rocksdb 
december 2016 by jm
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
"Solving Imaginary Scaling Issues At Scale — Getting the wrong idea from that conference talk you attended"
Amazing virtuoso performance:

Chapter 1: Databases with cool-sounding names.
Chapter 2: using BitTorrent for everything.
Chapter 3: forget Torrents. Use the blockchain for everything.
Chapter 4: sharding the database before adding any indexes.
Chapter 5: upgrading to faster processors without checking if you're limited by disk I/O.
Chapter 6: rewriting APIs in C for speed without compressing data on the wire.
Chapter 7: putting large blobs of binary data into SQL databases for fun and profit.
Chapter 8: using protobufs to poll 300 times per second.
Chapter 9: diagnose scaling issues by grepping 10 lines of code and guessing.
Chapter 10: putting Varnish in front of everything just in case.
Chapter 11: buying boxes with gigantic amounts of RAM.
Chapter 12: realizing your HAProxy box is still a micro instance.
Chapter 13: rewriting 3 of 10 features in Go and declaring victory.
Chapter 14: split everything into 35 microservices all maintained by 1 person.
Chapter 15: 300% performance boosts by deleting data validity checks.
Chapter 16: minifying the JS of your O(n^3) to-do list.
Chapter 17: Fuck It, Let's Try Erlang.
Chapter 18: Blaming Everything On The Last Person To Quit.
Chapter 19: A Bloom Filter Will Definitely Fix This.
Chapter 20: Move all client-side processing to the server and/or vice-versa.
Chapter 21: Putting A Node.js Proxy In Front Of Our COBOL Backend Will Definitely Improve Matters.
Chapter 22: A Type-Checked Transpilation Step Will Surely Speed Things Up.
Chapter 23: Writing A New Language Almost The Same As Your Old Language But Faster (guest chapter by Facebook).
Chapter 24: Replacing an SQL DB with a NoSQL DB then implementing SQL in your ORM.
Chapter 25: Migrating From Bare Metal To The Cloud Or Vice-Versa, Whichever You're Not Currently Doing.
Chapter 26: Putting everything behind a CDN except the slow, complicated parts.
Chapter 27: Applying distributed map-reduce to less than 1 gigabyte of data.
Chapter 28: Running exactly the same software, but in Docker.
Chapter 29: Machine learning: how it will magically fix your crappy code.
Chapter 30: Blaming your package manager for slow run-time performance.
Chapter 31: Moving processing from the CPU to the GPU without changing the algorithm.
Chapter 32: Switching To Heroku Or Away From Heroku Or A Hybrid Heroku-AWS model, whichever sounds the most fun.
Chapter 33: Loading all your dependencies from somebody else's github repo.
Chapter 34: optimizing your PNGs while hosting 300MB video ads.
Chapter 35: hosting your database in memory and your images on S3.
scalability  funny  lol  twitter  oreilly 
november 2016 by jm
Service discovery at Stripe
Writeup of their Consul-based service discovery system, a bit similar to smartstack. Good description of the production problems that they saw with Consul too, and also they figured out that strong consistency isn't actually what you want in a service discovery system ;)

HN comments are good too: https://news.ycombinator.com/item?id=12840803
consul  api  microservices  service-discovery  dns  load-balancing  l7  tcp  distcomp  smartstack  stripe  cap-theorem  scalability 
november 2016 by jm
Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
How to Quantify Scalability
good page on the Universal Scalability Law and how to apply it
usl  performance  scalability  concurrency  capacity  measurement  excel  equations  metrics 
september 2016 by jm
Hashed Wheel Timer
nice java impl of this efficient data structure, broken out from Project Reactor
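Roughly how it works -- a hedged, single-threaded sketch (not the Reactor code): timeouts are hashed by expiry tick into a ring of buckets, so each tick only scans one bucket:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative hashed wheel timer; no locking, driven by an external tick source.
final class WheelTimerSketch {
    static final class Timeout {
        final long deadlineTick;   // absolute tick at which to fire
        final Runnable task;
        Timeout(long deadlineTick, Runnable task) { this.deadlineTick = deadlineTick; this.task = task; }
    }

    private final List<List<Timeout>> wheel;
    private final long tickMillis;
    private long currentTick = 0;

    WheelTimerSketch(int wheelSize, long tickMillis) {
        this.tickMillis = tickMillis;
        this.wheel = new ArrayList<>(wheelSize);
        for (int i = 0; i < wheelSize; i++) wheel.add(new ArrayList<>());
    }

    void schedule(long delayMillis, Runnable task) {
        long deadline = currentTick + Math.max(1, delayMillis / tickMillis);
        int bucket = (int) (deadline % wheel.size());   // hash by expiry tick
        wheel.get(bucket).add(new Timeout(deadline, task));
    }

    // Called once per tick by a driver thread; fires everything due in this bucket.
    void advance() {
        currentTick++;
        List<Timeout> bucket = wheel.get((int) (currentTick % wheel.size()));
        bucket.removeIf(t -> {
            if (t.deadlineTick <= currentTick) { t.task.run(); return true; }
            return false;                               // belongs to a later round of the wheel
        });
    }
}
```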
scalability  java  timers  hashed-wheel-timers  algorithms  data-structures 
march 2016 by jm
Topics in High-Performance Messaging
'We have worked together in the field of high-performance messaging for many years, and in that time, have seen some messaging systems that worked well and some that didn't. Successful deployment of a messaging system requires background information that is not easily available; most of what we know, we had to learn in the school of hard knocks. To save others a knock or two, we have collected here the essential background information and commentary on some of the issues involved in successful deployments. This information is organized as a series of topics around which there seems to be confusion or uncertainty. Please contact us if you have questions or comments.'
messaging  scalability  scaling  performance  udp  tcp  protocols  multicast  latency 
december 2015 by jm
SuperChief: From Apache Storm to In-House Distributed Stream Processing
Another sorry tale of Storm issues:
Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian Colyer, including:
Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server), which added additional complexity to development and operations and also led to over-provisioning.
Ability of tasks to move around led to difficult-to-trace performance problems.
Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic.
Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way.
The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere.
This nasty bug STORM-404 coupled with the aforementioned fact that a single exception can take down a worker led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4.
Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline.
scalability  storm  kafka  librato  architecture  heron  ops 
october 2015 by jm
Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter - High Scalability
Initially I thought they were just tracking client state on the phone, but it actually sounds like they're replicating other users' state, too. Mad stuff! Must cost a fortune in additional data transfer costs...
scalability  failover  multi-dc  uber  replication  state  crdts 
september 2015 by jm
You're probably wrong about caching
Excellent cut-out-and-keep guide to whether you should add a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
architecture  caching  coding  design  caches  ops  production  scalability 
september 2015 by jm
What does it take to make Google work at scale? [slides]
50-slide summary of Google's stack, compared vs Facebook, Yahoo!, and open-source-land, with the odd interesting architectural insight
google  architecture  slides  scalability  bigtable  spanner  facebook  gfs  storage 
august 2015 by jm
Patrick Shuff - Building A Billion User Load Balancer - SCALE 13x - YouTube
'Want to learn how Facebook scales their load balancing infrastructure to support more than 1.3 billion users? We will be revealing the technologies and methods we use to route and balance Facebook's traffic. The Traffic team at Facebook has built several systems for managing and balancing our site traffic, including both a DNS load balancer and a software load balancer capable of handling several protocols. This talk will focus on these technologies and how they have helped improve user performance, manage capacity, and increase reliability.'

Can't find the standalone slides, unfortunately.
facebook  video  talks  lbs  load-balancing  http  https  scalability  scale  linux 
june 2015 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
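A minimal, generic DStream job (standard Spark Streaming Java API, not code from the paper) to show the micro-batch model -- each 1-second batch becomes an RDD, and recovery comes from lineage rather than replication:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public final class DStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]");
        // Each 1-second micro-batch is materialized as an RDD with a lineage graph.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<Long> counts = lines.count();   // one count per micro-batch
        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```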
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
Leveraging AWS to Build a Scalable Data Pipeline
Nice detailed description of an auto-scaled SQS worker pool
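The core of such a worker is very small; a hedged sketch using the AWS SDK v1 for Java (queue URL and processing logic are placeholders), with the ASG scaling the number of these processes on queue depth:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public final class SqsWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"; // hypothetical

        while (true) {
            ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl)
                    .withWaitTimeSeconds(20)        // long poll to avoid empty receives
                    .withMaxNumberOfMessages(10);
            for (Message m : sqs.receiveMessage(req).getMessages()) {
                process(m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle()); // ack only after success
            }
        }
    }

    private static void process(String body) {
        // application-specific work goes here
    }
}
```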
sqs  aws  ec2  auto-scaling  asg  worker-pools  architecture  scalability 
june 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns used in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
Why Loggly loves Apache Kafka
Some good factoids about Loggly's Kafka usage and scale
scalability  logging  loggly  kafka  queueing  ops  reliabilty 
may 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
Services Engineering Reading List
good list of papers/articles for fans of scalability etc.
architecture  papers  reading  reliability  scalability  articles  to-read 
march 2015 by jm
Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
Here are two helpful guidelines (for largely disjoint populations):

If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]

We think everyone should have to do this, because it leads to better systems and better research.
graph  coding  hadoop  spark  giraph  graph-processing  hardware  scalability  big-data  batch  algorithms  pagerank 
january 2015 by jm
Doing Constant Work to Avoid Failures
A good example of a design pattern -- by performing a relatively constant amount of work regardless of the input, we can predict scalability and reduce the risk of overload when something unexpected changes in that input
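A hedged sketch of the idea (names and sizes invented for illustration): every cycle the worker applies a full, fixed-size table rather than a delta, so the cost is identical whether one entry changed or all of them did:

```java
// Constant-work applier: the per-cycle cost never depends on how much changed.
final class ConstantWorkApplier {
    static final int TABLE_SIZE = 10_000;   // fixed size, padded if fewer real entries

    void runOneCycle(byte[][] fullTable) {
        if (fullTable.length != TABLE_SIZE) {
            throw new IllegalStateException("input must always be full-size");
        }
        for (byte[] row : fullTable) {
            apply(row);                      // apply every row unconditionally, even if unchanged
        }
    }

    private void apply(byte[] row) { /* idempotent application of one entry */ }
}
```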
scalability  scaling  architecture  aws  route53  via:brianscanlan  overload  constant-load  loading 
november 2014 by jm
Carbon vs Megacarbon and Roadmap ? · Issue #235 · graphite-project/carbon
Carbon is a great idea, but fundamentally, twisted doesn't do what carbon-relay or carbon-aggregator were built to do when hit with sustained and heavy throughput. Much to my chagrin, concurrency isn't one of python's core competencies.


+1, sadly. We are patching around the edges with half-released third-party C rewrites in our graphite setup, as we exceed the scale Carbon can support.
carbon  graphite  metrics  ops  python  twisted  scalability 
october 2014 by jm
On-Demand Jenkins Slaves With Amazon EC2
This is very likely where we'll be going for our acceptance tests in Swrve
testing  jenkins  ec2  spot-instances  scalability  auto-scaling  ops  build 
august 2014 by jm
Google's Influential Papers for 2013
Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google. Below are some of the especially influential papers co-authored by Googlers in 2013.
google  papers  toread  reading  2013  scalability  machine-learning  algorithms 
july 2014 by jm
Google Replaces MapReduce With New Hyper-Scale Cloud Analytics System
MR no more:
“We don’t really use MapReduce anymore,” [Urs] Hölzle said in his keynote presentation at the Google I/O conference in San Francisco Wednesday. The company stopped using the system “years ago.”

Cloud Dataflow, which Google will also offer as a service for developers using its cloud platform, does not have the scaling restrictions of MapReduce. “Cloud Dataflow is the result of over a decade of experience in analytics,” Hölzle said. “It will run faster and scale better than pretty much any other system out there.”

Gossip on the mech-sympathy list says that 'seems that the new platform taking over is a combination of FlumeJava and MillWheel: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf ,
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41378.pdf'
map-reduce  google  hadoop  cloud-dataflow  scalability  big-data  urs-holzle  google-io 
june 2014 by jm
Shutterbits replacing hardware load balancers with local BGP daemons and anycast
Interesting approach. Potentially risky, though -- heavy use of anycast on a large-scale datacenter network could increase the scale of the OSPF graph, which scales exponentially. This can have major side effects on OSPF reconvergence time, which creates an interesting class of network outage in the event of OSPF flapping.

Having said that, an active/passive failover LB pair will already announce a single anycast virtual IP anyway, so, assuming there are a similar number of anycast IPs in the end, it may not have any negative side effects.

There's also the inherent limitation noted in the second-to-last paragraph; 'It comes down to what your hardware router can handle for ECMP. I know a Juniper MX240 can handle 16 next-hops, and have heard rumors that a software update will bump this to 64, but again this is something to keep in mind'. Taking a leaf from the LB design, and using BGP to load-balance across a smaller set of haproxy instances, would seem like a good approach to scale up.
scalability  networking  performance  load-balancing  bgp  exabgp  ospf  anycast  routing  datacenters  scaling  vips  juniper  haproxy  shutterstock 
may 2014 by jm
Spark Streaming
an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s in-built machine learning algorithms, and graph processing algorithms on data streams.
spark  streams  stream-processing  cep  scalability  apache  machine-learning  graphs 
may 2014 by jm
Why Disqus made the Python->Go switchover
for their realtime component, from the horse's mouth:
at higher contention, the CPU was choking everything. Switching over to Go removed that contention for us, which was the primary issue that we were seeing.
python  languages  concurrency  go  threading  gevent  scalability  disqus  realtime  hn 
may 2014 by jm
An analysis of Facebook photo caching
excellent analysis of caching behaviour at scale, from the FB engineering blog (via Tony Finch)
via:fanf  caching  facebook  architecture  photos  images  cache  fifo  lru  scalability 
may 2014 by jm
Scalable Atomic Visibility with RAMP Transactions
Great new distcomp protocol work from Peter Bailis et al:
We’ve developed three new algorithms—called Read Atomic Multi-Partition (RAMP) Transactions—for ensuring atomic visibility in partitioned (sharded) databases: either all of a transaction’s updates are observed, or none are. [...]

How they work: RAMP transactions allow readers and writers to proceed concurrently. Operations race, but readers autonomously detect the races and repair any non-atomic reads. The write protocol ensures readers never stall waiting for writes to arrive.

Why they scale: Clients can’t cause other clients to stall (via synchronization independence) and clients only have to contact the servers responsible for items in their transactions (via partition independence). As a consequence, there’s no mutual exclusion or synchronous coordination across servers.

The end result: RAMP transactions outperform existing approaches across a variety of workloads, and, for a workload of 95% reads, RAMP transactions scale to over 7 million ops/second on 100 servers at less than 5% overhead.
scale  synchronization  databases  distcomp  distributed  ramp  transactions  scalability  peter-bailis  protocols  sharding  concurrency  atomic  partitions 
april 2014 by jm
'Scaling to Millions of Simultaneous Connections' [pdf]
Presentation by Rick Reed of WhatsApp on the large-scale Erlang cluster backing the WhatsApp API, delivered at Erlang Factory SF, March 30 2012. lots of juicy innards here
erlang  scaling  scalability  performance  whatsapp  freebsd  presentations 
february 2014 by jm
Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck. What you can do?
good blog post on Little's Law, plugging quasar, pulsar, and comsat, 3 new open-source libs offering Erlang-like lightweight threads on the JVM
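The law itself is just L = λW; a trivial worked example (illustrative numbers) of why thread-per-request runs into OS limits:

```java
// Little's Law: mean concurrency L = arrival rate (lambda) * mean time in system (W).
public final class LittlesLaw {
    public static void main(String[] args) {
        double lambda = 10_000.0;  // requests/sec
        double w = 0.005;          // 5 ms mean latency, in seconds
        double l = lambda * w;     // requests in flight on average
        System.out.println("in-flight requests: " + l);  // 50.0 -- 50 OS threads in a
        // thread-per-request model, versus a handful of carrier threads with fibers
    }
}
```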
jvm  java  quasar  pulsar  comsat  littles-law  scalability  async  erlang 
february 2014 by jm
Extending graphite’s mileage
Ad company InMobi are using graphite heavily (albeit not as heavily as $work are), ran into the usual scaling issues, and chose to fix it in code by switching from a filesystem full of whisper files to a LevelDB per carbon-cache:
The carbon server is now able to run without breaking a sweat even when 500K metrics per minute is being pumped into it. This has been in production since late August 2013 in every datacenter that we operate from.


Very nice. I hope this gets merged/supported.
graphite  scalability  metrics  leveldb  storage  inmobi  whisper  carbon  open-source 
january 2014 by jm
Non-blocking transactional atomicity
interesting new distributed atomic transaction algorithm from Peter Bailis
algorithms  database  distributed  scalability  storage  peter-bailis  distcomp 
october 2013 by jm
Behind the Screens at Loggly
Boost ASIO at the front end (!), Kafka 0.8, Storm, and ElasticSearch
boost  scalability  loggly  logging  ingestion  cep  stream-processing  kafka  storm  architecture  elasticsearch 
september 2013 by jm
_MillWheel: Fault-Tolerant Stream Processing at Internet Scale_ [paper, pdf]
from VLDB 2013:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
millwheel  google  data-processing  cep  low-latency  fault-tolerance  scalability  papers  event-processing  stream-processing 
august 2013 by jm
New Tweets per second record, and how | Twitter Blog
How Twitter scaled up massively in 3 years -- replacing Ruby with the JVM, adopting SOA and custom sharding. Good summary post, looking forward to more techie details soon
twitter  performance  scalability  jvm  ruby  soa  scaling 
august 2013 by jm
Building a panopticon: The evolution of the NSA’s XKeyscore
This is an amazing behind-the-scenes look at the architecture of XKeyscore, and how it evolved from an earlier large-scale packet interception system, Narus' Semantic Traffic Analyzer.

XKeyscore is a federated, distributed system, with distributed packet-capture agents running on Linux, built with protocol-specific plugins, which write 3 days of raw packet data, and 30 days of intercept metadata, to local buffer stores. Central queries are then 'distributed across all of the XKeyscore tap sites, and any results are returned and aggregated'.

Dunno about you, but this is pretty much how I would have built something like this, IMO....
panopticon  xkeyscore  nsa  architecture  scalability  packet-capture  narus  sniffing  snooping  interception  lawful-interception  li  tapping 
august 2013 by jm
The Architecture Twitter Uses to Deal with 150M Active Users, 300K QPS, a 22 MB/S Firehose, and Send Tweets in Under 5 Seconds
Good read.
Twitter is primarily a consumption mechanism, not a production mechanism. 300K QPS are spent reading timelines and only 6000 requests per second are spent on writes.


* their approach of precomputing the timeline for the non-search case is a good example of optimizing for the more frequently-exercised path.

* MySQL and Redis are the underlying stores. Redis is acting as a front-line in-RAM cache. they're pretty happy with it: https://news.ycombinator.com/item?id=6011254

* these further talks go into more detail, apparently (haven't watched them yet):

http://www.infoq.com/presentations/Real-Time-Delivery-Twitter
http://www.infoq.com/presentations/Twitter-Timeline-Scalability
http://www.infoq.com/presentations/Timelines-Twitter

* funny thread of comments on HN, from a big-iron fan: https://news.ycombinator.com/item?id=6008228
scale  architecture  scalability  twitter  high-scalability  redis  mysql 
july 2013 by jm
Facebook announce Wormhole
Over the last couple of years, we have built and deployed a reliable publish-subscribe system called Wormhole. Wormhole has become a critical part of Facebook's software infrastructure. At a high level, Wormhole propagates changes issued in one system to all systems that need to reflect those changes – within and across data centers.


Facebook's Kafka-alike, basically, although with some additional low-latency guarantees. FB appear to be using it for multi-region and multi-AZ replication. Proprietary.
pub-sub  scalability  facebook  realtime  low-latency  multi-region  replication  multi-az  wormhole 
june 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.

There's still a potential race condition where the "GC scout" can be achieved quickly, then a GC starts just before the "real" request is issued. But the incidence of GC-blocking-request is probably massively reduced.

It also helps against packet loss on the rack or server host, since packet loss will cause the drop of one of the TCP packets, and the TCP retransmit timeout will certainly be higher than 1ms, causing the deadline to be missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's network vicinity, it will be vital to still attempt to send the request to the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.

For service APIs with real low-latency requirements, this is a great idea.
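A hedged sketch of the scout probe as described above (the dedicated scout port and ~1 ms budget come from the talk; class and method names here are assumptions, not LinkedIn's code):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

// Probe the scout port before sending the real request; a miss means the host is
// likely in GC (or unreachable), so fail over to the next host in the set.
final class GcScout {
    static boolean hostLooksHealthy(String host, int scoutPort) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, scoutPort), 1); // ~1 ms connect budget
            s.setSoTimeout(1);                                    // ~1 ms read budget
            s.getOutputStream().write(1);                         // tiny 1-packet probe
            return s.getInputStream().read() != -1;               // any reply within deadline
        } catch (Exception timedOutOrRefused) {
            return false;                                         // skip this host, fail over
        }
    }
}
```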
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Martin Thompson, Luke "Snabb Switch" Gorrie etc. review the C10M presentation from Schmoocon
on the mechanical-sympathy mailing list. Some really interesting discussion on handling insane quantities of TCP connections using low volumes of hardware:
This talk has some good points and I think the subject is really interesting.  I would take the suggested approach with serious caution.  For starters the Linux kernel is nowhere near as bad as it made out.  Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning.  I've heard they have since taken this to over 5 million concurrent connections.

BTW Open Onload is an open source implementation.  Writing a network stack is a serious undertaking.  In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases.  It is a great exercise in data structures and lock-free programming.  If you need very high-end performance I'd talk to the Solarflare or Mellanox guys before writing my own.

There are some errors and omissions in this talk.  For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache.  A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections.  Creating and destroying 1 million connections per second is a major issue.  A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection.
mechanical-sympathy  hardware  scaling  c10m  tcp  http  scalability  snabb-switch  martin-thompson 
may 2013 by jm
CAP Confusion: Problems with ‘partition tolerance’
Another good clarification about CAP which resurfaced during last week's discussion:
So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition [...]: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions it from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.

(sorry, catching up on old interesting things posted last week...)
failure  scalability  network  partitions  cap  quorum  distributed-databases  fault-tolerance 
may 2013 by jm
Alex Feinberg's response to Damien Katz' anti-Dynamoish/pro-Couchbase blog post
Insightful response, worth bookmarking. (the original post is at http://damienkatz.net/2013/05/dynamo_sure_works_hard.html ).
while you are saving on read traffic (online reads only go to the master), you are now decreasing availability (contrary to your stated goal), and increasing system complexity.
You also do hurt performance by requiring all writes and reads to be serialized through a single node: unless you plan to have a leader election whenever the node fails to meet a read SLA (which is going to result in a disaster -- I am speaking from personal experience), you will have to accept that you're bottlenecked by a single node. With a Dynamo-style quorum (for either reads or writes), a single straggler will not reduce whole-cluster latency.
The core point of Dynamo is low latency, availability and handling of all kinds of partitions: whether clean partitions (long term single node failures), transient failures (garbage collection pauses, slow disks, network blips, etc...), or even more complex dependent failures.
The reality, of course, is that availability is neither the sole, nor the principal concern of every system. It's perfectly fine to trade off availability for other goals -- you just need to be aware of that trade off.
cap  distributed-databases  databases  quorum  availability  scalability  damien-katz  alex-feinberg  partitions  network  dynamo  riak  voldemort  couchbase 
may 2013 by jm
DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second
250 million tweets per day, 30-node HBase cluster, 400TB of storage, Kafka and 0mq.

This is from 2011, hence this dated line: 'for a distributed application they thought AWS was too limited, especially in the network. AWS doesn’t do well when nodes are connected together and they need to talk to each other. Not low enough latency network. Their customers care about latency.' (Nowadays, it would be damn hard to build a lower-latency network than that attached to a cc2.8xlarge instance.)
datasift  architecture  scalability  data  twitter  firehose  hbase  kafka  zeromq 
april 2013 by jm
Latency's Worst Nightmare: Performance Tuning Tips and Tricks [slides]
the basics of running a service stack (web, app servers, data stores) on AWS. some good benchmark figures in the final slides
benchmarks  aws  ec2  ebs  piops  services  scaling  scalability  presentations 
april 2013 by jm
High Scalability - Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years
wow, Pinterest have a pretty hardcore architecture. Sharding to the max. This is scary stuff for me:
a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.


yeah, so, eek ;)
clustering  sharding  architecture  aws  scalability  scaling  pinterest  via:matt-sergeant  redis  mysql  memcached 
april 2013 by jm
From a monolithic Ruby on Rails app to the JVM
How Soundcloud have ditched the monolithic Rails for nimbler, small-scale distributed polyglot services running on the JVM
soundcloud  rails  slides  jvm  scalability  ruby  scala  clojure  coding 
march 2013 by jm
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
scalability  scaling  voldemort  hadoop  batch  algorithms  histograms  statistics  bucketing  percentiles 
february 2013 by jm
DuckDuckGo Architecture - 1 Million Deep Searches a Day and Growing
thumbs-up for DNSMadeEasy's Global Traffic Director anycast-based geographically-segmented DNS service, in particular
dns  architecture  scalability  search  duckduckgo  geoip  anycast 
january 2013 by jm
The innards of Evernote's new business analytics data warehouse
replacing a giant MySQL star-schema reporting server with a Hadoop/Hive/ParAccel cluster
horizontal-scaling  scalability  bi  analytics  reporting  evernote  via:highscalability  hive  hadoop  paraccel 
december 2012 by jm
HTTP Error 403: The service you requested is restricted - Vodafone Community
Looks like Vodafone Ireland are failing to scale their censorware; clients on their network reporting "HTTP Error 403: The service you requested is restricted". According to a third-party site, this error is produced by the censorship software they use when it's insufficiently scaled for demand:

"When you try to use HTTP Vodafone route a request to their authentication server to see if your account is allow to connect to the site. By default they block a list of adult/premium web sites (this is service you have switched on or off with your account). The problem is at busy times this validation service is overloaded and so their systems get no response as to whether the site is allowed, so assume the site you asked for is restricted and gives the 403 error. Once this happens you seem to have to make new 3G data connection (reset the phone, move cell or let the connection time out) to get it to try again."


Sample: http://pic.twitter.com/N1lAwBjW
scaling  ireland  vodafone  fail  censorware  scalability  customer-service 
november 2012 by jm
Tumblr Architecture - 15 Billion Page Views A Month And Harder To Scale Than Twitter
Buckets of details on Tumblr's innards. fans of Finagle and Kafka, notably
tumblr  scalability  web  finagle  redis  kafka 
november 2012 by jm
Spanner: Google's Globally-Distributed Database [PDF]

Abstract: Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

To appear in:
OSDI'12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.
database  distributed  google  papers  toread  pdf  scalability  distcomp  transactions  cap  consistency 
september 2012 by jm
Evolution of SoundCloud's Architecture
nice write-up. nginx, Rails, RabbitMQ, MySQL, Cassandra, Elastic Search, HAProxy
soundcloud  webdev  architecture  scaling  scalability 
september 2012 by jm
Scaling lessons learned at Dropbox
website-scaling tips and suggestions, "particularly for a resource-constrained, fast-growing environment that can’t always afford to do things “the right way” (i.e., any real-world engineering project)". I really like the "run with fake load" trick; add additional queries/load which you can quickly turn off if the service starts browning out, giving you a few days breathing room to find a real fix before customers start being affected. Neat
dropbox  scalability  webdev  load  scaling-up 
july 2012 by jm
Scale Something: How Draw Something rode its rocket ship of growth
Membase, surprise answer. In general it sounds like they had a pretty crazy time -- rebuilding the plane in flight even more than usual. "This had us on our toes and working 24 hours a day. I think at one point we were up for around 60-plus hours straight, never leaving the computer. We had to scale out web servers using DNS load balancing, we had to get multiple HAProxies, break tables off MySQL to their own databases, transparently shard tables, and more. This was all being done on demand, live, and usually in the middle of the night. We were very lucky that most of our layers were scalable with little or no major modifications needed. Helping us along the way was our very detailed custom server monitoring tools which allowed us to keep a very close eye on load, memory, and even provided real time usage stats on the game which helped with capacity planning. We eventually ended up with easy to launch "clusters" of our app that included NGINX, HAProxy, and Goliath servers all of which independent of everything else and when launched, increased our capacity by a constant. At this point our drawings per second were in the thousands, and traffic that looked huge a week ago was just a small bump on the current graphs."
scale  scalability  draw-something  games  haproxy  mysql  membase  couchbase 
april 2012 by jm
Storage Infrastructure Behind Facebook Messages
HBase and Haystack; all data LZO-compressed; very interesting approach to testing -- they 'shadow the real production workload into the test cluster to test before going into production'. This catches a 'high percentage' of issues before production. nice
testing  shadowing  haystack  hbase  facebook  scalability  lzo  messaging  sms  via:james-hamilton 
october 2011 by jm
Storm
'The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem. Storm fills that hole.'
data  scaling  twitter  realtime  scalability  storm  queueing 
september 2011 by jm
good taxonomy of memcached use cases
via Jeff Barr's announcement of the Elasticache launch. from 2008, but a better taxonomy than I've seen elsewhere
memcached  caching  mysql  performance  scalability  via:jeffbarr 
august 2011 by jm
The Secrets of Building Realtime Big Data Systems
great slides, via HN. recommends a canonical Hadoop long-term store and a quick, realtime, separate datastore for "not yet processed by Hadoop" data
hadoop  big-data  data  scalability  datamining  realtime  slides  presentations 
may 2011 by jm
Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day
Scribe logs events, "ptail" (parallel tail presumably) tails logs from Scribe stores, Puma batch-aggregates, writes to HBase.  Java and Thrift on the backend, PHP in front
facebook  hbase  scalability  performance  hadoop  scribe  events  analytics  architecture  tail  append  from delicious
march 2011 by jm
Akka
'platform for event-driven, scalable, and fault-tolerant architectures on the JVM' .. Actor-based, 'let-it-crash', Apache-licensed, Java and Scala APIs, remote Actors, transactional memory -- looks quite nice
scala  java  concurrency  scalability  apache  akka  actors  erlang  fault-tolerance  events  from delicious
march 2011 by jm
Thousands of Threads and Blocking I/O [PDF]
classic presentation from Paul Tyma of Mailinator regarding the java.nio (event-driven, non-threaded) vs java.io (threaded) model of server concurrency, backing up the scalability of threads on modern JVMs
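For contrast with java.nio event loops, a minimal thread-per-connection blocking server of the kind the talk defends (a toy echo example, not from the slides):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public final class BlockingEchoServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket client = server.accept();
                new Thread(() -> echo(client)).start();   // one blocking thread per connection
            }
        }
    }

    private static void echo(Socket client) {
        try (Socket c = client;
             InputStream in = c.getInputStream();
             OutputStream out = c.getOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);  // block, copy, repeat
        } catch (Exception ignored) { /* connection closed */ }
    }
}
```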
java  async  io  jvm  linux  performance  scalability  threading  threads  server  nio  paul-tyma  mailinator  from delicious
july 2010 by jm
How do we kick our synchronous addiction?
great post on the hazards of programming in an async framework, and how damn hard it is. good comments thread too (via jzawodny)
via:jzawodny  coding  python  javascript  scalability  ruby  concurrency  erlang  async  node.js  twisted  from delicious
february 2010 by jm
What Second Life can teach your datacenter about scaling Web apps
good scaling advice from Linden Labs' Ian Wilkes (who doesn't seem to have a blog, sadly)
linden  ian-wilkes  scaling  datacenters  scalability  deployment  ops  services  from delicious
february 2010 by jm