jm + scaling   69

Locking, Little's Law, and the USL
Excellent explanatory mailing list post by Martin Thompson to the mechanical-sympathy group, discussing Little's Law vs the USL:
Little's law can be used to describe a system in steady state from a queuing perspective, i.e. arrival and leaving rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl's law, in that throughput is one over latency.

However this is an inaccurate way to model a system with locks. Amdahl's law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost then it is much lower than in a multi-threaded environment where cache coherence, other OS costs such as scheduling, and lock implementations need to be considered.

Universal Scalability Law (USL) accounts for both the contention and the coherence costs.
http://www.perfdynamics.com/Manifesto/USLscalability.html

When modelling locks it is necessary to consider how contention and coherence costs vary given how they can be implemented. Consider in Java how we have biased locking, thin locks, fat locks, inflation, and revoking biases which can cause safe points that bring all threads in the JVM to a stop with a significant coherence component.
usl  scaling  scalability  performance  locking  locks  java  jvm  amdahls-law  littles-law  system-dynamics  modelling  systems  caching  threads  schedulers  contention 
11 weeks ago by jm
cristim/autospotting
'Easy to use tool that automatically replaces some or even all on-demand AutoScaling group members with similar or larger identically configured spot instances in order to generate significant cost savings on AWS EC2, behaving much like an AutoScaling-backed spot fleet.'
asg  autoscaling  ec2  aws  spot-fleet  spot-instances  cost-saving  scaling 
august 2017 by jm
Scaling Amazon Aurora at ticketea
Ticketing is a business in which extreme traffic spikes are the norm, rather than the exception. For Ticketea, this means that our traffic can increase by a factor of 60x in a matter of seconds. This usually happens when big events (which have a fixed, pre-announced 'sale start time') go on sale.
scaling  scalability  ops  aws  aurora  autoscaling  asg 
may 2017 by jm
How eBay’s Shopping Cart used compression techniques to solve network I/O bottlenecks
compressing data written to MongoDB using LZ4_HIGH --dropped oplog write rates from 150GB/hour to 11GB/hour. Snappy and Gzip didn't fare too well by comparison
lz4  compression  gzip  json  snappy  scaling  ebay  mongodb 
february 2017 by jm
AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205) - YouTube
Yelp talk about their Spot Fleet price optimization autoscaler app, FleetMiser
yelp  scaling  aws  spot-fleet  ops  spot-instances  money 
december 2016 by jm
Auto scaling Pinterest
notes on a second-system take on autoscaling -- Pinterest tried it once, it didn't take, and this is the rerun. I like the tandem ASG approach (spots and nonspots)
spot-instances  scaling  aws  scalability  ops  architecture  pinterest  via:highscalability 
december 2016 by jm
Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
Auto Scaling for EC2 Spot Fleets
'we are enhancing the Spot Fleet model with the addition of Auto Scaling. You can now arrange to scale your fleet up and down based on a Amazon CloudWatch metric. The metric can originate from an AWS service such as EC2, Amazon EC2 Container Service, or Amazon Simple Queue Service (SQS). Alternatively, your application can publish a custom metric and you can use it to drive the automated scaling.'
asg  auto-scaling  ec2  spot-fleets  ops  scaling 
september 2016 by jm
Uploading, Resizing and Serving images with Google Cloud Platform — Google Cloud Platform — Community — Medium
Cropping, scaling, and resizing images on the fly, for free, with GAE. Great service, wish AWS had something similar
App Engine API has a very useful function to extract a magic URL for serving the images when uploaded into the Cloud Storage. get_serving_url() returns a URL that serves the image in a format that allows dynamic resizing and cropping, so you don’t need to store different image sizes on the server. Images are served with low latency from a highly optimized, cookieless infrastructure.
gae  google  app-engine  images  scaling  cropping  image-processing  thumbnails  google-cloud 
may 2016 by jm
Topics in High-Performance Messaging
'We have worked together in the field of high-performance messaging for many years, and in that time, have seen some messaging systems that worked well and some that didn't. Successful deployment of a messaging system requires background information that is not easily available; most of what we know, we had to learn in the school of hard knocks. To save others a knock or two, we have collected here the essential background information and commentary on some of the issues involved in successful deployments. This information is organized as a series of topics around which there seems to be confusion or uncertainty. Please contact us if you have questions or comments.'
messaging  scalability  scaling  performance  udp  tcp  protocols  multicast  latency 
december 2015 by jm
Designing the Spotify perimeter
How Spotify use nginx as a frontline for their sites and services
scaling  spotify  nginx  ops  architecture  ssl  tls  http  frontline  security 
october 2015 by jm
A simple guide to 9-patch for Android UI
This is a nifty hack. TIL!

'9-patch uses png transparency to do an advanced form of 9-slice or scale9. The guides are straight, 1-pixel black lines drawn on the edge of your image that define the scaling and fill of your image. By naming your image file name.9.png, Android will recognize the 9.png format and use the black guides to scale and fill your bitmaps.'
android  design  9-patch  scaling  images  bitmaps  scale9  9-slice  ui  graphics 
july 2015 by jm
Elements of Scale: Composing and Scaling Data Platforms
Great, encyclopedic blog post rounding up common architectural and algorithmic patterns using in scalable data platforms. Cut out and keep!
architecture  storage  databases  data  big-data  scaling  scalability  ben-stopford  cqrs  druid  parquet  columnar-stores  lambda-architecture 
may 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
AWS re:Invent 2014 | (SPOT302) Under the Covers of AWS: Its Core Distributed Systems - YouTube
This is a really solid talk -- not surprising, alv@ is one of the speakers!
"AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.

In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives -- DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS. "


Slides: http://www.slideshare.net/AmazonWebServices/spot302-under-the-covers-of-aws-core-distributed-systems-primitives-that-power-our-platform-aws-reinvent-2014
scale  scaling  aws  amazon  dht  logging  data-structures  distcomp  via:marc-brooker  dynamodb  s3 
november 2014 by jm
Facebook's datacenter fabric
FB goes public with its take on the Clos network-based datacenter network architecture
networking  scaling  facebook  clos-networks  fabrics  datacenters  network-architecture 
november 2014 by jm
Doing Constant Work to Avoid Failures
A good example of a design pattern -- by performing a relatively constant amount of work regardless of the input, we can predict scalability and reduce the risk of overload when something unexpected changes in that input
scalability  scaling  architecture  aws  route53  via:brianscanlan  overload  constant-load  loading 
november 2014 by jm
mcrouter: A memcached protocol router for scaling memcached deployments
New from Facebook engineering:
Last year, at the Data@Scale event and at the USENIX Networked Systems Design and Implementation conference , we spoke about turning caches into distributed systems using software we developed called mcrouter (pronounced “mick-router”). Mcrouter is a memcached protocol router that is used at Facebook to handle all traffic to, from, and between thousands of cache servers across dozens of clusters distributed in our data centers around the world. It is proven at massive scale — at peak, mcrouter handles close to 5 billion requests per second. Mcrouter was also proven to work as a standalone binary in an Amazon Web Services setup when Instagram used it last year before fully transitioning to Facebook's infrastructure.

Today, we are excited to announce that we are releasing mcrouter’s code under an open-source BSD license. We believe it will help many sites scale more easily by leveraging Facebook’s knowledge about large-scale systems in an easy-to-understand and easy-to-deploy package.


This is pretty crazy -- basically turns a memcached cluster into a much more usable clustered-storage system, with features like shadowing production traffic, cold cache warmup, online reconfiguration, automatic failover, prefix-based routing, replicated pools, etc. Lots of good features.
facebook  scaling  cache  proxy  memcache  open-source  clustering  distcomp  storage 
september 2014 by jm
Inside Apple’s Live Event Stream Failure, And Why It Happened: It Wasn’t A Capacity Issue
The bottom line with this event is that the encoding, translation, JavaScript code, the video player, the call to S3 single storage location and the millisecond refreshes all didn’t work properly together and was the root cause of Apple’s failed attempt to make the live stream work without any problems. So while it would be easy to say it was a CDN capacity issue, which was my initial thought considering how many events are taking place today and this week, it does not appear that a lack of capacity played any part in the event not working properly. Apple simply didn’t provision and plan for the event properly.
cdn  streaming  apple  fail  scaling  s3  akamai  caching 
september 2014 by jm
How Twitter Uses Redis to Scale
'105TB RAM, 39MM QPS, 10,000+ instances.' Notes from a talk given by Yao Yu of Twitter's Cache team, where she's worked for 4 years. Lots of interesting insights into large-scale Redis caching usage -- as in, large enough to max out the cluster hosts' network bandwidth.
twitter  redis  caching  memcached  yao-yu  scaling 
september 2014 by jm
Fighting spam with BotMaker
Some vague details of the antispam system in use at Twitter.
The main challenges in supporting this type of system are evaluating rules with low enough latency that they can run on the write path for Twitter’s main features (i.e., Tweets, Retweets, favorites, follows and messages), supporting computationally intense machine learning based rules, and providing Twitter engineers with the ability to modify and create new rules instantaneously.
spam  realtime  scaling  twitter  anti-spam  botmaker  rules 
august 2014 by jm
New Low Cost EC2 Instances with Burstable Performance
Oh, very neat. New micro, small, and medium-class instances with burstable CPU scaling:
The T2 instances are built around a processing allocation model that provides you a generous, assured baseline amount of processing power coupled with the ability to automatically and transparently scale up to a full core when you need more compute power. Your ability to burst is based on the concept of "CPU Credits" that you accumulate during quiet periods and spend when things get busy. You can provision an instance of modest size and cost and still have more than adequate compute power in reserve to handle peak demands for compute power.
ec2  aws  hosting  cpu  scaling  burst  load  instances 
july 2014 by jm
Google's Pegasus
a power-management subsystem for warehouse-scale computing farms. "It adjusts the power-performance settings of servers so that the overall workload barely meets its latency constraints for user queries."
pegasus  power-management  power  via:fanf  google  latency  scaling 
june 2014 by jm
Shutterbits replacing hardware load balancers with local BGP daemons and anycast
Interesting approach. Potentially risky, though -- heavy use of anycast on a large-scale datacenter network could increase the scale of the OSPF graph, which scales exponentially. This can have major side effects on OSPF reconvergence time, which creates an interesting class of network outage in the event of OSPF flapping.

Having said that, an active/passive failover LB pair will already announce a single anycast virtual IP anyway, so, assuming there are a similar number of anycast IPs in the end, it may not have any negative side effects.

There's also the inherent limitation noted in the second-to-last paragraph; 'It comes down to what your hardware router can handle for ECMP. I know a Juniper MX240 can handle 16 next-hops, and have heard rumors that a software update will bump this to 64, but again this is something to keep in mind'. Taking a leaf from the LB design, and using BGP to load-balance across a smaller set of haproxy instances, would seem like a good approach to scale up.
scalability  networking  performance  load-balancing  bgp  exabgp  ospf  anycast  routing  datacenters  scaling  vips  juniper  haproxy  shutterstock 
may 2014 by jm
Docker Plugin for Jenkins
The aim of the docker plugin is to be able to use a docker host to dynamically provision a slave, run a single build, then tear-down that slave. Optionally, the container can be committed, so that (for example) manual QA could be performed by the container being imported into a local docker provider, and run from there.


The holy grail of Jenkins/Docker integration. How cool is that...
jenkins  docker  ops  testing  ec2  hosting  scaling  elastic-scaling  system-testing 
may 2014 by jm
Druid | How We Scaled HyperLogLog: Three Real-World Optimizations
3 optimizations Druid.io have made to the HLL algorithm to scale it up for production use in Metamarkets: compacting registers (fixes a bug with unions of multiple HLLs); a sparse storage format (to optimize space); faster lookups using a lookup table.
druid.io  metamarkets  scaling  hyperloglog  hll  algorithms  performance  optimization  counting  estimation 
april 2014 by jm
'Scaling to Millions of Simultaneous Connections' [pdf]
Presentation by Rick Reed of WhatsApp on the large-scale Erlang cluster backing the WhatsApp API, delivered at Erlang Factory SF, March 30 2012. lots of juicy innards here
erlang  scaling  scalability  performance  whatsapp  freebsd  presentations 
february 2014 by jm
Git is not scalable with too many refs/*
Mailing list thread from 2011; git starts to keel over if you tag too much
git  tags  coding  version-control  bugs  scaling  refs 
february 2014 by jm
Cassandra: tuning the JVM for read heavy workloads
The cluster we tuned is hosted on AWS and is comprised of 6 hi1.4xlarge EC2 instances, with 2 1TB SSDs raided together in a raid 0 configuration. The cluster’s dataset is growing steadily. At the time of this writing, our dataset is 341GB, up from less than 200GB a few months ago, and is growing by 2-3GB per day. The workload on this cluster is very read heavy, with quorum reads making up 99% of all operations.


Some careful GC tuning here. Probably not applicable to anyone else, but good approach in general.
java  performance  jvm  scaling  gc  tuning  cassandra  ops 
january 2014 by jm
Scryer: Netflix’s Predictive Auto Scaling Engine
Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. But Scryer is different from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly. Rather, Scryer predicts what the needs will be prior to the time of need and provisions the instances based on those predictions.
scaling  infrastructure  aws  ec2  netflix  scryer  auto-scaling  aas  metrics  prediction  spikes 
november 2013 by jm
Don't use Hadoop - your data isn't that big
see also HN comments: https://news.ycombinator.com/item?id=6398650 , particularly davidmr's great one:

I suppose all of this is to say that the amount of required parallelization of a problem isn't necessarily related to the size of the problem set as is mentioned most in the article, but also the inherent CPU and IO characteristics of the problem. Some small problems are great for large-scale map-reduce clusters, some huge problems are horrible for even bigger-scale map-reduce clusters (think fluid dynamics or something that requires each subdivision of the problem space to communicate with its neighbors).
I've had a quote printed on my door for years: Supercomputers are an expensive tool for turning CPU-bound problems into IO-bound problems.


I love that quote!
hadoop  big-data  scaling  map-reduce 
september 2013 by jm
Interview with the Github Elasticsearch Team
good background on Github's Elasticsearch scaling efforts. Some rather horrific split-brain problems under load, and crashes due to OpenJDK bugs (sounds like OpenJDK *still* isn't ready for production). painful
elasticsearch  github  search  ops  scaling  split-brain  outages  openjdk  java  jdk  jvm 
september 2013 by jm
Why wireless mesh networks won't save us from censorship
I'm not saying mesh networks don't work ever; the people in the wireless mesh community I've met are all great people doing fantastic work. What I am saying is that unplanned wireless mesh networks never work at scale. I think it's a great problem to think about, but in terms of actual allocation of time and resources I think there are other, more fruitful avenues of action to fight Internet censorship.


(via Kragen)
wireless  censorship  internet  networking  mesh  mesh-networks  organisation  scaling  wifi 
august 2013 by jm
New Tweets per second record, and how | Twitter Blog
How Twitter scaled up massively in 3 years -- replacing Ruby with the JVM, adopting SOA and custom sharding. Good summary post, looking forward to more techie details soon
twitter  performance  scalability  jvm  ruby  soa  scaling 
august 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.

There's still a potential race condition where the "GC scout" can be achieved quickly, then a GC starts just before the "real" request is issued. But the incidence of GC-blocking-request is probably massively reduced.

It also helps against packet loss on the rack or server host, since packet loss will cause the drop of one of the TCP packets, and the TCP retransmit timeout will certainly be higher than 1ms, causing the deadline to be missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's network vicinity, it will be vital to still attempt to send the request to the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.

For service APIs with real low-latency requirements, this is a great idea.
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Martin Thompson, Luke "Snabb Switch" Gorrie etc. review the C10M presentation from Schmoocon
on the mechanical-sympathy mailing list. Some really interesting discussion on handling insane quantities of TCP connections using low volumes of hardware:
This talk has some good points and I think the subject is really interesting.  I would take the suggested approach with serious caution.  For starters the Linux kernel is nowhere near as bad as it made out.  Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning.  I've heard they have since taken this to over 5 million concurrent connections.

BTW Open Onload is an open source implementation.  Writing a network stack is a serious undertaking.  In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases.  It is a great exercise in data structures and lock-free programming.  If you need very high-end performance I'd talk to the Solarflare or Mellanox guys before writing my own.

There are some errors and omissions in this talk.  For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache.  A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections.  Creating and destroying 1 million connections per second is a major issue.  A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection.
mechanical-sympathy  hardware  scaling  c10m  tcp  http  scalability  snabb-switch  martin-thompson 
may 2013 by jm
Latency's Worst Nightmare: Performance Tuning Tips and Tricks [slides]
the basics of running a service stack (web, app servers, data stores) on AWS. some good benchmark figures in the final slides
benchmarks  aws  ec2  ebs  piops  services  scaling  scalability  presentations 
april 2013 by jm
High Scalability - Scaling Pinterest - From 0 to 10s of Billions of Page Views a Month in Two Years
wow, Pinterest have a pretty hardcore architecture. Sharding to the max. This is scary stuff for me:
a [Cassandra-style] Cluster Management Algorithm is a SPOF. If there’s a bug it impacts every node. This took them down 4 times.


yeah, so, eek ;)
clustering  sharding  architecture  aws  scalability  scaling  pinterest  via:matt-sergeant  redis  mysql  memcached 
april 2013 by jm
Hadoop Operations at LinkedIn [slides]
another good Hadoop-at-scale presentation, from LI this time
hadoop  scaling  linkedin  ops 
march 2013 by jm
Timelike 2: everything fails all the time
Fantastic post on large-scale distributed load balancing strategies from @aphyr. Random and least-conns routing comes out on top in his simulation (although he hasn't yet tried Marc Brooker's two-randoms routing strategy)
via:hn  routing  distributed  least-conns  load-balancing  round-robin  distcomp  networking  scaling 
february 2013 by jm
Splout
'Splout is a scalable, open-source, easy-to-manage SQL big data view. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Splout serves a read-only, partitioned SQL view which is generated and indexed by Hadoop.'

Some FAQs: 'What's the difference between Splout SQL and Dremel-like solutions such as BigQuery, Impala or Apache Drill? Splout SQL is not a "fast analytics" Dremel-like engine. It is more thought to be used for serving datasets under web / mobile high-throughput, many lookups, low-latency applications. Splout SQL is more like a NoSQL database in the sense that it has been thought for answering queries under sub-second latencies. It has been thought for performing queries that impact a very small subset of the data, not queries that analyze the whole dataset at once.'
splout  sql  big-data  hadoop  read-only  scaling  queries  analytics 
february 2013 by jm
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
scalability  scaling  voldemort  hadoop  batch  algorithms  histograms  statistics  bucketing  percentiles 
february 2013 by jm
HTTP Error 403: The service you requested is restricted - Vodafone Community
Looks like Vodafone Ireland are failing to scale their censorware; clients on their network reporting "HTTP Error 403: The service you requested is restricted". According to a third-party site, this error is produced by the censorship software they use when it's insufficiently scaled for demand:

"When you try to use HTTP Vodafone route a request to their authentication server to see if your account is allow to connect to the site. By default they block a list of adult/premium web sites (this is service you have switched on or off with your account). The problem is at busy times this validation service is overloaded and so their systems get no response as to whether the site is allowed, so assume the site you asked for is restricted and gives the 403 error. Once this happens you seem to have to make new 3G data connection (reset the phone, move cell or let the connection time out) to get it to try again."


Sample: http://pic.twitter.com/N1lAwBjW
scaling  ireland  vodafone  fail  censorware  scalability  customer-service 
november 2012 by jm
Cliff Click's 2008 JavaOne talk about the NonBlockingHashTable
I'm a bit late to this data structure -- highly scalable, nearly lock-free, benchmarks very well (except with the G1 GC): http://edwwang.com/blog/2012/02/10/concurrent-hashmap-benchmark/ .

Having said that, it doesn't cope well with frequently-changing unique keys: http://sourceforge.net/tracker/?func=detail&aid=3563980&group_id=194172&atid=948362 .

More background at: http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable and http://www.azulsystems.com/blog/cliff/2007-04-01-non-blocking-hashtable-part-2

This was used in Cassandra for a while, although I think the above bug may have caused its removal?
nonblockinghashtable  data-structures  hashmap  concurrency  scaling  java  jvm 
october 2012 by jm
Evolution of SoundCloud's Architecture
nice write-up. nginx, Rails, RabbitMQ, MySQL, Cassandra, Elastic Search, HAProxy
soundcloud  webdev  architecture  scaling  scalability 
september 2012 by jm
High performance network programming on the JVM, OSCON 2012
by Erik Onnen of Urban Airship. very good presentation on the current state of the art in large-scale low-latency service operation using the JVM on Linux. Lots of good details on async vs sync, HTTPS/TLS/TCP tuning, etc.
http  https  scaling  jvm  async  sync  oscon  presentations  tcp 
july 2012 by jm
C500k in Action at Urban Airship
I missed this back in 2010; 500k active TCP connections to a single EC2 large instance using Java and NIO
c10k  java  linux  ec2  scaling  nio  netty  urban-airship 
july 2012 by jm
Scaling: It's Not What It Used To Be
skamille's top 5 scaling apps. "1. Redis. I was at a NoSQL meetup last night when someone asked "if you could put a million dollars behind one of the solutions presented here tonight, which one would you choose?" And the answer that one of the participants gave was "None of the above. I would choose Redis. Everyone uses one of these products and Redis."
2. Nginx. Your ops team probably already loves it. It's simple, it scales fabulously, and you don't have to be a programmer to understand how to run it.
3. HAProxy. Because if you're going to have hundreds or thousands of servers, you'd better have good load balancing.
4. Memcached. Redis can act as a cache but using a real caching product for such a purpose is probably a better call.
And finally:
5. Cloud hardware. Imagine trying to grow out to millions of users if you had to buy, install, and admin every piece of hardware you would need to do such a thing."
scaling  nginx  memcached  haproxy  redis 
april 2012 by jm
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Amazing stuff from Adrian Cockroft at last week's QCon. Faceted object model, lots of Cassandra automation
cassandra  api  design  oo  object-model  java  adrian-cockroft  slides  qcon  scaling  aws  netflix 
march 2012 by jm
Apache Kafka
'Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.' neat
kafka  linkedin  apache  distributed  messaging  pubsub  queue  incubator  scaling 
february 2012 by jm
Turbocharging Solr Index Replication with BitTorrent
Etsy now replicating their multi-GB search index across the search farm using BitTorrent. Why not Multicast? 'multicast rsync caused an epic failure for our network, killing the entire site for several minutes. The multicast traffic saturated the CPU on our core switches causing all of Etsy to be unreachable.' fun!
etsy  multicast  sev1  bittorrent  search  solr  rsync  scaling  outages 
february 2012 by jm
Benchmarking Cassandra Scalability on AWS - Over a million writes per second
NetFlix' benchmarks -- impressively detailed. '48, 96, 144 and 288 instances', across 3 EC2 AZs in us-east, successfully scaling linearly
ec2  aws  cassandra  scaling  benchmarks  netflix  performance 
november 2011 by jm
Storm
'The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem. Storm fills that hole.'
data  scaling  twitter  realtime  scalability  storm  queueing 
september 2011 by jm
_Scaling with MongoDB_, Michael Schurter 2011 [PDF]
presentation with some rather terrifying MongoDB war stories
mongodb  performance  presentation  scaling  war-stories 
june 2011 by jm
What Second Life can teach your datacenter about scaling Web apps
good scaling advice from Linden Labs' Ian Wilkes (who doesn't seem to have a blog, sadly)
linden  ian-wilkes  scaling  datacenters  scalability  deployment  ops  services  from delicious
february 2010 by jm
The technology behind Tornado, FriendFeed's web server
more on the new async HTTP server from FriendFeed/Facebook, in Python. looks lovely
async  http  epoll  python  comet  long-poll  facebook  scaling  scalability  web  friendfeed  tornado  opensource  from delicious
september 2009 by jm
Tornado Web Server
'an open source version of the scalable, non-blocking web server and tools that power FriendFeed. The FriendFeed application is written using a web framework that looks a bit like web.py or Google's webapp, but with additional tools and optimizations to take advantage of the underlying non-blocking (epoll) infrastructure.'
epoll  open-source  python  http  scalability  facebook  scaling  web  from delicious
september 2009 by jm
Dunbar's number
interesting anthropological stat - "the cognitive limit to the number of individuals with whom any one person can maintain stable relationships" = 150. "a direct function of relative neocortex size", "in turn limits group size." See also Kaa's law
dunbars-number  society  social-networks  groups  scaling  sociology  anthropology  robin-dunbar  relationships  primates 
march 2007 by jm
Dunbar's number
interesting anthropological stat - "the cognitive limit to the number of individuals with whom any one person can maintain stable relationships" = 150. "a direct function of relative neocortex size", "in turn limits group size." See also Kaa's law
dunbars-number  society  social-networks  groups  scaling  sociology  anthropology  robin-dunbar  relationships  primates 
march 2007 by jm

related tags

9-patch  9-slice  aas  adrian-cockroft  advice  akamai  algorithms  amazon  amdahls-law  analytics  android  anthropology  anti-spam  anycast  apache  api  app-engine  apple  architecture  asg  async  aurora  auto-scaling  autoscaling  aws  baron-schwartz  batch  ben-stopford  benchmarks  bgp  big-data  bitmaps  bittorrent  botmaker  bucketing  bugs  burst  c10k  c10m  cache  caching  cassandra  cdn  censorship  censorware  clos-networks  clustering  coding  columnar-stores  comet  compression  concurrency  conferences  constant-load  contention  cost-saving  counting  cpu  cqrs  craigslist  cropping  customer-service  data  data-structures  databases  datacenters  deployment  design  dht  distcomp  distributed  docker  druid  druid.io  dunbars-number  dynamodb  ebay  ebs  ec2  ecs  elastic-scaling  elasticsearch  emr  epoll  erlang  estimation  etsy  exabgp  fabrics  facebook  fail  failover  fault-tolerance  freebsd  friendfeed  frontline  gae  gc  gc-scout  git  github  go  google  google-cloud  graphics  groups  gzip  hadoop  haproxy  hardware  hashmap  histograms  hll  hosting  http  https  hyperloglog  ian-wilkes  image-processing  images  incubator  infrastructure  instances  internet  ireland  james-hamilton  java  javascript  jdk  jenkins  json  juniper  jvm  kafka  lambda-architecture  latency  least-conns  linden  linkedin  linux  littles-law  load  load-balancing  loading  locking  locks  logging  long-poll  lz4  map-reduce  martin-thompson  mechanical-sympathy  memcache  memcached  mesh  mesh-networks  messaging  metamarkets  metrics  modelling  money  mongodb  multicast  multifeed  mysql  netflix  netty  network-architecture  networking  nginx  nio  node.js  nonblockinghashtable  object-model  oo  open-source  openjdk  opensource  ops  optimization  organisation  oscon  ospf  outages  overload  packet-loss  parquet  pdf  pegasus  percentiles  performance  pinterest  piops  power  power-management  prediction  presentation  presentations  primates  protocols  proxy  pubsub  python  qcon  queries  queue  queueing  rdbms  read-only  realtime  redis  redshift  refs  relationships  reliability  rest  robin-dunbar  round-robin  route53  routing  rsync  ruby  rules  s3  scalability  scale  scale9  scaling  schedulers  scryer  search  security  services  sev1  sharding  shutterstock  slas  slides  snabb-switch  snappy  soa  social-networks  society  sociology  software  solr  soundcloud  spam  spikes  split-brain  splout  spot-fleet  spot-fleets  spot-instances  spotify  sql  ssl  statistics  storage  storm  streaming  streams  sync  system-dynamics  system-testing  systems  tags  tcp  testing  threadpools  threads  thumbnails  time-series  timeouts  tls  tornado  tsd  tuning  twitter  udp  ui  urban-airship  usl  velocity  version-control  via:brianscanlan  via:eoinbrazil  via:fanf  via:highscalability  via:hn  via:marc-brooker  via:matt-sergeant  vips  vodafone  voldemort  war-stories  web  webdev  whatsapp  wifi  wireless  yao-yu  yelp 

Copy this bookmark:



description:


tags: