jm + storm   23

Open Sourcing Twitter Heron
Twitter are open sourcing their Storm replacement, and moving it to an independent open source foundation
open-source  twitter  heron  storm  streaming  architecture  lambda-architecture 
may 2016 by jm
SuperChief: From Apache Storm to In-House Distributed Stream Processing
Another sorry tale of Storm issues:
Storm has been successful at Librato, but we experienced many of the limitations cited in the Twitter Heron: Stream Processing at Scale paper and outlined here by Adrian Colyer, including:
Inability to isolate, reason about, or debug performance issues due to the worker/executor/task paradigm. This led to building and configuring clusters specifically designed to attempt to mitigate these problems (i.e., separate clusters per topology, only running a worker per server.), which added additional complexity to development and operations and also led to over-provisioning.
Ability of tasks to move around led to difficult to trace performance problems.
Storm’s work provisioning logic led to some tasks serving more Kafka partitions than others. This in turn created latency and performance issues that were difficult to reason about. The initial solution was to over-provision in an attempt to get a better hashing/balancing of work, but eventually we just replaced the work allocation logic.
Due to Storm’s architecture, it was very difficult to get a stack trace or heap dump because the processes that managed workers (Storm supervisor) would often forcefully kill a Java process while it was being investigated in this way.
The propensity for unexpected and subsequently unhandled exceptions to take down an entire worker led to additional defensive verbose error handling everywhere.
This nasty bug STORM-404 coupled with the aforementioned fact that a single exception can take down a worker led to several cascading failures in production, taking down entire topologies until we upgraded to 0.9.4.
Additionally, we found the performance we were getting from Storm for the amount of money we were spending on infrastructure was not in line with our expectations. Much of this is due to the fact that, depending upon how your topology is designed, a single tuple may make multiple hops across JVMs, and this is very expensive. For example, in our time series aggregation topologies a single tuple may be serialized/deserialized and shipped across the wire 3-4 times as it progresses through the processing pipeline.
scalability  storm  kafka  librato  architecture  heron  ops 
october 2015 by jm
Adrian Colyer reviews the Twitter Heron paper
ouch, really sounds like Storm didn't cut the muster. 'It’s hard to imagine something more damaging to Apache Storm than this. Having read it through, I’m left with the impression that the paper might as well have been titled “Why Storm Sucks”, which coming from Twitter themselves is quite a statement.'

If I was to summarise the lessons learned, it sounds like: backpressure is required; and multi-tenant architectures suck.

Update: response from Storm dev ptgoetz here:
storm  twitter  heron  big-data  streaming  realtime  backpressure 
june 2015 by jm
Twitter ditches Storm
in favour of a proprietary ground-up rewrite called Heron. Reading between the lines it sounds like Storm had problems with latency, reliability, data loss, and supporting back pressure.
analytics  architecture  twitter  storm  heron  backpressure  streaming  realtime  queueing 
june 2015 by jm
Twitter's Answers architecture
Twitter's mobile-device analytics service architecture, with Kafka and Storm in full Lambda-Architecture mode
twitter  lambda-architecture  storm  kafka  architecture 
february 2015 by jm
Apache Kafka 0.8 basic training
This is a pretty voluminous and authoritative presentation about getting started with Kafka; wish this was around when we started using it for 0.7. (We use our own homegrown realtime system nowadays, due to better partitioning, monitoring and operability.)
storm  kafka  presentations  documentation  ops 
august 2014 by jm
Real time analytics with Netty, Storm, Kafka
Arch of a fairly typical Kafka/Storm realtime ad-tracking setup, from eClick/mc2ads, via Trustin Lee
via:trustinlee  kafka  storm  netty  architecture  ad-tracking  ads  realtime 
august 2014 by jm
Questioning the Lambda Architecture
Jay Kreps (Kafka, Samza) with a thought-provoking post on the batch/stream-processing dichotomy
jay-kreps  toread  architecture  data  stream-processing  batch  hadoop  storm  lambda-architecture 
july 2014 by jm
Pinterest Secor
Today we’re open sourcing Secor, a zero data loss log persistence service whose initial use case was to save logs produced by our monetization pipeline. Secor persists Kafka logs to long-term storage such as Amazon S3. It’s not affected by S3’s weak eventual consistency model, incurs no data loss, scales horizontally, and optionally partitions data based on date.
pinterest  hadoop  secor  storm  kafka  architecture  s3  logs  archival 
may 2014 by jm
SAMOA, an open source platform for mining big data streams
Yahoo!'s streaming machine learning platform, built on Storm, implementing:

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering. For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.
storm  streaming  big-data  realtime  samoa  yahoo  machine-learning  ml  decision-trees  clustering  bagging  classification 
november 2013 by jm
Storm at - London Storm Meetup 2013-06-18
Not just a Storm success story. Interesting slides indicating where a startup *stopped* using Storm as realtime wasn't useful to their customers
storm  realtime  hadoop  cascading  python  cep  anti-spam  events  architecture  distcomp  low-latency  slides  rabbitmq 
october 2013 by jm
Making Storm fly with Netty | Yahoo Engineering
Y! engineer doubles the speed of Storm's messaging layer by replacing the zeromq implementation with Netty
netty  async  zeromq  storm  messaging  tcp  benchmarks  yahoo  clusters 
october 2013 by jm
Behind the Screens at Loggly
Boost ASIO at the front end (!), Kafka 0.8, Storm, and ElasticSearch
boost  scalability  loggly  logging  ingestion  cep  stream-processing  kafka  storm  architecture  elasticsearch 
september 2013 by jm
Streaming MapReduce with Summingbird
Before Summingbird at Twitter, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembled familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.

While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:

Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing and random-write databases are harder to maintain.

The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the system issues are very different. Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, as well as merge automatically without special consideration of the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn’t yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well understood, transactional behavior from Hadoop, and up to the second additions from Storm. Despite the appeal, the hybrid approach has the following practical problems:

Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client.

The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
Summingbird was developed to provide a general solution to these problems.

Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.
mapreduce  streaming  big-data  twitter  storm  summingbird  scala  pig  hadoop  aggregation  merging 
september 2013 by jm
Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing
Yahoo! are going big with Storm for their next-generation internal cloud platform:

'Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.

• We have enhanced Storm to support Hadoop style security mechanism (including Kerberos authentication), and thus enable Storm applications authorized to access Hadoop datasets on HDFS and HBase.
• Storm is being integrated into Hadoop YARN for resource management. Storm-on-YARN enables Storm applications to utilize the computation resources in our tens of thousands of Hadoop computation nodes. YARN is used to launch Storm application master (Nimbus) on demand, and enables Nimbus to request resources for Storm application slaves (Supervisors).'
yahoo  yarn  cloud-computing  private-clouds  big-data  latency  storm  hadoop  elastic-computing  hbase 
february 2013 by jm
Clairvoyant Squirrel: Large Scale Malicious Domain Classification
Storm-based service to detect malicious DNS domain usage from streaming pcap data in near-real-time. Uses string features in the DNS domain, along with randomness metrics using Markov analysis, combined with a Random Forest classifier, to achieve 98% precision at 10,000 matches/sec
storm  distributed  distcomp  random-forest  classifiers  machine-learning  anti-spam  slides 
february 2013 by jm
Big Data Lambda Architecture
An article by Nathan "Storm" Marz describing the system architecture he's been talking about for a while; Hadoop-driven batch view, Storm-driven "speed view", and a merging API
storm  systems  architecture  lambda-architecture  design  Hadoop 
january 2013 by jm
Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm in Storm
Storm demo with a reasonably complex topology.
'how to implement a distributed, real-time trending topics algorithm in Storm. It uses the latest features available in Storm 0.8 (namely tick tuples) and should be a good starting point for anyone trying to implement such an algorithm for their own application. The new code is now available in the official storm-starter repository, so feel free to take a deeper look.'
storm  distcomp  distributed  tick-tuples  demo 
january 2013 by jm
Trident: a high-level abstraction for realtime computation
built on Storm:

Trident is a new high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.0. It allows you to seamlessly mix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar - Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.
distributed  realtime  twitter  storm  trident  distcomp  stream-processing  low-latency  nathan-marz 
october 2012 by jm
How to beat the CAP theorem
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz  realtime  batch  hadoop  storm  big-data  cap 
october 2011 by jm
Hacker News thread on Storm
lots of good questions and answers in here
twitter  storm  distcomp  distributed 
september 2011 by jm
Storm: distributed and fault-tolerant realtime computation
intro slideshow to this really nifty-looking distcomp platform
distcomp  distributed  realtime  storm  slides  twitter 
september 2011 by jm
'The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.

However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem. Storm fills that hole.'
data  scaling  twitter  realtime  scalability  storm  queueing 
september 2011 by jm

Copy this bookmark: