jm + mapreduce   5

Streaming MapReduce with Summingbird
Before Summingbird at Twitter, users that wanted to write production streaming aggregations would typically write their logic using a Hadoop DSL like Pig or Scalding. These tools offered nice distributed system abstractions: Pig resembled familiar SQL, while Scalding, like Summingbird, mimics the Scala collections API. By running these jobs on some regular schedule (typically hourly or daily), users could build time series dashboards with very reliable error bounds at the unfortunate cost of high latency.

While using Hadoop for these types of loads is effective, Twitter is about real-time and we needed a general system to deliver data in seconds, not hours. Twitter’s release of Storm made it easy to process data with very low latencies by sacrificing Hadoop’s fault tolerant guarantees. However, we soon realized that running a fully real-time system on Storm was quite difficult for two main reasons:

Recomputation over months of historical logs must be coordinated with Hadoop or streamed through Storm with a custom log loading mechanism;
Storm is focused on message passing and random-write databases are harder to maintain.

The types of aggregations one can perform in Storm are very similar to what’s possible in Hadoop, but the system issues are very different. Summingbird began as an investigation into a hybrid system that could run a streaming aggregation in both Hadoop and Storm, as well as merge automatically without special consideration of the job author. The hybrid model allows most data to be processed by Hadoop and served out of a read-only store. Only data that Hadoop hasn’t yet been able to process (data that falls within the latency window) would be served out of a datastore populated in real-time by Storm. But the error of the real-time layer is bounded, as Hadoop will eventually get around to processing the same data and will smooth out any error introduced. This hybrid model is appealing because you get well understood, transactional behavior from Hadoop, and up to the second additions from Storm. Despite the appeal, the hybrid approach has the following practical problems:

Two sets of aggregation logic have to be kept in sync in two different systems;
Keys and values must be serialized consistently between each system and the client.

The client is responsible for reading from both datastores, performing a final aggregation and serving the combined results
Summingbird was developed to provide a general solution to these problems.

Very interesting stuff. I'm particularly interested in the design constraints they've chosen to impose to achieve this -- data formats which require associative merging in particular.
mapreduce  streaming  big-data  twitter  storm  summingbird  scala  pig  hadoop  aggregation  merging 
september 2013 by jm
'Highly Sensitive Short Read Mapping with MapReduce'. current state of the art in DNA sequence read-mapping algorithms.
CloudBurst uses well-known seed-and-extend algorithms to map reads to a reference genome. It can map reads with any number of differences or mismatches. [..] Given an exact seed, CloudBurst attempts to extend the alignment into an end-to-end alignment with at most k mismatches or differences by either counting mismatches of the two sequences, or with a dynamic programming algorithm to allow for gaps. CloudBurst uses [Hadoop] to catalog and extend the seeds. In the map phase, the map function emits all length-s k-mers from the reference sequences, and all non-overlapping length-s kmers from the reads. In the shuffle phase, read and reference kmers are brought together. In the reduce phase, the seeds are extended into end-to-end alignments. The power of MapReduce and CloudBurst is the map and reduce functions run in parallel over dozens or hundreds of processors.

JM_SOUGHT -- the next generation ;)
bioinformatics  mapreduce  hadoop  read-alignment  dna  sequencing  sought  antispam  algorithms 
july 2012 by jm
MapReduce Patterns, Algorithms, and Use Cases
'I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found in the web or scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop’s MapReduce model with Mappers, Reduces, Combiners, Partitioners, and sorting.'
algorithms  hadoop  java  mapreduce  patterns  distcomp 
february 2012 by jm

Copy this bookmark: