jm + streams   26

Kafka Streams - Scaling up or down
this is a nice zero-config scaling story -- good work Kafka Streams
scaling  scalability  architecture  kafka  streams  ops 
october 2016 by jm
a high-performance multiple regex matching library. Hyperscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.

Via Tony Finch
via:fanf  regexps  regex  dpi  hyperscan  dfa  nfa  hybrid-automata  text-matching  matching  text  strings  streams 
october 2015 by jm
'Seekable and Splittable Gzip', from eBay
ebay  gzip  compression  seeking  streams  splitting  logs  gzinga 
october 2015 by jm
Evolution of Babbel’s data pipeline on AWS: from SQS to Kinesis
Good "here's how we found it" blog post:

Our new data pipeline with Kinesis in place allows us to plug new consumers without causing any damage to the current system, so it’s possible to rewrite all Queue Workers one by one and replace them with Kinesis Workers. In general, the transition to Kinesis was smooth and there were not so tricky parts.
Another outcome was significantly reduced costs – handling almost the same amount of data as SQS, Kinesis appeared to be many times cheaper than SQS.
aws  kinesis  kafka  streaming  data-pipelines  streams  sqs  queues  architecture  kcl 
september 2015 by jm
Mining High-Speed Data Streams: The Hoeffding Tree Algorithm
This paper proposes a decision tree learner for data streams, the Hoeffding Tree algorithm, which comes with the guarantee that the learned decision tree is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples. This work constitutes a significant step in developing methodology suitable for modern ‘big data’ challenges and has initiated a lot of follow-up research. The Hoeffding Tree algorithm has been covered in various textbooks and is available in several public domain tools, including the WEKA Data Mining platform.
hoeffding-tree  algorithms  data-structures  streaming  streams  cep  decision-trees  ml  learning  papers 
august 2015 by jm
"last seen" sketch
a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.
sketch  algorithms  estimation  approximation  sampling  streams  big-data 
july 2015 by jm
'Utilities that help bridge the gap between Java 8 and Google Guava. Guava has the {@link FluentIterable} concept which is similar to streams. In many ways, fluent iterable is nicer, because it directly binds to the immutable collection classes. However, on balance it seems wise to use the stream API rather than {@code FluentIterable} in Java 8.'
guava  java-8  java  fluentiterable  streams  fluent  coding 
april 2015 by jm
Reactive Programming for a demanding world
"building event-driven and responsive applications with RxJava", slides by Mario Fusco. Good info on practical Rx usage in Java
rxjava  rx  reactive  coding  backpressure  streams  observables 
march 2015 by jm
The official REST Proxy for Kafka
The REST Proxy is an open source HTTP-based proxy for your Kafka cluster. The API supports many interactions with your cluster, including producing and consuming messages and accessing cluster metadata such as the set of topics and mapping of partitions to brokers. Just as with Kafka, it can work with arbitrary binary data, but also includes first-class support for Avro and integrates well with Confluent’s Schema Registry. And it is scalable, designed to be deployed in clusters and work with a variety of load balancing solutions.

We built the REST Proxy first and foremost to meet the growing demands of many organizations that want to use Kafka, but also want more freedom to select languages beyond those for which stable native clients exist today. However, it also includes functionality beyond traditional clients, making it useful for building tools for managing your Kafka cluster. See the documentation for a more detailed description of the included features.
kafka  rest  proxies  http  confluent  queues  messaging  streams  architecture 
march 2015 by jm
[KAFKA-1555] provide strong consistency with reasonable availability
Major improvements for Kafka consistency coming in 0.8.2; replication to multiple in-sync replicas, controlled by a new "min.isr" setting
kafka  replication  cap  consistency  streams 
october 2014 by jm
"Quantiles on Streams" [paper, 2009]
'Chiranjeeb Buragohain and Subhash Suri: "Quantiles on Streams" in Encyclopedia of Database Systems, Springer, pp 2235–2240, 2009. ISBN: 978-0-387-35544-3', cited by Martin Kleppman in as a good, short literature survey re estimating percentiles with a small memory footprint.
latency  percentiles  coding  quantiles  streams  papers  algorithms 
october 2014 by jm
Collection Pipeline
a nice summarisation of the state of pipe/stream-oriented collection operations in various languages, from Martin Fowler
martin-fowler  patterns  coding  ruby  clojure  streams  pipelines  pipes  unix  lambda  fp  java  languages 
july 2014 by jm
Spark Streaming
an extension of the core Spark API that allows enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets and be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s in-built machine learning algorithms, and graph processing algorithms on data streams.
spark  streams  stream-processing  cep  scalability  apache  machine-learning  graphs 
may 2014 by jm
Sketch of the Day – Frugal Streaming
ha, this is very clever! If you have enough volume, this is a nice estimation algorithm to compute stream quantiles in very little RAM
memory  streaming  stream-processing  clever  algorithms  hacks  streams 
september 2013 by jm
Sketch of the Day: K-Minimum Values
Another sketching algorithm -- this one supports set union and intersection operations more easily than HyperLogLog when there are more than 2 sets
algorithms  coding  space-saving  cardinality  streams  stream-processing  estimation  sets  sketching 
june 2013 by jm
Approximate Heavy Hitters -The SpaceSaving Algorithm
nice, readable intro to SpaceSaving (which I've linked to before) -- a simple stream-processing cardinality top-K estimation algorithm with bounded error.
algorithms  coding  space-saving  cardinality  streams  stream-processing  estimation 
may 2013 by jm
good blog post on histogram-estimation stream processing algorithms
After reviewing several dozen papers, a score or so in depth, I identified two data structures that appear to enable us to answer these recency and frequency queries: exponential histograms (from "Maintaining Stream Statistics Over Sliding Windows" by Datar et al.) and waves (from "Distributed Streams Algorithms for Sliding Windows" by Gibbons and Tirthapura). Both of these data structures are used to solve the so-called counting problem, the problem of determining, with a bound on the relative error, the number of 1s in the last N units of time. In other words, the data structures are able to answer the question: how many 1s appeared in the last n units of time within a factor of Error (e.g., 50%). The algorithms are neat, so I'll present them briefly.
streams  streaming  stream-processing  histograms  percentiles  estimation  waves  statistics  algorithms 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]
'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'
q-digest  algorithms  streams  approximation  histograms  median  percentiles  quantiles 
february 2013 by jm
clearspring / stream-lib
ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k
algorithms  coding  streams  cep  stream-processing  approximation  probabilistic  space-saving  top-k  cardinality  estimation  bloom-filters  q-digest  loglog  hyperloglog  murmurhash  lookup3 
february 2013 by jm
'Efficient Computation of Frequent and Top-k Elements in Data Streams' [paper, PDF]
The Space-Saving algorithm to compute top-k in a stream. I've been asking a variation of this problem as an interview question for a while now, pretty cool to find such a neat solution. Pity neither myself nor anyone I've interviewed has come up with it ;)
space-saving  approximation  streams  stream-processing  cep  papers  pdf  algorithms 
february 2013 by jm
Real-time Analytics in Scala [slides, PDF]
some good approximation/streaming algorithms and tips on Scala implementation
streams  algorithms  approximation  coding  scala  slides 
february 2013 by jm
Data distribution in the cloud with Node.js
Very interesting presentation from ex-IONAian Darach Ennis of Push Technology on eep.js, embedded event processing in Javascript for node.js stream processing. Handles tumbling, monotonic, periodic and sliding windows at 8-40 million events per second; no multi-dimensional, infinite or predicate event-processing windows. (via Sergio Bossa)
via:sbtourist  events  event-processing  streaming  data  ex-iona  darach-ennis  push-technology  cep  javascript  node.js  streams 
october 2012 by jm

related tags

aggregation  algorithms  apache  approximation  architecture  aws  backpressure  benchmarks  big-data  binary-heap  bloom-filters  cap  cardinality  cep  clever  clojure  coding  compression  confluent  consistency  count-min  counting  cuckoo-filters  darach-ennis  data  data-pipelines  data-structures  decision-trees  demos  dfa  distcomp  distributed  dpi  ebay  estimation  event-processing  events  ex-iona  fluent  fluentiterable  fp  frequency  graphs  guava  gzinga  gzip  hacks  histogram  histograms  hll  hoeffding-tree  http  hybrid-automata  hyperloglog  hyperscan  java  java-8  javascript  kafka  kcl  kinesis  lambda  languages  latency  learning  loglog  logs  lookup3  machine-learning  martin-fowler  matching  median  memory  messaging  minhash  ml  murmurhash  nfa  node.js  observables  ops  papers  patterns  pdf  percentiles  performance  pipelines  pipes  priority-queue  probabilistic  proxies  push-technology  q-digest  quantiles  queues  quickselect  reactive  regex  regexps  replication  rest  ruby  rx  rxjava  sampling  scala  scalability  scaling  seeking  sets  sketch  sketches  sketching  slides  space-saving  spark  splitting  sqs  statistics  stream-processing  streaming  streams  strings  text  text-matching  top-k  unix  via:fanf  via:sbtourist  waves 

Copy this bookmark: