jm + event-processing   12

"A modern standard for event-oriented data". Avro schema, events have time and type, schema is external and not part of the Avro stream.

'a modern standard for representing event-oriented data in high-throughput operational systems. It uses existing open standards for schema definition and serialization, but adds semantic meaning and definition to make integration between systems easy, while still being size- and processing-efficient.

An Osso event is largely use case agnostic, and can represent a log message, stack trace, metric sample, user action taken, ad display or click, generic HTTP event, or otherwise. Every event has a set of common fields as well as optional key/value attributes that are typically event type-specific.'
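A rough sketch of the event shape described above: a few common fields plus optional key/value attributes. The field names (`event_type`, `timestamp_ms`, `attributes`) and the `make_log_event` helper are my own assumptions for illustration, not the actual Osso Avro schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Event:
    # Common fields carried by every event (hypothetical names).
    event_type: str
    timestamp_ms: int
    # Optional key/value attributes, typically specific to the event type.
    attributes: Dict[str, str] = field(default_factory=dict)


def make_log_event(timestamp_ms: int, message: str, level: str = "INFO") -> Event:
    """Build a log-message event; other event types would use other attributes."""
    return Event("log", timestamp_ms, {"message": message, "level": level})
```

The same `Event` type could carry a metric sample or an ad click just by changing `event_type` and the attribute keys, which is roughly what "use case agnostic" seems to mean here.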
osso  events  schema  data  interchange  formats  cep  event-processing  architecture 
september 2016 by jm
The world beyond batch: Streaming 101 - O'Reilly Media
To summarize, in this post I’ve:

Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.

Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.

Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.

Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.

Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
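To make the event-time vs. processing-time distinction concrete, here is a minimal sketch of fixed windowing under both notions of time. The record layout `(event_time, processing_time, value)` and the function name `window_sums` are assumptions of mine, not anything taken from the article.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record is (event_time, processing_time, value); layout is hypothetical.
Record = Tuple[int, int, int]


def window_sums(records: Iterable[Record], size: int, by_event_time: bool) -> Dict[int, int]:
    """Sum values into fixed windows of `size` time units, keyed by window start."""
    sums: Dict[int, int] = defaultdict(int)
    for event_time, processing_time, value in records:
        # Choose which clock decides the window a record falls into.
        ts = event_time if by_event_time else processing_time
        window_start = (ts // size) * size
        sums[window_start] += value
    return dict(sums)


# The second record occurred at t=4 but was only processed at t=11.
records = [(1, 2, 10), (4, 11, 5), (12, 13, 1)]
by_event = window_sums(records, 10, by_event_time=True)
by_processing = window_sums(records, 10, by_event_time=False)
```

The late-arriving record lands in the first window when grouping by event time, but in the second when grouping by processing time, so the two approaches give different sums for the same data.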
streaming  batch  big-data  lambda-architecture  dataflow  event-processing  cep  millwheel  data  data-processing 
august 2015 by jm
AWS Lambda Event-Driven Architecture With Amazon SNS
Any message posted to an SNS topic can trigger the execution of custom code you have written, but you don’t have to maintain any infrastructure to keep that code available to listen for those events and you don’t have to pay for any infrastructure when the code is not being run. This is, in my opinion, the first time that Amazon can truly say that AWS Lambda is event-driven, as we now have a central, independent, event management system (SNS) where any authorized entity can trigger the event (post a message to a topic) and any authorized AWS Lambda function can listen for the event, and neither has to know about the other.
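A minimal sketch of what the listening side might look like as a Python Lambda function. The handler name and return value are assumptions; the `Records[].Sns.Message` layout is the shape in which SNS notifications are delivered to Lambda.

```python
def handler(event, context):
    """Collect the message body of each SNS notification in the invocation."""
    messages = [record["Sns"]["Message"] for record in event.get("Records", [])]
    for message in messages:
        # Real code would act on the message here; printing goes to the function's logs.
        print(message)
    return {"processed": len(messages)}
```

The publisher only needs to know the topic, and the function only needs to be subscribed to it, which is the decoupling the post is pointing at.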
aws  ec2  lambda  sns  events  cep  event-processing  coding  cloud  hacks  eric-hammond 
april 2015 by jm
Kafka best practices
This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.

tl;dr: limit the number of Kafka clusters; use Avro.
architecture  kafka  storage  streaming  event-processing  avro  schema  confluent  best-practices  tips 
march 2015 by jm
Announcing Confluent, A Company for Apache Kafka And Realtime Data
Jay Kreps, Neha Narkhede, and Jun Rao are leaving LinkedIn to form a Kafka-oriented realtime event processing company.
realtime  event-processing  logs  kafka  streaming  open-source  jay-kreps  jun-rao  confluent 
november 2014 by jm
All Data Are Belong to AWS: Streaming upload via Fluentd
Fluentd looks like a decent foundation for tailing/streaming event processing in Ruby, supporting batched output to S3, plus output to a bunch of other AWS services, Kafka, and RabbitMQ. Claims to have OK performance, despite its Rubbitude. However, its high-availability story is shite, so it's not to be used where availability is important.
ruby  rabbitmq  kafka  tail  event-streaming  cep  event-processing  s3  aws  sqs  fluentd 
august 2014 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event processor-style architecture for internal metrics. It's notable that Twitter and Google are now both apparently moving towards a model in which the same code is designed to run equally well in realtime streaming and batch modes (Summingbird, MillWheel, Flume).
analytics  architecture  twitter  tsar  aggregation  event-processing  metrics  streaming  hadoop  batch 
june 2014 by jm
Grape: a realtime processing engine, built on a persistent queue and a set of workers. 'The main goal is data availability and persistency. We created grape for those who cannot afford losing data'. It does this by allowing infinite expansion of the pending queue in Elliptics, their Dynamo-like horizontally-scaled storage backend.
kafka  queue  queueing  storage  realtime  fault-tolerance  grape  cep  event-processing 
november 2013 by jm
_MillWheel: Fault-Tolerant Stream Processing at Internet Scale_ [paper, pdf]
from VLDB 2013:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
millwheel  google  data-processing  cep  low-latency  fault-tolerance  scalability  papers  event-processing  stream-processing 
august 2013 by jm
Data distribution in the cloud with Node.js
Very interesting presentation from ex-IONAian Darach Ennis of Push Technology on eep.js, embedded event processing in JavaScript for Node.js stream processing. Handles tumbling, monotonic, periodic and sliding windows at 8-40 million events per second; no multi-dimensional, infinite or predicate event-processing windows. (via Sergio Bossa)
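For anyone unfamiliar with the window types mentioned, a small sketch of count-based tumbling and sliding windows follows. This is plain Python with function names of my own choosing; it does not reproduce the eep.js API, and it only covers two of the window kinds listed.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def tumbling_sum(values: Iterable[int], size: int) -> Iterator[int]:
    """Emit one sum per non-overlapping block of `size` events, then reset."""
    total, count = 0, 0
    for value in values:
        total += value
        count += 1
        if count == size:
            yield total
            total, count = 0, 0
    # A trailing partial block is dropped rather than emitted.


def sliding_sum(values: Iterable[int], size: int) -> Iterator[int]:
    """Emit the sum of the last `size` events on every event once the window is full."""
    window: Deque[int] = deque()
    total = 0
    for value in values:
        window.append(value)
        total += value
        if len(window) > size:
            # Evict the oldest event so the window stays at `size` entries.
            total -= window.popleft()
        if len(window) == size:
            yield total
```

A tumbling window produces one result per block of events, while a sliding window produces a result per event, which is why sliding windows tend to be the more expensive of the two.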
via:sbtourist  events  event-processing  streaming  data  ex-iona  darach-ennis  push-technology  cep  javascript  node.js  streams 
october 2012 by jm
