jm + avro   6

Schema evolution in Avro, Protocol Buffers and Thrift
A good description of this key feature of decent serialization formats; a minimal worked example of the idea follows below.
avro  thrift  protobuf  schemas  serialization  coding  interop  compatibility 
january 2016 by jm
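A minimal sketch of the idea, assuming the fastavro library (pip install fastavro); the User schema and its field names are invented for illustration. A record written under a v1 schema is read back under a v2 schema that adds a field, and Avro's schema resolution fills in the declared default:

    import io
    from fastavro import schemaless_writer, schemaless_reader

    # v1: the schema the data was written with
    schema_v1 = {
        "type": "record",
        "name": "User",
        "fields": [{"name": "name", "type": "string"}],
    }

    # v2: the reader's schema; a new field must carry a default so
    # that data written under v1 stays readable (backward compatibility)
    schema_v2 = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    }

    buf = io.BytesIO()
    schemaless_writer(buf, schema_v1, {"name": "jm"})
    buf.seek(0)

    # schema resolution maps the old data into the new shape
    print(schemaless_reader(buf, schema_v1, schema_v2))
    # -> {'name': 'jm', 'email': None}
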
Avro, mail # dev - bytes and fixed handling in Python implementation - 2014-09-04, 22:54
More Avro trouble with "bytes" fields! Avoid using "bytes" fields in Avro if you plan to interoperate with either of the Python implementations; both fail to marshal them into JSON format correctly. The official "avro" library, for one, produces UTF-8 errors when a non-UTF-8 byte is encountered (the failure mode is sketched below).
bytes  avro  marshalling  fail  bugs  python  json  utf-8 
march 2015 by jm
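The failure mode is easy to reproduce in plain Python; this illustrates the class of bug, not the avro library's actual code path. JSON is text, so an encoder that decodes raw bytes as UTF-8 blows up on arbitrary binary, while the Avro spec's JSON encoding maps bytes through ISO-8859-1, which is defined for every byte value:

    import json

    payload = b"\xde\xad\xbe\xef"  # arbitrary binary; not valid UTF-8

    try:
        json.dumps({"blob": payload.decode("utf-8")})
    except UnicodeDecodeError as err:
        print("UTF-8 error, as the Python Avro JSON encoders hit:", err)

    # per the Avro spec, bytes map to JSON strings via ISO-8859-1
    # (latin-1), which accepts all 256 byte values and round-trips
    print(json.dumps({"blob": payload.decode("latin-1")}))
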
Kafka best practices
This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.

tl;dr: limit the number of Kafka clusters; use Avro. (A producer sketch follows below.)
architecture  kafka  storage  streaming  event-processing  avro  schema  confluent  best-practices  tips 
march 2015 by jm
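A hedged sketch of the "use Avro" advice, assuming kafka-python and fastavro are installed and a broker is running on localhost:9092; the topic name and schema are invented for illustration. The point is that every producer serializes against one agreed schema instead of shipping ad-hoc JSON:

    import io
    from fastavro import schemaless_writer
    from kafka import KafkaProducer

    # the one schema all producers of this topic agree on
    schema = {
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }

    def encode(record):
        # Avro binary encoding of a single record, no container file
        buf = io.BytesIO()
        schemaless_writer(buf, schema, record)
        return buf.getvalue()

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", value=encode({"url": "/index", "ts": 1425168000000}))
    producer.flush()
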
The problem of managing schemas
Good post on the pain of using CSV/JSON as a data interchange format (a toy illustration follows below):
eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. The DBA adds new columns to a MySQL table and this is reflected in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results in both ugly and unmaintainable code, and in grumpy developers who are tired of having to modify their scripts again and again.
schema  json  avro  protobuf  csv  data-formats  interchange  data  hadoop  files  file-formats 
november 2014 by jm
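A toy illustration of the quoted failure (an invented example, not from the post): a positional CSV parser that was correct yesterday breaks as soon as the new column shows up in the dump:

    row_v1 = "42,alice,2014-11-01"
    row_v2 = "42,alice,PREMIUM,2014-11-01"  # a new 'tier' column appears

    def parse(line):
        user_id, name, signup = line.split(",")  # positions are the "schema"
        return {"id": int(user_id), "name": name, "signup": signup}

    print(parse(row_v1))  # works
    try:
        parse(row_v2)
    except ValueError as err:
        print("schema changed, parser broke:", err)
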
Integrating Kafka and Spark Streaming: Code Examples and State of the Game
Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. [...] I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization. In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.
spark  kafka  realtime  architecture  queues  avro  bijection  batch-processing 
october 2014 by jm
VInt
'A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.' A UTF-8-ish variable-length encoding, used in Lucene and Avro; transcribed into code below.
utf8  compression  utf  lucene  avro  hadoop  java  formats  numeric  from delicious
november 2009 by jm
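The rule transcribed into code: a sketch, not the Lucene or Avro source (and note that Avro additionally zig-zag-encodes signed values before applying this scheme):

    def write_vint(value: int) -> bytes:
        # low-order seven bits per byte; high bit set while more bytes remain
        assert value >= 0
        out = bytearray()
        while True:
            low = value & 0x7F
            value >>= 7
            if value:
                out.append(low | 0x80)  # continuation bit
            else:
                out.append(low)
                return bytes(out)

    def read_vint(data: bytes) -> int:
        result, shift = 0, 0
        for byte in data:
            result |= (byte & 0x7F) << shift  # increasingly significant bits
            if not byte & 0x80:
                break
            shift += 7
        return result

    assert len(write_vint(127)) == 1      # zero..127: one byte
    assert len(write_vint(128)) == 2      # 128..16,383: two bytes
    assert read_vint(write_vint(16383)) == 16383
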
