
S3 Inventory Adds Apache ORC output format and Amazon Athena Integration
Interesting to see Amazon are kind of putting their money behind ORC as a new public data interchange format with this.

Update: the Amazon senior PM for Athena and EMR says: 'Actually, we like both ORC and Parquet. Athena can process both ORC and Parquet, and teams can choose if they want to use either.'
orc  formats  data  interchange  s3  athena  output 
21 days ago by jm

Parsing JSON is a Minefield 💣
Crockford chose not to version [the] JSON definition: 'Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.' Yet JSON is defined in at least six different documents.

"Boldest". ffs. :facepalm:
bold  courage  json  parsing  coding  data  formats  interchange  fail  standards  confusion 
october 2016 by jm
"A modern standard for event-oriented data". Avro schema, events have time and type, schema is external and not part of the Avro stream.

'a modern standard for representing event-oriented data in high-throughput operational systems. It uses existing open standards for schema definition and serialization, but adds semantic meaning and definition to make integration between systems easy, while still being size- and processing-efficient.

An Osso event is largely use case agnostic, and can represent a log message, stack trace, metric sample, user action taken, ad display or click, generic HTTP event, or otherwise. Every event has a set of common fields as well as optional key/value attributes that are typically event type-specific.'
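A hypothetical sketch of the event shape the description implies: common fields (timestamp, event type) plus optional type-specific key/value attributes. The field names here are my assumptions for illustration, not Osso's actual schema:

```python
import time

def make_event(event_type, attributes=None, ts=None):
    # Common fields every event carries (names assumed, not Osso's):
    # a timestamp, an event type, and optional type-specific attributes.
    return {
        "timestamp": ts if ts is not None else int(time.time() * 1000),
        "event_type": event_type,
        "attributes": attributes or {},
    }

# The same envelope works for very different use cases:
click = make_event("ad_click", {"ad_id": "1234", "placement": "sidebar"})
log = make_event("log_message", {"level": "WARN", "body": "disk 90% full"})
```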
osso  events  schema  data  interchange  formats  cep  event-processing  architecture 
september 2016 by jm
The problem of managing schemas
Good post on the pain of using CSV/JSON as a data interchange format:
'eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming a few fields. The DBA added new columns to a MySQL table and this reflects in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results in both ugly and unmaintainable code, and in grumpy developers who are tired of having to modify their scripts again and again.'
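The breakage described above is what schema systems like Avro address with reader schemas and field defaults: old records still parse after a column is added. A minimal sketch of that idea (illustrative only, not Avro's actual resolution algorithm; field names are made up):

```python
# The reader declares the fields it expects, with defaults, so records
# written before a column existed still resolve cleanly.
READER_SCHEMA = {
    "user_id": None,         # present from day one
    "email": "",             # added later; old records lack it
    "signup_source": "web",  # added later, with a default
}

def resolve(record, schema=READER_SCHEMA):
    """Project a record onto the reader schema, filling in defaults."""
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"user_id": 42}
print(resolve(old_record))
# {'user_id': 42, 'email': '', 'signup_source': 'web'}
```

The contrast with raw CSV/JSON is that the compatibility logic lives in one declared schema instead of being re-patched into every consuming script.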
schema  json  avro  protobuf  csv  data-formats  interchange  data  hadoop  files  file-formats 
november 2014 by jm