jm + interchange (4 bookmarks)
S3 Inventory Adds Apache ORC output format and Amazon Athena Integration
november 2017 by jm
Interesting to see Amazon are kind of putting their money behind ORC as a new public data interchange format with this.
Update: the Amazon senior PM for Athena and EMR says: 'Actually, we like both ORC and Parquet. Athena can process both ORC and Parquet, and teams can choose if they want to use either.' -- https://twitter.com/abysinha/status/932700622540849152
orc
formats
data
interchange
s3
athena
output
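As a rough illustration (not from the announcement itself, which is about the S3 Inventory feature): the same columnar table can be written out as either ORC or Parquet and then queried by Athena. The sketch below assumes pyarrow, and the column names and file paths are made-up stand-ins for the real S3 Inventory schema.

```python
# Minimal sketch: write one table as both ORC and Parquet with pyarrow.
# pyarrow is an assumption; the columns are a toy stand-in for S3 Inventory.
import pyarrow as pa
import pyarrow.orc as orc
import pyarrow.parquet as pq

table = pa.table({
    "bucket": ["my-bucket", "my-bucket"],
    "key": ["logs/2017-11-01.gz", "logs/2017-11-02.gz"],
    "size_bytes": [1024, 2048],
})

orc.write_table(table, "inventory.orc")        # columnar ORC output
pq.write_table(table, "inventory.parquet")     # columnar Parquet output
```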
seriot.ch - Parsing JSON is a Minefield 💣
"Boldest". ffs. :facepalm:
bold
courage
json
parsing
coding
data
formats
interchange
fail
standards
confusion
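A quick sketch of the kind of divergence the article catalogues, using only Python's stdlib json module; other parsers make different choices on each of these inputs.

```python
import json

# Duplicate keys: RFC 8259 only says names "SHOULD" be unique;
# Python's parser silently keeps the last value.
print(json.loads('{"a": 1, "a": 2}'))        # {'a': 2}

# NaN/Infinity aren't valid JSON per the RFC, but the stdlib accepts
# (and emits) them by default.
print(json.loads('NaN'))                      # nan

# Numbers: the grammar sets no precision limit, but many parsers decode
# to IEEE-754 doubles and lose digits; Python keeps ints exact yet
# overflows a huge float literal to infinity.
print(json.loads('10000000000000000001'))     # exact int here
print(json.loads('1e400'))                    # inf
```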
Osso
september 2016 by jm
"A modern standard for event-oriented data". Avro schema, events have time and type, schema is external and not part of the Avro stream.
'a modern standard for representing event-oriented data in high-throughput operational systems. It uses existing open standards for schema definition and serialization, but adds semantic meaning and definition to make integration between systems easy, while still being size- and processing-efficient.
An Osso event is largely use case agnostic, and can represent a log message, stack trace, metric sample, user action taken, ad display or click, generic HTTP event, or otherwise. Every event has a set of common fields as well as optional key/value attributes that are typically event type-specific.'
osso
events
schema
data
interchange
formats
cep
event-processing
architecture
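For a rough idea of the shape being described, here's a sketch of an Osso-style event as an Avro record. This is my own illustration, not the actual Osso schema: fastavro and the field names are assumptions, and schemaless_writer/schemaless_reader are used so the schema stays external to the byte stream, as the description says.

```python
# Illustrative guess at an Osso-style event record; not the real schema.
import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "namespace": "example.osso_like",
    "fields": [
        {"name": "event_type_id", "type": "int"},
        {"name": "timestamp", "type": "long"},          # epoch millis
        {"name": "host", "type": "string"},
        {"name": "attributes",                          # event-type-specific k/v pairs
         "type": {"type": "map", "values": "string"}},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {
    "event_type_id": 100,
    "timestamp": 1473984000000,
    "host": "web-01",
    "attributes": {"status": "500", "path": "/checkout"},
})

buf.seek(0)
print(schemaless_reader(buf, schema))   # schema supplied out-of-band, not in the stream
```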
The problem of managing schemas
november 2014 by jm
Good post on the pain of using CSV/JSON as a data interchange format:
'eventually, the schema changes. Someone refactors the code generating the JSON and moves fields around, perhaps renaming few fields. The DBA added new columns to a MySQL table and this reflects in the CSVs dumped from the table. Now all those applications and scripts must be modified to handle both file formats. And since schema changes happen frequently, and often without warning, this results in both ugly and unmaintainable code, and in grumpy developers who are tired of having to modify their scripts again and again.'
schema
json
avro
protobuf
csv
data-formats
interchange
data
hadoop
files
file-formats
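To make the contrast concrete, here's a sketch (assuming fastavro; the record and field names are invented) of how an Avro reader schema with a default absorbs exactly this kind of change, so files written by old code keep reading cleanly.

```python
# Sketch of Avro schema evolution: a v2 reader schema adds a field with a
# default, so data written under the v1 schema still resolves.
import io
from fastavro import parse_schema, reader, writer

v1 = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

v2 = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},  # new column
    ],
})

buf = io.BytesIO()
writer(buf, v1, [{"id": 1, "name": "alice"}])   # data produced by old code

buf.seek(0)
for rec in reader(buf, reader_schema=v2):        # consumed with the new schema
    print(rec)                                   # {'id': 1, 'name': 'alice', 'email': ''}
```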