jm + hive   6

Apache Iceberg (incubating)
Coming to Presto soon, apparently...
Iceberg tracks individual data files in a table instead of directories. This allows writers to create data files in-place and only add files to the table in an explicit commit.

Table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic operation. The table metadata file tracks the table schema, partitioning config, other properties, and snapshots of the table contents.

The atomic transitions from one table metadata file to the next provide snapshot isolation. Readers use the latest table state (snapshot) that was current when they load the table metadata and are not affected by changes until they refresh and pick up a new metadata location.
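
To make the commit model in the excerpt concrete, here is a minimal in-memory sketch, not Iceberg's actual API. The names `TableMetadata`, `Catalog`, `swap`, and `append_files`, and the example S3 path, are hypothetical and chosen only for illustration. It shows the two ideas quoted above: every change produces a new immutable metadata object that replaces the old one in a single atomic step, and a reader keeps seeing the state it loaded until it reloads.

```python
import threading
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Immutable table state; one instance per committed version."""
    version: int
    schema: Tuple[str, ...]
    data_files: Tuple[str, ...]  # individual files are tracked, not directories


class Catalog:
    """Hypothetical holder of the pointer to the current table metadata."""

    def __init__(self, initial: TableMetadata):
        self._current = initial
        self._lock = threading.Lock()

    def load(self) -> TableMetadata:
        # Readers get whatever metadata is current at load time.
        return self._current

    def swap(self, expected: TableMetadata, new: TableMetadata) -> bool:
        """Replace the pointer only if no other commit happened in between."""
        with self._lock:
            if self._current is not expected:
                return False
            self._current = new
            return True


def append_files(catalog: Catalog, new_files: Tuple[str, ...]) -> TableMetadata:
    """Commit files that were already written; retry if another writer won."""
    while True:
        base = catalog.load()
        candidate = TableMetadata(
            base.version + 1, base.schema, base.data_files + new_files
        )
        if catalog.swap(base, candidate):
            return candidate


catalog = Catalog(TableMetadata(0, ("id", "ts"), ()))
reader_view = catalog.load()  # a reader loads the table before the commit
append_files(catalog, ("s3://bucket/data/a.parquet",))  # hypothetical path
```

After the commit, `reader_view` still lists no data files, while a fresh `catalog.load()` sees the new file. A writer that tries to swap against the stale `reader_view` is rejected and has to rebase on the latest metadata.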


excellent -- this will let me obsolete so much of our own code :)
presto  storage  s3  hive  iceberg  apache  asf  data  architecture 
5 weeks ago by jm
how Curator fixed issues with the Hive ZooKeeper Lock Manager Implementation
Ugh, ZK is a bear to work with.
Apache Curator is open source software which is able to handle all of the above scenarios transparently. Curator is a Netflix ZooKeeper Library and it provides a high-level API, CuratorFramework, that simplifies using ZooKeeper. By using a singleton CuratorFramework instance in the new ZooKeeperHiveLockManager implementation, we not only fixed the ZooKeeper connection issues, but also made the code easy to understand and maintain.  
zookeeper  apis  curator  netflix  distributed-locks  coding  hive 
february 2015 by jm
Luigi
A really excellent-looking workflow/orchestration engine for Hadoop, Pig, Hive, Redshift and other ETL jobs, featuring inter-job dependencies, cron-like scheduling, and failure handling. Open source, from Spotify.
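
The inter-job dependency idea can be sketched in a few lines of plain Python. This is not Luigi's API: `Job`, `build`, and the job names are hypothetical, and the sketch leaves out scheduling, failure handling, and cycle detection. It only shows make-style resolution, where upstream jobs run first and anything already complete is skipped.

```python
from typing import Callable, List, Set


class Job:
    """Hypothetical job: a name, its upstream jobs, and an action to run."""

    def __init__(self, name: str, requires: List["Job"], action: Callable[[], None]):
        self.name = name
        self.requires = requires
        self.action = action


def build(job: Job, completed: Set[str], log: List[str]) -> None:
    """Run upstream jobs first, then this job; skip jobs already completed."""
    if job.name in completed:
        return
    for upstream in job.requires:
        build(upstream, completed, log)
    job.action()
    completed.add(job.name)
    log.append(job.name)


# Three placeholder jobs; "report" depends on both of the others.
extract = Job("extract", [], lambda: None)
clean = Job("clean", [extract], lambda: None)
report = Job("report", [clean, extract], lambda: None)

completed: Set[str] = set()
run_order: List[str] = []
build(report, completed, run_order)
```

Asking for `report` runs `extract` once even though two jobs depend on it, and a second `build(report, ...)` call with the same `completed` set does nothing.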
workflow  orchestration  scheduling  cron  spotify  open-source  luigi  redshift  pig  hive  hadoop  emr  jobs  make  dependencies 
july 2014 by jm
Presto: Interacting with petabytes of data at Facebook
Presto has become a major interactive system for the company’s data warehouse. It is deployed in multiple geographical regions and we have successfully scaled a single cluster to 1,000 nodes. The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.

Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
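
For readers unfamiliar with HyperLogLog, the following is a textbook-style sketch of how an approximate distinct count works. It is not Presto's implementation, which the excerpt does not describe. The class name, the choice of SHA-1 as the hash, and the register count (2^12) are assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (assumes p >= 7)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a hash; the top p bits select a register.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # Bias-corrected harmonic mean of 2**register across all registers.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
estimate = hll.count()
```

The sketch uses a fixed 4096 small registers regardless of how many items are added, and re-adding an item never changes the registers, which is why duplicates do not inflate the estimate. With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.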
facebook  hadoop  hdfs  open-source  java  sql  hive  map-reduce  querying  olap 
november 2013 by jm
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
reasonably good whole-stack performance testing and analysis; HBase, Riak, MongoDB, and Cassandra compared. Riak did pretty badly :(
riak  mongodb  cassandra  hbase  performance  analytics  hadoop  hive  big-data  storage  databases  nosql 
february 2013 by jm
The innards of Evernote's new business analytics data warehouse
replacing a giant MySQL star-schema reporting server with a Hadoop/Hive/ParAccel cluster
horizontal-scaling  scalability  bi  analytics  reporting  evernote  via:highscalability  hive  hadoop  paraccel 
december 2012 by jm
