parquet   306


apache/carbondata: Mirror of Apache CarbonData
CarbonData is a columnar file format for HDFS. It offers the features of a modern columnar format, such as splittability, compression, and complex data types, and adds the following unique features:

Stores data along with an index: this can significantly accelerate query performance and reduce I/O scans and CPU usage when the query contains filters. The CarbonData index consists of multiple levels; a processing framework can use it to reduce the number of tasks it needs to schedule, and on the task side it can skip-scan at a finer granularity (a unit called a blocklet) instead of scanning the whole file.
Operable encoded data: by supporting efficient compression and global encoding schemes, CarbonData can run queries directly on compressed/encoded data, converting values only just before results are returned to the user ("late materialization").
Supports various use cases with one single data format: interactive OLAP-style queries, sequential access (big scans), and random access (narrow scans).
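The "late materialization" idea above can be illustrated with a small, self-contained Python sketch. This is not CarbonData's actual implementation, just the concept: a column is dictionary-encoded, the filter is evaluated on the encoded codes, and only the matching rows are decoded right before results are returned.

```python
# Illustrative sketch of late materialization on dictionary-encoded data.
# NOT CarbonData code; the function names are hypothetical.

def dictionary_encode(values):
    """Return (dictionary, codes): each value replaced by an int code."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

def filter_late_materialized(dictionary, codes, predicate):
    """Evaluate the predicate once per dictionary entry, scan only the
    compact integer codes, and decode (materialize) values only for the
    rows that match."""
    matching = {i for i, v in enumerate(dictionary) if predicate(v)}
    return [dictionary[c] for c in codes if c in matching]

cities = ["berlin", "oslo", "berlin", "lima", "oslo", "berlin"]
dictionary, codes = dictionary_encode(cities)
result = filter_late_materialized(dictionary, codes, lambda v: v.startswith("b"))
print(result)  # ['berlin', 'berlin', 'berlin']
```

The filter runs once per distinct value rather than once per row, and decoding work is proportional to the result size, not the table size.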
columnar  storage  parquet  carbondata 
june 2019 by griddell
Spark File Format Showdown – CSV vs JSON vs Parquet – Garren's [Big] Data Blog
Splittable (definition): Spark likes to split a single input file into multiple chunks (partitions, to be precise) so that it can work on many partitions concurrently.
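The definition above can be sketched in a few lines of Python: a splittable file can be divided into independent byte ranges, each readable by a separate task. The 128 MB split size here is an assumption for illustration (in the same ballpark as common default partition sizes), not Spark's actual planning logic.

```python
# Conceptual sketch of dividing a splittable file into partitions.
# split_size is an illustrative assumption, not Spark's real planner.

def plan_splits(file_size, split_size=128 * 1024 * 1024):
    """Return (start, length) byte ranges covering the file, so that
    independent tasks can each read one range concurrently."""
    splits = []
    start = 0
    while start < file_size:
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits

# A ~300 MB file yields three ranges that can be scanned in parallel.
print(plan_splits(300 * 1024 * 1024))
```

A non-splittable file (e.g. gzip-compressed CSV) would instead force a single range covering the whole file, so only one task can read it.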
Spark  Parquet  JSON 
april 2019 by colin.jack
cldellow/sqlite-parquet-vtable: A SQLite vtable extension to read Parquet files
sqlite  parquet 
april 2019 by jberkel
GitHub - jwhitbeck/dendrite: Dendrite is a library for querying large datasets on a single host at near-interactive speeds.
data  querying  large  datasets  file  database  clojure  fast  dremel  record  shredding  parquet  java  jvm 
april 2019 by fmjrey


