jm + shuffle   2

Spark Breaks Previous Large-Scale Sort Record – Databricks
Massive improvement over plain old Hadoop. This blog post goes into really solid techie reasons why, including:
First and foremost, in Spark 1.1 we introduced a new shuffle implementation called sort-based shuffle (SPARK-2045). The previous Spark shuffle implementation was hash-based that required maintaining P (the number of reduce partitions) concurrent buffers in memory. In sort-based shuffle, at any given point only a single buffer is required. This has led to substantial memory overhead reduction during shuffle and can support workloads with hundreds of thousands of tasks in a single stage (our PB sort used 250,000 tasks).

Also, use of Timsort, an external shuffle service to offload from the JVM, Netty, and EC2 SR-IOV.
spark  hadoop  map-reduce  batch  parallel  sr-iov  benchmarks  performance  netty  shuffle  algorithms  sort-based-shuffle  timsort 
october 2014 by jm

Copy this bookmark: