
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
[We at LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.

neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin  oss  hadoop  spark  performance  tuning  ops 
april 2016 by jm
Submitting User Applications with spark-submit - AWS Big Data Blog
looks reasonably usable, although EMR's crappy UI is still an issue
emr  big-data  spark  hadoop  yarn  map-reduce  batch 
february 2016 by jm
Analysing user behaviour - from histograms to random forests (PyData) at PyCon Ireland 2015 | Lanyrd
Swrve's own Dave Brodigan on game user-data analysis techniques:
The goal is to give the audience a roadmap for analysing user data using Python-friendly tools.

I will touch on many aspects of the data science pipeline from data cleansing to building predictive data products at scale.

I will start gently with pandas and dataframes, then discuss some machine-learning techniques like k-means and random forests in scikit-learn, and then introduce Spark for doing it at scale.

I will focus more on the use cases rather than detailed implementation.

The talk will be informed by my experience and focus on user behaviour in games and mobile apps.
swrve  talks  user-data  big-data  spark  hadoop  machine-learning  data-science 
october 2015 by jm
How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code
using Spark, Tesseract, HBase, Solr and Leptonica. Actually pretty feasible
spark  tesseract  hbase  solr  leptonica  pdfs  scanning  cloudera  hadoop  architecture 
october 2015 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
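For flavour, a minimal sketch of what that lineage tracking looks like from the API side (the input file is a placeholder); `toDebugString` prints the recorded lineage graph:

<code>// spark-shell style sketch; `sc` is the shell's SparkContext
val counts = sc.textFile("events.log")   // hypothetical input file
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// Each transformation above extended the lineage graph rather than copying
// data; a lost partition is rebuilt by replaying those steps over its inputs.
println(counts.toDebugString)   // prints the recorded lineage</code>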
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
"Trash Day: Coordinating Garbage Collection in Distributed Systems"
Another GC-coordination strategy, similar to Blade (qv), with some real-world examples using Cassandra
blade  via:adriancolyer  papers  gc  distsys  algorithms  distributed  java  jvm  latency  spark  cassandra 
may 2015 by jm
Can Spark Streaming survive Chaos Monkey?
good empirical results on Spark's resilience to network/host outages in EC2
ec2  aws  emr  spark  resilience  ha  fault-tolerance  chaos-monkey  netflix 
march 2015 by jm
Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
Here are two helpful guidelines (for largely disjoint populations):

If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]

We think everyone should have to do this, because it leads to better systems and better research.
graph  coding  hadoop  spark  giraph  graph-processing  hardware  scalability  big-data  batch  algorithms  pagerank 
january 2015 by jm
Spark 1.2 released
This is the version with the superfast petabyte-sort record:
Spark 1.2 includes several cross-cutting optimizations focused on performance for large-scale workloads. Two new features Databricks developed for our world-record petabyte sort with Spark are turned on by default in Spark 1.2. The first is a re-architected network transfer subsystem that exploits Netty 4’s zero-copy IO and off-heap buffer management. The second is Spark’s sort-based shuffle implementation, which we’ve now made the default after significant testing in Spark 1.1. Together, we’ve seen these features give as much as 5X performance improvement for workloads with very large shuffles.
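For reference, a sketch of the relevant Spark 1.x config knobs, set explicitly via SparkConf (both values below are already the 1.2 defaults; shown only to make the two quoted features concrete):

<code>import org.apache.spark.SparkConf

// Spark 1.x config keys; both settings are the Spark 1.2 defaults, spelled
// out here only to make the two features from the quote concrete.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")               // sort-based shuffle
  .set("spark.shuffle.blockTransferService", "netty") // Netty 4 transfer subsystem</code>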
spark  sorting  hadoop  map-reduce  batch  databricks  apache  netty 
december 2014 by jm
Roaring Bitmaps
Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.

Roaring bitmaps are used in Apache Lucene (as of version 5.0 using an independent implementation) and Apache Spark (as of version 1.2).
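A quick sketch of the Java library's API (artifact org.roaringbitmap:RoaringBitmap), which works fine from Scala; the IDs are made up:

<code>import org.roaringbitmap.RoaringBitmap

// two compressed bitmaps of (hypothetical) user IDs
val today     = RoaringBitmap.bitmapOf(1, 3, 110, 1000000)
val yesterday = RoaringBitmap.bitmapOf(3, 4, 1000000)

val returning = RoaringBitmap.and(today, yesterday) // set intersection
println(returning.getCardinality)                   // 2
println(returning.contains(3))                      // true</code>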
bitmaps  bitsets  sets  data-structures  bits  compression  lucene  spark  daniel-lemire  algorithms 
november 2014 by jm
Spark Breaks Previous Large-Scale Sort Record – Databricks
Massive improvement over plain old Hadoop. This blog post goes into really solid techie reasons why, including:
First and foremost, in Spark 1.1 we introduced a new shuffle implementation called sort-based shuffle (SPARK-2045). The previous Spark shuffle implementation was hash-based and required maintaining P (the number of reduce partitions) concurrent buffers in memory. In sort-based shuffle, at any given point only a single buffer is required. This has led to substantial memory overhead reduction during shuffle and can support workloads with hundreds of thousands of tasks in a single stage (our PB sort used 250,000 tasks).

Also notable: the use of Timsort, an external shuffle service to offload shuffle-file serving from the executor JVMs, Netty, and EC2 SR-IOV.
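That external shuffle service is a one-line config switch (plus running the service itself on each node, which I'm omitting); a sketch:

<code>import org.apache.spark.SparkConf

// Hand shuffle-file serving to a separate per-node process, so map output
// stays fetchable even while the executor JVM is paused in GC or removed.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")</code>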
spark  hadoop  map-reduce  batch  parallel  sr-iov  benchmarks  performance  netty  shuffle  algorithms  sort-based-shuffle  timsort 
october 2014 by jm
Integrating Kafka and Spark Streaming: Code Examples and State of the Game
Spark Streaming has been getting some attention lately as a real-time data processing tool, often mentioned alongside Apache Storm. [...] I added an example Spark Streaming application to kafka-storm-starter that demonstrates how to read from Kafka and write to Kafka, using Avro as the data format and Twitter Bijection for handling the data serialization. In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state of Kafka integration in Spark Streaming. All this with the disclaimer that this happens to be my first experiment with Spark Streaming.
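The receiver-based integration from that era boils down to something like this sketch (ZooKeeper address, consumer group, and topic are placeholders; needs the spark-streaming-kafka artifact on the classpath):

<code>import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("kafka-demo").setMaster("local[2]"), Seconds(1))

// (zkQuorum, consumer group, topic -> receiver threads); yields (key, message) pairs
val lines = KafkaUtils.createStream(ssc, "zk1:2181", "demo-group", Map("demo-topic" -> 1))
  .map(_._2)   // keep just the message payload

lines.count().print()
ssc.start()
ssc.awaitTermination()</code>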
spark  kafka  realtime  architecture  queues  avro  bijection  batch-processing 
october 2014 by jm
Spark Streaming
an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or plain old TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s in-built machine learning algorithms and graph processing algorithms on data streams.
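The canonical shape of a DStream job, as a sketch (host and port are placeholders):

<code>import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("windowed-wc").setMaster("local[2]"), Seconds(1))

val counts = ssc.socketTextStream("localhost", 9999)   // plain old TCP socket source
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30))            // counts over a sliding 30s window

counts.print()
ssc.start()
ssc.awaitTermination()</code>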
spark  streams  stream-processing  cep  scalability  apache  machine-learning  graphs 
may 2014 by jm
Spark - A small web framework for Java
A Sinatra-like minimal web framework built on Java 8 lambdas:

<code>// assumes the Spark (Java web framework) jar on the classpath
import static spark.Spark.get;

public class HelloWorld {
    public static void main(String[] args) {
        get("/hello", (request, response) -> "Hello World!");
    }
}</code>
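Running this starts Spark's embedded Jetty server, which listens on port 4567 by default, so http://localhost:4567/hello returns the greeting.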
via:sampullara  web  java  sinatra  lambdas  closures  java8  spark 
may 2014 by jm
