jm + batch (17)

Scheduled Tasks (cron) - Amazon EC2 Container Service
ECS now does cron jobs. But where does AWS Batch fit in, then? Confusing
aws  batch  ecs  cron  scheduling  recurrence  ops 
10 days ago by jm
Best practices with Airflow
interesting presentation describing how to architect Airflow ETL setups; see also https://gtoonstra.github.io/etl-with-airflow/principles.html
etl  airflow  batch  architecture  systems  ops 
october 2016 by jm
Submitting User Applications with spark-submit - AWS Big Data Blog
looks reasonably usable, although EMR's crappy UI is still an issue
emr  big-data  spark  hadoop  yarn  map-reduce  batch 
february 2016 by jm
The world beyond batch: Streaming 101 - O'Reilly Media
To summarize, in this post I’ve:

Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.

Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.

Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.

Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.

Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
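The event-time vs processing-time distinction above is easiest to see in code. A toy sketch (plain Python, not anything from the article; the record layout and `event_time` field are my own invention) of fixed windowing by event time, where late or out-of-order records still land in the window matching when they occurred rather than when they arrived:

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # fixed ("tumbling") one-minute windows

def window_start(event_time: datetime) -> datetime:
    """Truncate the event's own timestamp down to its window boundary."""
    epoch = event_time.timestamp()
    return datetime.fromtimestamp(epoch - (epoch % WINDOW_SECONDS), tz=timezone.utc)

def count_by_event_time(events):
    """Group by when events occurred, not when they were processed.

    A processing-time windowing scheme would instead bucket each record
    by its arrival time, so late data would end up in the wrong window.
    """
    counts = defaultdict(int)
    for record in events:
        counts[window_start(record["event_time"])] += 1
    return dict(counts)
```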
streaming  batch  big-data  lambda-architecture  dataflow  event-processing  cep  millwheel  data  data-processing 
august 2015 by jm
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
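For flavour, this is roughly what the micro-batch model looks like from the PySpark Streaming API: each batch interval produces an RDD whose lineage Spark can recompute on failure. A sketch only, assuming a local Spark install and a text source on localhost:9999 (e.g. `nc -lk 9999`); the checkpoint path is a placeholder:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamSketch")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches
ssc.checkpoint("/tmp/dstream-checkpoint")    # lineage/checkpoint data used for recovery

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```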
rdd  spark  streaming  fault-tolerance  batch  distcomp  papers  big-data  scalability 
june 2015 by jm
Airflow
Airbnb's workflow management system; works off a DAG defined in Python code (ugh). Nice UI, but I think Pinterest's take is neater
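A minimal sketch of what "a DAG defined in Python code" means in practice, using Airflow 1.x-era imports (operator module paths differ in newer releases); the DAG id, schedule and tasks are made up:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_etl_sketch",
    start_date=datetime(2015, 6, 1),
    schedule_interval="0 2 * * *",  # cron-style: 02:00 every night
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract.set_downstream(load)  # the DAG edge: run extract before load
```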
airbnb  open-source  python  workflow  jobs  cron  scheduling  batch 
june 2015 by jm
Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
Here are two helpful guidelines (for largely disjoint populations):

If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]

We think everyone should have to do this, because it leads to better systems and better research.
graph  coding  hadoop  spark  giraph  graph-processing  hardware  scalability  big-data  batch  algorithms  pagerank 
january 2015 by jm
Spark 1.2 released
This is the version with the superfast petabyte-sort record:
Spark 1.2 includes several cross-cutting optimizations focused on performance for large scale workloads. Two new features Databricks developed for our world record petabyte sort with Spark are turned on by default in Spark 1.2. The first is a re-architected network transfer subsystem that exploits Netty 4’s zero-copy IO and off heap buffer management. The second is Spark’s sort based shuffle implementation, which we’ve now made the default after significant testing in Spark 1.1. Together, we’ve seen these features give as much as 5X performance improvement for workloads with very large shuffles.
spark  sorting  hadoop  map-reduce  batch  databricks  apache  netty 
december 2014 by jm
Spark Breaks Previous Large-Scale Sort Record – Databricks
Massive improvement over plain old Hadoop. This blog post goes into really solid techie reasons why, including:
First and foremost, in Spark 1.1 we introduced a new shuffle implementation called sort-based shuffle (SPARK-2045). The previous Spark shuffle implementation was hash-based that required maintaining P (the number of reduce partitions) concurrent buffers in memory. In sort-based shuffle, at any given point only a single buffer is required. This has led to substantial memory overhead reduction during shuffle and can support workloads with hundreds of thousands of tasks in a single stage (our PB sort used 250,000 tasks).


Also, use of Timsort, an external shuffle service to offload from the JVM, Netty, and EC2 SR-IOV.
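For reference, a hedged sketch of the knobs being discussed, roughly as they existed around Spark 1.1/1.2 (the values shown are what 1.2 made the defaults; the external shuffle service also needs to be running on each node, e.g. as a YARN auxiliary service):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-config-sketch")
        .set("spark.shuffle.manager", "sort")                 # sort-based shuffle (1.2 default)
        .set("spark.shuffle.blockTransferService", "netty")   # Netty-based transfer subsystem (1.2 default)
        .set("spark.shuffle.service.enabled", "true"))        # serve shuffle files outside executor JVMs

sc = SparkContext(conf=conf)
```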
spark  hadoop  map-reduce  batch  parallel  sr-iov  benchmarks  performance  netty  shuffle  algorithms  sort-based-shuffle  timsort 
october 2014 by jm
Questioning the Lambda Architecture
Jay Kreps (Kafka, Samza) with a thought-provoking post on the batch/stream-processing dichotomy
jay-kreps  toread  architecture  data  stream-processing  batch  hadoop  storm  lambda-architecture 
july 2014 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event-processor-style architecture for internal metrics. It's notable that Twitter and Google are both apparently moving towards a model where the same code is designed to run in both realtime streaming and batch modes (Summingbird, Millwheel, Flume).
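Very roughly, the idea is to write the aggregation logic once and reuse it for both the batch backfill and the incremental streaming path. A toy Python sketch of that idea, not Twitter's actual API (all names invented):

```python
from collections import Counter

def aggregate(events):
    """Count events per key; associative and commutative, so partial
    results from the batch and streaming paths can be merged."""
    return Counter(e["key"] for e in events)

def merge(a: Counter, b: Counter) -> Counter:
    return a + b

# Batch mode: one pass over a bounded historical dataset.
historical = [{"key": "signup"}, {"key": "login"}, {"key": "login"}]
batch_view = aggregate(historical)

# Streaming mode: the same function applied to each micro-batch as it arrives.
realtime_view = Counter()
for micro_batch in ([{"key": "login"}], [{"key": "signup"}]):
    realtime_view = merge(realtime_view, aggregate(micro_batch))

combined = merge(batch_view, realtime_view)
```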
analytics  architecture  twitter  tsar  aggregation  event-processing  metrics  streaming  hadoop  batch 
june 2014 by jm
Netflix: Your Linux AMI: optimization and performance [slides]
a fantastic bunch of low-level kernel tweaks and tunables which Netflix have found useful in production to maximise productivity of their fleet. Interesting use of SCHED_BATCH process scheduler class for batch processes, in particular. Also, great docs on their experience with perf and SystemTap. Perf really looks like a tool I need to get to grips with...
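For the record, SCHED_BATCH can be set from Python too; a minimal Linux-only sketch (Python 3.3+), roughly equivalent to `chrt --batch 0 <command>` from a shell:

```python
import os

if hasattr(os, "SCHED_BATCH"):
    # SCHED_BATCH requires a static priority of 0.
    os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    print("now running under SCHED_BATCH")
```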
netflix  aws  tuning  ami  perf  systemtap  tunables  sched_batch  batch  hadoop  optimization  performance 
december 2013 by jm
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
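A toy sketch of random-restart hill climbing for histogram bucket boundaries, in the spirit of the quoted approach; the objective here (sum of within-bucket squared error) is my own guess, since the post doesn't give BBVA's actual cost function:

```python
import random

def cost(values, boundaries):
    """Sum of within-bucket squared error for sorted `values` split at index `boundaries`."""
    edges = [0] + sorted(boundaries) + [len(values)]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        bucket = values[lo:hi]
        if bucket:
            mean = sum(bucket) / len(bucket)
            total += sum((v - mean) ** 2 for v in bucket)
    return total

def hill_climb(values, n_buckets, restarts=20, steps=200):
    values = sorted(values)
    best_b, best_c = None, float("inf")
    for _ in range(restarts):                 # random restarts help escape local optima
        b = sorted(random.sample(range(1, len(values)), n_buckets - 1))
        c = cost(values, b)
        for _ in range(steps):                # greedy local moves: nudge one boundary
            cand = list(b)
            i = random.randrange(len(cand))
            cand[i] = max(1, min(len(values) - 1, cand[i] + random.choice((-1, 1))))
            if len(set(cand)) == len(cand) and cost(values, cand) < c:
                b, c = sorted(cand), cost(values, cand)
        if c < best_c:
            best_b, best_c = b, c
    return best_b, best_c
```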
scalability  scaling  voldemort  hadoop  batch  algorithms  histograms  statistics  bucketing  percentiles 
february 2013 by jm
How to beat the CAP theorem
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz  realtime  batch  hadoop  storm  big-data  cap 
october 2011 by jm
