Using AWS Batch to Generate Mapzen Terrain Tiles · Mapzen
mapzen
mapping
tiles
batch
aws
s3
lambda
docker
Using this setup on AWS Batch, we are able to generate more than 3.75 million tiles per minute and render the entire world in less than a week! These pre-rendered tiles get stored in S3 and are ready to use by anyone through the AWS Public Dataset or through Mapzen’s Terrain Tiles API.
december 2017 by jm
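The tiles themselves are plain z/x/y PNGs in a public S3 bucket, so you don't even need an AWS account to read them. A rough Python sketch of pulling one Terrarium-encoded tile and decoding it to elevations; the bucket name (elevation-tiles-prod), URL layout and decoding formula are from my memory of the dataset docs, not from the post:

    # Fetch one Terrarium terrain tile from the public S3 bucket and decode
    # each pixel to elevation in metres.
    # Assumed: bucket/path layout and the formula
    #   elevation = (R * 256 + G + B / 256) - 32768
    import io
    import requests
    from PIL import Image

    def fetch_elevations(z, x, y):
        url = f"https://s3.amazonaws.com/elevation-tiles-prod/terrarium/{z}/{x}/{y}.png"
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
        px = img.load()
        return [[(px[i, j][0] * 256 + px[i, j][1] + px[i, j][2] / 256) - 32768
                 for i in range(img.width)]
                for j in range(img.height)]

    if __name__ == "__main__":
        tile = fetch_elevations(10, 906, 404)   # arbitrary z/x/y
        print(len(tile), "rows; sample elevation:", tile[0][0])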
Nextflow - A DSL for parallel and scalable computational pipelines
GPLv3 licensed, open source
computation
workflows
pipelines
batch
docker
ops
open-source
Data-driven computational pipelines
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
august 2017 by jm
Scheduled Tasks (cron) - Amazon EC2 Container Service
ECS now does cron jobs via scheduled tasks. But where does AWS Batch fit in? Confusing.
aws
batch
ecs
cron
scheduling
recurrence
ops
july 2017 by jm
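The feature appears to be CloudWatch Events scheduled rules with an ECS task as the target. A rough boto3 sketch, with every name, ARN and the schedule made up:

    # Sketch: run an ECS task on a cron schedule via a CloudWatch Events rule.
    # Every name, ARN and the schedule expression below is a placeholder.
    import boto3

    events = boto3.client("events")

    events.put_rule(
        Name="nightly-report",
        ScheduleExpression="cron(0 2 * * ? *)",   # 02:00 UTC daily
        State="ENABLED",
    )

    events.put_targets(
        Rule="nightly-report",
        Targets=[{
            "Id": "nightly-report-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/batch-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/report:1",
                "TaskCount": 1,
            },
        }],
    )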
Best practices with Airflow
interesting presentation describing how to architect Airflow ETL setups; see also https://gtoonstra.github.io/etl-with-airflow/principles.html
etl
airflow
batch
architecture
systems
ops
october 2016 by jm
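The principles boil down to idempotent, rerunnable tasks partitioned by execution date; a minimal sketch of the kind of DAG that results (scripts, buckets and table names made up, Airflow 1.x-style API):

    # Sketch of a DAG following those principles: every run is scoped to its
    # execution date ({{ ds }}), so re-runs and backfills are safe.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="daily_events_etl",
        start_date=datetime(2016, 10, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(
        task_id="extract",
        bash_command=("extract_events.sh --date {{ ds }} "
                      "--out s3://my-bucket/raw/events/dt={{ ds }}/"),
        dag=dag,
    )

    load = BashOperator(
        task_id="load",
        # Overwrites exactly one partition, so re-running a day is idempotent.
        bash_command=("load_events.sh --in s3://my-bucket/raw/events/dt={{ ds }}/ "
                      "--table analytics.events --partition dt={{ ds }}"),
        dag=dag,
    )

    extract.set_downstream(load)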
ETL best practices with Airflow
good advice on how to ETL
etl
airflow
documentation
best-practices
batch
architecture
october 2016 by jm
Submitting User Applications with spark-submit - AWS Big Data Blog
looks reasonably usable, although EMR's crappy UI is still an issue
emr
big-data
spark
hadoop
yarn
map-reduce
batch
february 2016 by jm
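The approach, as I understand it, is to submit the job as an EMR step running spark-submit through command-runner.jar rather than SSHing into the master. A boto3 sketch with the cluster id, class name and S3 paths made up:

    # Sketch: add a spark-submit step to a running EMR cluster.
    # Cluster id, class name and S3 paths are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "my-spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--class", "com.example.MyJob",
                    "s3://my-bucket/jars/my-job.jar",
                    "s3://my-bucket/input/", "s3://my-bucket/output/",
                ],
            },
        }],
    )
    print("step ids:", response["StepIds"])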
The world beyond batch: Streaming 101 - O'Reilly Media
streaming
batch
big-data
lambda-architecture
dataflow
event-processing
cep
millwheel
data
data-processing
To summarize, in this post I’ve:
Clarified terminology, specifically narrowing the definition of “streaming” to apply to execution engines only, while using more descriptive terms like unbounded data and approximate/speculative results for distinct concepts often categorized under the “streaming” umbrella.
Assessed the relative capabilities of well-designed batch and streaming systems, positing that streaming is in fact a strict superset of batch, and that notions like the Lambda Architecture, which are predicated on streaming being inferior to batch, are destined for retirement as streaming systems mature.
Proposed two high-level concepts necessary for streaming systems to both catch up to and ultimately surpass batch, those being correctness and tools for reasoning about time, respectively.
Established the important differences between event time and processing time, characterized the difficulties those differences impose when analyzing data in the context of when they occurred, and proposed a shift in approach away from notions of completeness and toward simply adapting to changes in data over time.
Looked at the major data processing approaches in common use today for bounded and unbounded data, via both batch and streaming engines, roughly categorizing the unbounded approaches into: time-agnostic, approximation, windowing by processing time, and windowing by event time.
august 2015 by jm
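The event-time vs. processing-time point is the one that sticks; a toy Python illustration (no particular engine implied) of how the same late event lands in different windows depending on which timestamp you window by:

    # A late-arriving event (happened 10 minutes ago, seen only now) is counted
    # in the current window under processing-time windowing, but credited to
    # the window it actually belongs to under event-time windowing.
    import time
    from collections import defaultdict

    WINDOW = 60  # one-minute windows

    def window_start(ts):
        return int(ts) - int(ts) % WINDOW

    proc_windows = defaultdict(int)    # keyed by when we saw the event
    event_windows = defaultdict(int)   # keyed by when the event happened

    def handle(event):
        now = time.time()
        proc_windows[window_start(now)] += 1
        event_windows[window_start(event["event_time"])] += 1

    handle({"user": "a", "event_time": time.time() - 600})   # late event
    handle({"user": "b", "event_time": time.time()})
    print(dict(proc_windows))
    print(dict(event_windows))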
Discretized Streams: Fault Tolerant Stream Computing at Scale
The paper describing the innards of Spark Streaming and its RDD-based recomputation algorithm:
rdd
spark
streaming
fault-tolerance
batch
distcomp
papers
big-data
scalability
we use a data structure called Resilient Distributed Datasets (RDDs), which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.
june 2015 by jm
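For reference, the model as it surfaces in the API: each micro-batch becomes an RDD, recomputed from lineage on failure. A minimal PySpark word-count sketch over one-second batches (host/port made up):

    # Minimal Spark Streaming (D-Stream) sketch: 1-second micro-batches of
    # lines from a socket, counted per batch. Host/port are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dstream-wordcount")
    ssc = StreamingContext(sc, 1)          # 1-second batch interval

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()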
Everyday I'm Shuffling - Tips for Writing Better Spark Programs [slides]
Two Spark experts from Databricks provide some good tips
spark
performance
batch
ops
tips
slides
emr
february 2015 by jm
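One classic tip in this vein (my paraphrase, not a quote from the deck): prefer reduceByKey over groupByKey so values are combined map-side before the shuffle:

    # groupByKey ships every (key, value) pair across the network;
    # reduceByKey combines values within each partition first, so the
    # shuffle carries only partial sums.
    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-tips")
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    sums_slow = pairs.groupByKey().mapValues(sum)          # big shuffle
    sums_fast = pairs.reduceByKey(lambda a, b: a + b)      # small shuffle

    print(sums_fast.collect())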
Are you better off running your big-data batch system off your laptop?
Heh, nice trolling.
graph
coding
hadoop
spark
giraph
graph-processing
hardware
scalability
big-data
batch
algorithms
pagerank
Here are two helpful guidelines (for largely disjoint populations):
If you are going to use a big data system for yourself, see if it is faster than your laptop.
If you are going to build a big data system for others, see that it is faster than my laptop. [...]
We think everyone should have to do this, because it leads to better systems and better research.
january 2015 by jm
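The benchmark workload in the post is PageRank run single-threaded on a laptop; a toy Python version of that kind of loop (not the post's code, and it ignores dangling-node mass):

    # Toy single-machine PageRank over a dict of adjacency lists.
    def pagerank(graph, damping=0.85, iterations=20):
        n = len(graph)
        rank = {node: 1.0 / n for node in graph}
        for _ in range(iterations):
            new_rank = {node: (1.0 - damping) / n for node in graph}
            for node, out_edges in graph.items():
                if not out_edges:
                    continue                     # dangling mass ignored here
                share = damping * rank[node] / len(out_edges)
                for dest in out_edges:
                    new_rank[dest] += share
            rank = new_rank
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(graph))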
Spark 1.2 released
This is the version with the superfast petabyte-sort record:
spark
sorting
hadoop
map-reduce
batch
databricks
apache
netty
Spark 1.2 includes several cross-cutting optimizations focused on performance for large scale workloads. Two new features Databricks developed for our world record petabyte sort with Spark are turned on by default in Spark 1.2. The first is a re-architected network transfer subsystem that exploits Netty 4’s zero-copy IO and off heap buffer management. The second is Spark’s sort based shuffle implementation, which we’ve now made the default after significant testing in Spark 1.1. Together, we’ve seen these features give as much as 5X performance improvement for workloads with very large shuffles.
december 2014 by jm
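Since the sort-based shuffle is now the default there's nothing to switch on; to compare against the old path, the shuffle implementation can still be pinned via config (the key name is from memory, not the release notes):

    # Sketch: pinning the shuffle implementation in Spark 1.2 via SparkConf.
    # "sort" is already the default; "hash" restores the pre-1.2 behaviour.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("shuffle-config")
            .set("spark.shuffle.manager", "sort"))   # or "hash"
    sc = SparkContext(conf=conf)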
Spark Breaks Previous Large-Scale Sort Record – Databricks
Massive improvement over plain old Hadoop. This blog post goes into really solid techie reasons why, including:
spark
hadoop
map-reduce
batch
parallel
sr-iov
benchmarks
performance
netty
shuffle
algorithms
sort-based-shuffle
timsort
First and foremost, in Spark 1.1 we introduced a new shuffle implementation called sort-based shuffle (SPARK-2045). The previous Spark shuffle implementation was hash-based that required maintaining P (the number of reduce partitions) concurrent buffers in memory. In sort-based shuffle, at any given point only a single buffer is required. This has led to substantial memory overhead reduction during shuffle and can support workloads with hundreds of thousands of tasks in a single stage (our PB sort used 250,000 tasks).
Also, use of Timsort, an external shuffle service to offload from the JVM, Netty, and EC2 SR-IOV.
october 2014 by jm
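The memory arithmetic behind that single-buffer point is worth spelling out; a back-of-the-envelope sketch, where the 32 KB per-buffer figure is purely an assumed number for scale:

    # Why P concurrent buffers per map task hurts: with P = 250,000 reduce
    # partitions and an assumed 32 KB per open buffer, hash-based shuffle
    # needs ~7.6 GiB per map task; sort-based shuffle needs one buffer.
    reduce_partitions = 250000          # P, from the petabyte-sort run
    buffer_size = 32 * 1024             # bytes per buffer (assumed figure)

    hash_based = reduce_partitions * buffer_size
    sort_based = buffer_size

    print("hash-based: %.1f GiB per map task" % (hash_based / 2.0 ** 30))
    print("sort-based: %d KiB per map task" % (sort_based / 2 ** 10))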
Questioning the Lambda Architecture
Jay Kreps (Kafka, Samza) with a thought-provoking post on the batch/stream-processing dichotomy
jay-kreps
toread
architecture
data
stream-processing
batch
hadoop
storm
lambda-architecture
july 2014 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event processor-style architecture for internal metrics. It's notable that now Twitter and Google are both apparently moving towards this idea of a model of code which is designed to run equally in realtime streaming and batch modes (Summingbird, Millwheel, Flume).
analytics
architecture
twitter
tsar
aggregation
event-processing
metrics
streaming
hadoop
batch
june 2014 by jm
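The gist of the write-once idea, sketched in Python rather than Summingbird's Scala (purely illustrative, nothing to do with TSAR's internals): express the aggregation as an associative, incremental reduce and feed it either a bounded dataset or micro-batches as they arrive:

    # The same aggregation function serves a batch backfill and a streaming
    # consumer loop; only the driver differs.
    from collections import defaultdict

    def count_by_key(events, totals=None):
        totals = totals if totals is not None else defaultdict(int)
        for event in events:
            totals[event["page"]] += 1
        return totals

    # Batch mode: one pass over historical data.
    history = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
    totals = count_by_key(history)

    # Streaming mode: the same function applied to micro-batches as they arrive.
    for micro_batch in ([{"page": "/home"}], [{"page": "/about"}]):
        totals = count_by_key(micro_batch, totals)

    print(dict(totals))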
Netflix: Your Linux AMI: optimization and performance [slides]
a fantastic bunch of low-level kernel tweaks and tunables which Netflix have found useful in production to maximise the productivity of their fleet. Interesting use of the SCHED_BATCH process scheduler class for batch processes, in particular. Also, great docs on their experience with perf and SystemTap. Perf really looks like a tool I need to get to grips with...
netflix
aws
tuning
ami
perf
systemtap
tunables
sched_batch
batch
hadoop
optimization
performance
december 2013 by jm
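SCHED_BATCH is settable without going near the C API: launch the job with 'chrt -b 0 <command>', or from Python on Linux:

    # Put the current (CPU-bound, non-interactive) process into the
    # SCHED_BATCH scheduling class. Linux-only; equivalent to chrt -b 0.
    import os

    os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
    print("policy:", os.sched_getscheduler(0))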
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
scalability
scaling
voldemort
hadoop
batch
algorithms
histograms
statistics
bucketing
percentiles
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
february 2013 by jm
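Roughly what a random-restart hill climber for histogram boundaries looks like, in case the phrase is opaque (my own toy version, nothing to do with the BBVA code): pick K-1 bucket boundaries, nudge one boundary at a time while the within-bucket error keeps dropping, and restart from random boundaries a few times, keeping the best result:

    # Toy random-restart hill climbing for histogram bucket boundaries,
    # minimising total within-bucket squared error over sorted values.
    import random

    def bucket_error(values, bounds):
        # Sum of squared deviations from each bucket's mean.
        err, edges = 0.0, [0] + list(bounds) + [len(values)]
        for lo, hi in zip(edges, edges[1:]):
            bucket = values[lo:hi]
            mean = sum(bucket) / len(bucket)
            err += sum((v - mean) ** 2 for v in bucket)
        return err

    def neighbours(bounds, n):
        # Move one boundary up or down by one position, keeping them valid.
        for i in range(len(bounds)):
            for delta in (-1, 1):
                cand = bounds[:i] + [bounds[i] + delta] + bounds[i + 1:]
                if all(a < b for a, b in zip([0] + cand, cand + [n])):
                    yield cand

    def optimal_histogram(values, k, restarts=20):
        values, n = sorted(values), len(values)
        best, best_err = None, float("inf")
        for _ in range(restarts):                      # random restarts
            bounds = sorted(random.sample(range(1, n), k - 1))
            err = bucket_error(values, bounds)
            while True:                                # greedy hill climbing
                cands = list(neighbours(bounds, n))
                if not cands:
                    break
                cand = min(cands, key=lambda b: bucket_error(values, b))
                cand_err = bucket_error(values, cand)
                if cand_err >= err:
                    break
                bounds, err = cand, cand_err
            if err < best_err:
                best, best_err = bounds, err
        return [values[i] for i in best]               # boundary values

    print(optimal_histogram([1, 2, 2, 3, 50, 51, 52, 200, 201], k=3))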
How to beat the CAP theorem
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz
realtime
batch
hadoop
storm
big-data
cap
october 2011 by jm
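The core of the dual-stack idea, in miniature: queries merge a periodically recomputed batch view with a small realtime delta covering everything since the last batch run. A toy sketch (not Marz's code):

    # Batch view: recomputed from the master dataset every few hours (Hadoop).
    # Realtime view: increments for events since the last batch run (Storm).
    # Reads merge the two; when a new batch view lands, the old realtime
    # delta for that period is discarded.
    from collections import Counter

    batch_view = Counter({"page:/home": 10000, "page:/about": 2000})
    realtime_view = Counter()

    def record_event(key):
        realtime_view[key] += 1

    def query(key):
        return batch_view[key] + realtime_view[key]

    record_event("page:/home")
    print(query("page:/home"))   # 10001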