jm + emr   12

Spark memory tuning on EMR
'Best practices for successfully managing memory for Apache Spark applications on Amazon EMR', on the AWS Big Data blog.

'In this blog post, I detailed the possible out-of-memory errors, their causes, and a list of best practices to prevent these errors when submitting a Spark application on Amazon EMR.

My colleagues and I formed these best practices after thorough research and understanding of various Spark configuration properties and testing multiple Spark applications. These best practices apply to most of out-of-memory scenarios, though there might be some rare scenarios where they don’t apply. However, we believe that this blog post provides all the details needed so you can tweak parameters and successfully run a Spark application.'
spark  emr  aws  tuning  memory  ooms  java 
11 weeks ago by jm
Submitting User Applications with spark-submit - AWS Big Data Blog
looks reasonably usable, although EMR's crappy UI is still an issue
emr  big-data  spark  hadoop  yarn  map-reduce  batch 
february 2016 by jm
Can Spark Streaming survive Chaos Monkey?
good empirical results on Spark's resilience to network/host outages in EC2
ec2  aws  emr  spark  resilience  ha  fault-tolerance  chaos-monkey  netflix 
march 2015 by jm
Elastic MapReduce vs S3
Turns out there are a few bugs in EMR's S3 support, believe it or not.

1. 'Consider disabling Hadoop's speculative execution feature if your cluster is experiencing Amazon S3 concurrency issues. You do this through the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution configuration settings. This is also useful when you are troubleshooting a slow cluster.'

2. Upgrade to AMI 3.1.0 or later, otherwise retries of S3 ops don't work.
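
For item 1, a minimal sketch of turning both settings off when launching a job. The two property names come from the quoted advice; the helper function is hypothetical, and it assumes the job accepts Hadoop's generic `-D property=value` options (i.e. it runs through ToolRunner/GenericOptionsParser).

```python
# Property names are taken from the quoted advice above.
SPECULATION_PROPERTIES = (
    "mapred.map.tasks.speculative.execution",
    "mapred.reduce.tasks.speculative.execution",
)


def speculation_off_args():
    """Hypothetical helper: build generic-option arguments that set
    both speculative-execution properties to false."""
    args = []
    for name in SPECULATION_PROPERTIES:
        # Assumes the job parses Hadoop generic options (-D property=value).
        args += ["-D", f"{name}=false"]
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted into a job's argument list.
    print(" ".join(speculation_off_args()))
```

The resulting arguments would go after the job's main class or jar name and before any job-specific arguments; whether a given EMR step passes them through unchanged is something to check for your setup.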
s3  emr  hadoop  aws  bugs  speculative-execution  ops 
october 2014 by jm
Inviso: Visualizing Hadoop Performance
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  


This sounds pretty useful.
inviso  netflix  hadoop  emr  performance  ops  tools 
september 2014 by jm
Profiling Hadoop jobs with Riemann
I’ve built a very simple distributed profiler for soft-real-time telemetry from hundreds to thousands of JVMs concurrently. It’s nowhere near as comprehensive in its analysis as, say, Yourkit, but it can tell you, across a distributed system, which functions are taking the most time, and what their dominant callers are.


Potentially useful.
riemann  profiling  aphyr  hadoop  emr  performance  monitoring 
august 2014 by jm
Luigi
A really excellent-looking workflow/orchestration engine for Hadoop, Pig, Hive, Redshift and other ETL jobs, featuring inter-job dependencies, cron-like scheduling, and failure handling. Open source, from Spotify
workflow  orchestration  scheduling  cron  spotify  open-source  luigi  redshift  pig  hive  hadoop  emr  jobs  make  dependencies 
july 2014 by jm
