Spark on EMR

Running Spark apps on EMR and EC2 Spot Instances

Previously I explained the advantages of using EC2 Spot Instances (a pricing model for spare compute capacity) to lower the cost of running data engineering workloads (ETL, log analysis, ML, web indexing, financial or statistical analysis, etc.) on Amazon EMR. Now I'll demonstrate practically how I use them to run Spark applications on a daily basis.

Meeting technical specifications

Steps

Step 1 - Launch cluster

For transient clusters, EMR has the "Auto-terminate cluster after the last step is completed" option, which is great for periodic processing tasks such as daily data processing runs, event-driven ETL workloads, etc.
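As a minimal sketch of launching such a transient cluster with boto3 (the cluster name, release label, instance types, region, and S3 script path are all placeholder assumptions), the key part is `KeepJobFlowAliveWhenNoSteps=False`, which is the API equivalent of the auto-terminate option:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Transient cluster: shuts itself down after the last step completes.
response = emr.run_job_flow(
    Name="daily-spark-etl",            # placeholder name
    ReleaseLabel="emr-5.29.0",         # use whatever release you actually run
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after last step
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/etl.py"],  # placeholder script path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```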

Step 2 - Instance group configuration

Step 3 - EMR Master, Core and Task nodes setup
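No single layout is mandated here, but a common pattern is to keep the master node (and the core nodes, which hold HDFS data) On-Demand and put the stateless task nodes on Spot, so a Spot interruption costs throughput rather than the whole cluster. A sketch of that split as instance groups, with names, types, and counts as assumptions:

```python
# Hypothetical node layout: On-Demand master, On-Demand core for HDFS,
# and Spot task nodes that only run executors.
instance_groups = [
    {
        "Name": "master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
    },
    {
        "Name": "core",
        "InstanceRole": "CORE",
        "Market": "ON_DEMAND",
        "InstanceType": "r5.xlarge",
        "InstanceCount": 2,
    },
    {
        "Name": "task",
        "InstanceRole": "TASK",
        "Market": "SPOT",
        "InstanceType": "r5.xlarge",
        "InstanceCount": 4,
    },
]
```

This list would plug into the launch call above as Instances={"InstanceGroups": instance_groups, ...}.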

Step 4 - Using tags
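Tags are simple key/value pairs supplied at launch, and EMR propagates them to the underlying EC2 instances, which makes them handy for cost-allocation reports and automation. The keys and values below are purely illustrative assumptions:

```python
# Illustrative tags; EMR propagates them to every EC2 instance it launches.
tags = [
    {"Key": "team",        "Value": "data-engineering"},
    {"Key": "environment", "Value": "production"},
    {"Key": "workload",    "Value": "daily-spark-etl"},
]
# Passed as Tags=tags to emr.run_job_flow(...).
```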

Step 5 - Fleet configuration options

Let's use "Max-price" field for our Spot requests to take advantage of Maximum spot price, best practice is to leave 100% of the On-demand price.

Set the "Total units" to the number of vCPUs we want our cluster to run, so EMR can select the best combination of instances while respecting the target capacity.
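In the API, "Total units" is the fleet's TargetSpotCapacity (or TargetOnDemandCapacity), and each instance type counts toward it by its WeightedCapacity, typically its vCPU count. A sketch with assumed types and a target of 32 units:

```python
# Hypothetical core fleet targeting 32 units (vCPUs): EMR mixes the
# listed types until their weights add up to the target capacity.
core_fleet = {
    "Name": "core-fleet",
    "InstanceFleetType": "CORE",
    "TargetSpotCapacity": 32,  # "Total units" == desired vCPUs
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 4},  # 4 vCPUs
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 8},  # 8 vCPUs
        {"InstanceType": "r4.2xlarge", "WeightedCapacity": 8},
    ],
}
```

EMR could satisfy the 32 units with eight r5.xlarge, four r5.2xlarge, or any mix, whichever combination it finds best.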

Play with "Defined duration" to run instances on Spot Blocks (uninterrupted Spot Instances), available for 1 to 6 hours at a lower discount compared to regular Spot Instances.
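A defined duration maps to BlockDurationMinutes in the fleet's Spot specification, valid from 60 to 360 minutes in one-hour increments; a sketch requesting a two-hour block:

```python
# Spot Block: request uninterrupted Spot capacity for 2 hours.
launch_specifications = {
    "SpotSpecification": {
        "BlockDurationMinutes": 120,   # 60-360, in 60-minute increments
        "TimeoutDurationMinutes": 20,  # required alongside the block duration
        "TimeoutAction": "TERMINATE_CLUSTER",
    }
}
```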

The provisioning timeout determines how long EMR will keep attempting to provision our selected Spot Instances when capacity is lacking, before either terminating the cluster or starting On-Demand instances instead.
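The timeout and its fallback action live in the same Spot specification, as TimeoutDurationMinutes and TimeoutAction; the 30-minute value below is an assumption:

```python
# If Spot capacity can't be provisioned within 30 minutes,
# fall back to On-Demand instead of terminating the cluster.
launch_specifications = {
    "SpotSpecification": {
        "TimeoutDurationMinutes": 30,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or "TERMINATE_CLUSTER"
    }
}
```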