Spark on EMR

Running Spark apps on EMR and EC2 Spot Instances

Previously I explained the advantages of using EC2 Spot Instances (a pricing model for spare compute capacity) to lower the cost of running data engineering workloads (ETL, log analysis, ML, web indexing, financial or statistical analysis, etc.) on Amazon EMR. Now I'll demonstrate practically how I use them to run Spark applications on a daily basis.

Meeting technical specifications

Steps

Step 1 - Launch cluster

For transient clusters, EMR has the "Auto-terminate cluster after the last step is completed" option, which is great for periodic processing tasks such as daily data processing runs, event-driven ETL workloads, etc.
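As a minimal sketch of launching such a transient cluster with boto3 (the cluster name, release label, instance types, region, and S3 script path are all placeholder assumptions), the key part is `KeepJobFlowAliveWhenNoSteps=False`, which is the API equivalent of the auto-terminate option:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Transient cluster: shuts itself down after the last step completes.
response = emr.run_job_flow(
    Name="daily-spark-etl",            # placeholder name
    ReleaseLabel="emr-5.29.0",         # use whatever release you actually run
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after last step
    },
    Steps=[{
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/etl.py"],  # placeholder script path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```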

Step 2 - Instance group configuration

Step 3 - EMR Master, Core and Task nodes setup
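No single layout is mandated here, but a common pattern is to keep the master node (and the core nodes, which hold HDFS data) On-Demand and put the stateless task nodes on Spot, so a Spot interruption costs throughput rather than the whole cluster. A sketch of that split as instance groups, with names, types, and counts as assumptions:

```python
# Hypothetical node layout: On-Demand master, On-Demand core for HDFS,
# and Spot task nodes that only run executors.
instance_groups = [
    {
        "Name": "master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 1,
    },
    {
        "Name": "core",
        "InstanceRole": "CORE",
        "Market": "ON_DEMAND",
        "InstanceType": "r5.xlarge",
        "InstanceCount": 2,
    },
    {
        "Name": "task",
        "InstanceRole": "TASK",
        "Market": "SPOT",
        "InstanceType": "r5.xlarge",
        "InstanceCount": 4,
    },
]
```

This list would plug into the launch call above as Instances={"InstanceGroups": instance_groups, ...}.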

Step 4 - Using tags
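Tags are simple key/value pairs supplied at launch, and EMR propagates them to the underlying EC2 instances, which makes them handy for cost-allocation reports and automation. The keys and values below are purely illustrative assumptions:

```python
# Illustrative tags; EMR propagates them to every EC2 instance it launches.
tags = [
    {"Key": "team",        "Value": "data-engineering"},
    {"Key": "environment", "Value": "production"},
    {"Key": "workload",    "Value": "daily-spark-etl"},
]
# Passed as Tags=tags to emr.run_job_flow(...).
```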

Step 5 - Fleet configuration options

Let's use "Max-price" field for our Spot requests to take advantage of Maximum spot price, best practice is to leave 100% of the On-demand price.

Set the "Total units" to the number of vCPUs we want our cluster to run, so EMR can select the best combination of instances while respecting the target capacity.
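In the API, "Total units" is the fleet's TargetSpotCapacity (or TargetOnDemandCapacity), and each instance type counts toward it by its WeightedCapacity, typically its vCPU count. A sketch with assumed types and a target of 32 units:

```python
# Hypothetical core fleet targeting 32 units (vCPUs): EMR mixes the
# listed types until their weights add up to the target capacity.
core_fleet = {
    "Name": "core-fleet",
    "InstanceFleetType": "CORE",
    "TargetSpotCapacity": 32,  # "Total units" == desired vCPUs
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 4},  # 4 vCPUs
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 8},  # 8 vCPUs
        {"InstanceType": "r4.2xlarge", "WeightedCapacity": 8},
    ],
}
```

EMR could satisfy the 32 units with eight r5.xlarge, four r5.2xlarge, or any mix, whichever combination it finds best.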

Play with "Defined duration" to run instances on Spot Blocks (uninterrupted Spot Instances), available for 1 to 6 hours at a lower discount compared to regular Spot Instances.
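A defined duration maps to BlockDurationMinutes in the fleet's Spot specification, valid from 60 to 360 minutes in one-hour increments; a sketch requesting a two-hour block:

```python
# Spot Block: request uninterrupted Spot capacity for 2 hours.
launch_specifications = {
    "SpotSpecification": {
        "BlockDurationMinutes": 120,   # 60-360, in 60-minute increments
        "TimeoutDurationMinutes": 20,  # required alongside the block duration
        "TimeoutAction": "TERMINATE_CLUSTER",
    }
}
```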

The provisioning timeout determines how long EMR will keep attempting to provision our selected Spot Instances when capacity is lacking, before either terminating the cluster or starting On-Demand instances instead.
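The timeout and its fallback action live in the same Spot specification, as TimeoutDurationMinutes and TimeoutAction; the 30-minute value below is an assumption:

```python
# If Spot capacity can't be provisioned within 30 minutes,
# fall back to On-Demand instead of terminating the cluster.
launch_specifications = {
    "SpotSpecification": {
        "TimeoutDurationMinutes": 30,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or "TERMINATE_CLUSTER"
    }
}
```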