Reducing Amazon EMR Costs

Managing the Cost of Big Data Cloud Workloads

If you regularly work with Amazon EMR, especially on projects that need Hadoop, Spark, HBase, Presto, Flink, or any other distributed framework, you know that costs can scale up pretty quickly.

So I'll share a configuration from a recent project that helped reduce costs while giving us more compute capacity and shorter processing times for large data sets.

What we'll use

The best results come from choosing Amazon EC2 Spot Instances and tuning a few cluster configuration options, such as:

Instance fleets for instance diversification (up to five instance types per fleet). A best practice is to choose types with similar vCPU-to-memory ratios.

Different On-Demand and Spot target capacities (units) per fleet for higher cluster capacity. Each instance type counts a number of units toward the target, typically set to match its number of vCores.

For example, fleets running heavier, long-lived workloads deserve higher On-Demand and Spot targets, while fleets running more ephemeral tasks can get by with a small number of Spot units and no On-Demand units at all, which is where most of the savings come from (see the sketch below).
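Here's a minimal sketch of what that fleet setup can look like; the instance types, weights, and target capacities below are assumptions you'd adjust to your own workload.

```python
# A minimal sketch of two diversified instance fleets, assuming r-family
# types that share the same 1:8 vCPU-to-memory ratio.

# Core fleet: WeightedCapacity mirrors each type's vCPU count, so the
# fleet reaches its target capacity whichever types Spot can supply.
core_fleet = {
    "Name": "CoreFleet",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 8,    # baseline capacity kept On-Demand
    "TargetSpotCapacity": 32,       # bulk of the capacity on Spot
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 4},
        {"InstanceType": "r5a.xlarge", "WeightedCapacity": 4},
        {"InstanceType": "r4.xlarge",  "WeightedCapacity": 4},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 8},
        {"InstanceType": "r4.2xlarge", "WeightedCapacity": 8},
    ],
}

# Task fleet: short-lived, fault-tolerant work runs on Spot only, with
# no On-Demand units, which is where most of the savings come from.
task_fleet = {
    "Name": "TaskFleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,
    "TargetSpotCapacity": 48,
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 4},
        {"InstanceType": "r5a.xlarge", "WeightedCapacity": 4},
        {"InstanceType": "r5.2xlarge", "WeightedCapacity": 8},
    ],
}
```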

I also like to experiment with two cluster-behavior settings, defined duration and provisioning timeout, which help Amazon EMR provision Spot capacity for instance fleets (sketched below).
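And here's how those fleets could be wired into a cluster request with boto3, continuing the sketch above. The defined duration (BlockDurationMinutes) and provisioning timeout values, the release label, roles, and subnet IDs are all placeholder assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-2")  # US East (Ohio)

# The master node stays on On-Demand so the cluster itself survives
# Spot interruptions.
master_fleet = {
    "Name": "MasterFleet",
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 1,
    "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"}],
}

# Spot provisioning behavior shared by the core and task fleets.
spot_launch_spec = {
    "SpotSpecification": {
        # Defined duration: request Spot blocks that run uninterrupted
        # for this many minutes (multiples of 60, up to 360).
        "BlockDurationMinutes": 120,
        # Provisioning timeout: how long EMR waits for Spot capacity
        # before falling back to On-Demand.
        "TimeoutDurationMinutes": 20,
        "TimeoutAction": "SWITCH_TO_ON_DEMAND",
    }
}

response = emr.run_job_flow(
    Name="spot-fleet-cluster",            # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",            # assumed EMR release
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        # Give the fleets several subnets (AZs) to choose from.
        "Ec2SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceFleets": [
            master_fleet,
            # core_fleet and task_fleet come from the previous sketch.
            {**core_fleet, "LaunchSpecifications": spot_launch_spec},
            {**task_fleet, "LaunchSpecifications": spot_launch_spec},
        ],
    },
)
print("Cluster id:", response["JobFlowId"])
```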

Look at the image below, where we could get five Spot instance types with EMR support at up to an 88% discount over On-Demand, in the US East (Ohio) region on Linux/Unix, with savings calculated over the last 30 days and averaged across Availability Zones.