Querying 347 Billion Records using EMR Spark: a Performance and Cost Analysis

As an application owner, it is increasingly important to be able to analyze large amounts of data. Thankfully, cloud-based Big Data analytics tools are a widely adopted way to achieve this, and AWS offers multiple solutions in this area, particularly the Elastic MapReduce (EMR) service. EMR delivers a managed way to deploy tools such as Hadoop, Spark, Hive and Presto, among others, on cloud-based compute infrastructure such as EC2 or EKS, or even on-premises servers.

In this article I will focus on Apache Spark, managed by the EMR service, running on EC2 instances.

Infrastructure and Data Setup

Below are the infrastructure and data specifications I used for these tests:

  • Test data compliant with the TPC-H benchmark specifications. Tests were executed against 10TB and 30TB datasets in Parquet format. The 10TB TPC-H dataset has close to 87 billion records, while the 30TB dataset has close to 260 billion records (for a total of 347 billion records queried in these tests).
  • Data was stored in S3 in the N. Virginia region.
  • EMR version 6.11.0 (Spark 3.3.2).
  • For EMR Spark to access the data, a metadata catalog is required. This was provided by a Hive Metastore running on EMR 6.11.0 (Hive 3.1.3), on an m5.large EC2 instance and backed by a db.t3.medium AWS RDS MySQL DB instance.
  • Tests were executed with clusters consisting of 1 primary node and a different number of executor nodes, depending on the test scenario, all running on r5.8xlarge EC2 instances. Below are the cluster sizes used for these tests:
    • 1 Coordinator (r5.8xlarge) + 40 executor nodes (r5.8xlarge)
    • 1 Coordinator (r5.8xlarge) + 100 executor nodes (r5.8xlarge)
  • For the 10TB tests, the default EBS configuration was applied: four 128 GB gp2 volumes per node (512 GB per node).
  • For the 30TB tests, there was 1 EBS volume attached to each node, with the following settings: VolumeType=gp3, SizeInGB=1000, Iops=3000, Throughput=512 MiB/s
  • Default Spark settings were used, with the EMR configuration maximizeResourceAllocation set to true.

EMR delivers an automated way to launch and manage the compute infrastructure of data analytics clusters. Cluster launch and management can be automated using the AWS CLI, the SDK, or tools such as CloudFormation; a minimal SDK-based sketch of launching a cluster like the ones used here is shown below.
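For illustration, here is a minimal sketch of how a comparable cluster could be launched with the AWS SDK for Python (boto3). The subnet ID and metastore host name are placeholders, and IAM roles, security groups and other production settings are simplified:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")   # N. Virginia

response = emr.run_job_flow(
    Name="tpch-spark-benchmark",
    ReleaseLabel="emr-6.11.0",                        # Spark 3.3.2
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Configurations=[
        # Let EMR size the Spark driver and executors to the instance type.
        {"Classification": "spark",
         "Properties": {"maximizeResourceAllocation": "true"}},
        # Point Spark at the external Hive Metastore (placeholder host name).
        {"Classification": "spark-hive-site",
         "Properties": {"hive.metastore.uris": "thrift://metastore.example.internal:9083"}},
    ],
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",    # placeholder subnet
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "r5.8xlarge", "InstanceCount": 1},
            {"Name": "executors", "InstanceRole": "CORE",
             "InstanceType": "r5.8xlarge", "InstanceCount": 40,
             # EBS settings used for the 30TB tests.
             "EbsConfiguration": {"EbsBlockDeviceConfigs": [{
                 "VolumeSpecification": {"VolumeType": "gp3", "SizeInGB": 1000,
                                         "Iops": 3000, "Throughput": 512},
                 "VolumesPerInstance": 1}]}},
        ],
    },
)
print("Cluster ID:", response["JobFlowId"])
```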

Test Scenarios and Execution

Each test ran the 22 queries defined in the TPC-H benchmark. The full set of 22 queries was executed 3 times, sequentially (queries 1 through 22, repeated 3 times). For each query, the average of the 3 executions is the value displayed in the tables and charts in the Test Results section below.
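The article does not prescribe a specific test harness, but a minimal PySpark sketch of this execution loop could look as follows, assuming the 22 queries are stored as single-statement files q01.sql through q22.sql (a hypothetical layout) and the TPC-H tables are registered in the Hive Metastore:

```python
import time
from pyspark.sql import SparkSession

# Hive support lets Spark resolve the TPC-H tables through the Hive Metastore.
spark = (SparkSession.builder
         .appName("tpch-benchmark")
         .enableHiveSupport()
         .getOrCreate())

RUNS = 3
timings = {}                                  # query id -> list of durations (s)

for run in range(RUNS):
    for qid in range(1, 23):                  # queries 1 through 22, sequentially
        with open(f"queries/q{qid:02d}.sql") as f:   # hypothetical file layout
            query = f.read()
        start = time.time()
        spark.sql(query).collect()            # force full execution of the query
        timings.setdefault(qid, []).append(time.time() - start)

for qid in sorted(timings):
    avg = sum(timings[qid]) / len(timings[qid])
    print(f"q{qid:02d}: average {avg:.0f}s over {RUNS} runs")
```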

The following test scenarios were executed:

  • 10TB - 1 primary node + 40 executor nodes
  • 10TB - 1 primary node + 100 executor nodes
  • 30TB - 1 primary node + 40 executor nodes
  • 30TB - 1 primary node + 100 executor nodes

Test Results

All queries completed successfully for the scenarios in scope. Below are the results for the 10TB scenarios:

TPC-H results - 40 workers

For the 10TB, 40-executor scenario, the average query execution time was 158 seconds.

TPC-H results - 100 workers

For the 10TB, 100-executor scenario, the average query execution time was 93 seconds (roughly 59% of the average execution time with 40 executors).

Below are the results for the 30TB scenarios:

TPC-H results - 40 workers

For the 30TB, 40-executor scenario, the average query execution time was 501 seconds (3.17x the average of the 10TB scenario with the same number of nodes).

TPC-H results - 100 workers

For the 30TB, 100-executor scenario, the average query execution time was 217 seconds (roughly 43% of the average execution time with 40 executors). The 30TB dataset took 2.33x as long to query as the 10TB dataset when using 100 executor nodes.

TPC-H results - 40 vs 100 workers


The table below displays the individual query execution times, in seconds, per scenario (a short sanity check of the Average and Total rows follows the table):

QID            10TB - 40 executors   10TB - 100 executors   30TB - 40 executors   30TB - 100 executors
q01            108                   62                     193                   91
q02            62                    51                     118                   81
q03            140                   70                     480                   231
q04            208                   120                    381                   217
q05            246                   139                    888                   362
q06            83                    45                     140                   66
q07            137                   84                     341                   169
q08            184                   97                     457                   242
q09            275                   139                    857                   362
q10            143                   76                     393                   171
q11            116                   109                    475                   319
q12            94                    64                     216                   103
q13            125                   78                     299                   169
q14            91                    56                     219                   98
q15            163                   82                     325                   146
q16            58                    50                     143                   95
q17            254                   149                    1167                  440
q18            245                   148                    1444                  429
q19            114                   62                     206                   95
q20            118                   68                     258                   140
q21            439                   233                    1841                  620
q22            75                    57                     189                   118
Average        158                   93                     501                   217
Total (mins)   58                    34                     184                   79
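As a quick sanity check, the Average and Total (mins) rows can be derived from the per-query averages; the sketch below does this for the 10TB, 40-executor column:

```python
# Reproduces the Average and Total rows of the table above from the
# per-query averages (values in seconds, 10TB - 40 executors column).
times_10tb_40 = [108, 62, 140, 208, 246, 83, 137, 184, 275, 143, 116,
                 94, 125, 91, 163, 58, 254, 245, 114, 118, 439, 75]

print(f"Average: {sum(times_10tb_40) / len(times_10tb_40):.0f}s")   # ~158
print(f"Total:   {sum(times_10tb_40) / 60:.0f} minutes")            # ~58
```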

Below is a comparison of the total average execution time of each scenario relative to the other scenarios. Each value shows the row scenario's execution time as a percentage of the column scenario's:

                        Relative to:
                        10TB - 40 executors   10TB - 100 executors   30TB - 40 executors   30TB - 100 executors
10TB - 40 executors     N/A                   171%                   32%                   73%
10TB - 100 executors    59%                   N/A                    18%                   43%
30TB - 40 executors     317%                  541%                   N/A                   232%
30TB - 100 executors    137%                  234%                   43%                   N/A


AWS Cost Analysis

The following tables compare the cost of running a cluster of 40 executors + 1 coordinator versus 100 executors + 1 coordinator, using the default EMR EBS configuration of 512 GB of gp2 storage per node, in the N. Virginia region. A short calculation reproducing the compute rows follows the two tables.

40 executors + 1 coordinator:

Component               Price Dimension                            Usage                                                  Hourly Cost   Monthly Cost
Data store              S3 Standard Storage                        2.7TB (Parquet 10TB) + 7.8TB (Parquet 30TB) = 10.5TB   $0.34         $242
Coordinator Node        EC2 Compute                                1 r5.8xlarge instance                                  $2.02         $1,452
Executor Nodes          EC2 Compute                                40 r5.8xlarge instances                                $80.64        $58,061
EMR Fee - Coordinator   $0.27/hr (r5.8xlarge)                      1 r5.8xlarge instance                                  $0.27         $194
EMR Fee - Executors     $0.27/hr (r5.8xlarge)                      40 r5.8xlarge instances                                $10.80        $7,776
Hive Metastore          EC2 Compute + EMR fee                      1 m5.large                                             $0.12         $86
RDS MySQL               Hive Metastore storage                     1 db.t3.medium                                         $0.07         $49
EBS Storage             EBS Volume Usage - gp2                     512GB x 41 = 20,992GB                                  $2.92         $2,099
Data Transfer           S3 to EC2 - Intra-regional Data Transfer   N/A                                                    $0.00         $0.00
Total                                                                                                                     $97.16        $69,959


100 executors + 1 coordinator:

Component               Price Dimension                            Usage                                                  Hourly Cost   Monthly Cost
Data store              S3 Standard Storage                        2.7TB (Parquet 10TB) + 7.8TB (Parquet 30TB) = 10.5TB   $0.34         $242
Coordinator Node        EC2 Compute                                1 r5.8xlarge instance                                  $2.02         $1,452
Executor Nodes          EC2 Compute                                100 r5.8xlarge instances                               $201.60       $145,152
EMR Fee - Coordinator   $0.27/hr (r5.8xlarge)                      1 r5.8xlarge instance                                  $0.27         $194
EMR Fee - Executors     $0.27/hr (r5.8xlarge)                      100 r5.8xlarge instances                               $27.00        $19,440
Hive Metastore          EC2 Compute + EMR fee                      1 m5.large                                             $0.12         $86
RDS MySQL               Hive Metastore storage                     1 db.t3.medium                                         $0.07         $49
EBS Storage             EBS Volume Usage - gp2                     512GB x 101 = 51,712GB                                 $7.18         $5,171
Data Transfer           S3 to EC2 - Intra-regional Data Transfer   N/A                                                    $0.00         $0.00
Total                                                                                                                     $238.59       $171,786
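As a rough check of the compute rows in both tables, the sketch below recomputes the hourly and monthly EC2 + EMR cost, assuming the unit prices implied by the tables: $2.016/hr per r5.8xlarge On Demand instance, a $0.27/hr EMR fee per instance, and a 720-hour month:

```python
# Recomputes the EC2 + EMR fee rows (coordinator + executors) from assumed unit prices.
EC2_HOURLY = 2.016        # r5.8xlarge On Demand, us-east-1 (assumed)
EMR_FEE = 0.27            # EMR fee per r5.8xlarge instance-hour (assumed)
HOURS_PER_MONTH = 720

for executors in (40, 100):
    nodes = executors + 1                     # executors + 1 coordinator
    hourly = nodes * (EC2_HOURLY + EMR_FEE)
    monthly = hourly * HOURS_PER_MONTH
    print(f"{executors} executors: ${hourly:,.2f}/hr, ${monthly:,.0f}/month (EC2 + EMR fee)")
```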

The 30TB tests used a different EBS configuration than the 10TB tests: instead of 512 GB of gp2 storage per node, each node had a single 1,000 GB gp3 volume with 3,000 IOPS and 512 MiB/s throughput. The resulting monthly cost is shown below, followed by a short calculation that reproduces it:

Cluster Size                    Storage (1,000 GB)   IOPS (3,000)   Throughput (512 MiB/s)   Total / month   Delta vs. default EBS configuration
40 executors + 1 coordinator    $3,280               $615           $840                     $4,735          $2,635
100 executors + 1 coordinator   $8,080               $1,515         $2,068                   $11,663         $6,492
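The gp3 figures above can be reproduced with the unit prices implied by the table (assumed: $0.08 per GB-month for gp3 storage, $0.005 per provisioned IOPS-month, $0.04 per MiB/s-month of throughput, and $0.10 per GB-month for the default gp2 volumes, billing the full provisioned amounts as the table appears to do):

```python
# Monthly EBS cost per cluster, using the assumed unit prices described above.
GP3_GB, GP3_IOPS, GP3_TPUT = 0.08, 0.005, 0.04
GP2_GB = 0.10

def gp3_monthly(nodes, size_gb=1000, iops=3000, mibps=512):
    return nodes * (size_gb * GP3_GB + iops * GP3_IOPS + mibps * GP3_TPUT)

def gp2_monthly(nodes, size_gb=512):        # default EMR configuration
    return nodes * size_gb * GP2_GB

for nodes in (41, 101):                     # executors + 1 coordinator
    total = gp3_monthly(nodes)
    delta = total - gp2_monthly(nodes)
    print(f"{nodes} nodes: ${total:,.0f}/month for gp3, ${delta:,.0f} over default gp2")
```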

It is important to mention that intra-regional data transfer cost was zero, since all components were launched in the same AWS region and VPC subnet. Otherwise, significant data transfer charges would have been incurred, likely amounting to hundreds of dollars for each TPC-H test set.

The always-on pricing above is for illustration purposes only, since the recommendation is to scale clusters in and out dynamically based on usage.

Below you can visualize how the hourly cost of each cluster size (40 vs. 100 executors) affected the total execution time for each dataset size.


Hourly Cost - 40 vs 100 workers - 10TB

Hourly Cost - 40 vs 100 workers - 30TB


For these tests, the larger cluster had a more significant impact on how long the whole set of queries took to execute as the dataset grew; in other words, the higher hourly cost delivered a better return on the 30TB dataset than on the 10TB dataset.


AWS Cost Management Recommendations

  • Use the EMR Managed Scaling feature to automatically resize clusters based on usage. Leaving a large cluster always on can easily add thousands of dollars per month in EC2 and EMR cost, while dynamic scaling can significantly lower AWS cost by allocating capacity based on actual usage patterns (see the sketch after this list).
  • Using automation tools, such as the AWS CLI or SDK, significantly simplifies the launch and management of EMR clusters, which can be used to implement custom, automated processes to reduce cost.
  • Consider using EMR automatic cluster termination after a configurable idle period, in order to avoid situations where large clusters are accidentally left running longer than necessary (also covered in the sketch after this list).
  • EC2 Spot Instances are a good option to consider in order to reduce cost, given they can save up to 90% compared to EC2 On Demand pricing. Be aware that Spot instances can be terminated at any point in time; therefore it’s important to ensure they’re used to execute fault-tolerant, asynchronous data analytics processes.
  • Allocate storage (S3) and compute resources in the same AWS region, in order to avoid inter-regional data transfer fees between S3 and EC2. Most inter-regional data transfer within the US costs around $20 per TB, but some regions outside the US charge as much as $147 per TB (e.g. Cape Town), which would add several thousand dollars of data transfer cost for the datasets used in these tests. In addition to the higher cost, keeping data and compute in different AWS regions can significantly hurt performance, especially with large datasets.
  • If there are no strict availability requirements, deploying all EC2 instances in the same VPC subnet can also reduce cost, given that intra-regional Data Transfer In and Out costs $10 per terabyte transferred. For a large dataset such as the one used in these tests, intra-regional data transfer can add up to hundreds of dollars, depending on the number and type of queries executed. For always-on applications with strict availability requirements, multi-AZ deployments are important to consider, but it is highly recommended to calculate cost in advance in order to avoid unexpected data transfer charges.
  • Once the usage patterns are predictable, it is highly recommended to purchase Reserved Instances. You can refer to this article I wrote, which details how to approach Reserved Instances in the context of EMR workloads.
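A minimal sketch of enabling managed scaling and idle auto-termination on an existing cluster with boto3 (the cluster ID and capacity limits are placeholders):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")
CLUSTER_ID = "j-XXXXXXXXXXXXX"               # placeholder cluster ID

# Let EMR resize the cluster between 2 and 41 instances based on workload.
emr.put_managed_scaling_policy(
    ClusterId=CLUSTER_ID,
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 41,
        }
    },
)

# Terminate the cluster automatically after one hour of inactivity.
emr.put_auto_termination_policy(
    ClusterId=CLUSTER_ID,
    AutoTerminationPolicy={"IdleTimeout": 3600},   # seconds
)
```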

Conclusions

  • EMR Spark successfully executed the full set of 22 TPC-H queries in both datasets (10TB and 30TB).
  • Queries showed consistent execution timings across all 3 sequential executions.
  • It is always recommended to use automation tools, such as the AWS CLI, SDK or CloudFormation, to launch and manage EMR clusters.
  • The average execution time does not necessarily increase proportionally to the size of the dataset. For example, 30TB with 40 executors took 3.17x as long to execute as 10TB with the same number of executor nodes, while 30TB with 100 executors took 2.33x as long as 10TB with 100 executor nodes.
    • Adding more nodes, in this case, resulted in better scalability when increasing the size of the dataset.
    • It is important to measure performance and calculate cost on larger datasets as the number of nodes is increased. There are situations where adding more nodes for larger datasets has a positive return on the overall query performance (potentially delivering cost savings). However, there is a point where higher cost might stop delivering proportionally better performance. Therefore, it’s important to execute tests with multiple cluster sizes and measure performance accordingly, in order to find the best cost/performance ratio for a particular application.

Do you need help optimizing your Big Data workloads in the cloud?

I can help you optimize your Big Data workloads in the cloud and make sure they deliver the right balance between performance and cost for your business. Click on the button below to schedule a free consultation or use the contact form.

Ernesto Marquez


I am the Project Director at Concurrency Labs Ltd, ex-Amazon (AWS), Certified AWS Solutions Architect and I want to help you run AWS optimally, so your applications reliably generate revenue for your business.

Running an optimal AWS infrastructure is complicated - that's why I follow a methodology that makes it simpler to run applications that will support your business growth.

Do you want to learn more? Do you have other questions related to AWS? Click on the button below to schedule a free 30-minute consultation.

Do you have any comments or questions about this post, or my services?