Querying 6.35 Billion Records - a TPC-DS Performance and Cost Comparison between Big Data platforms Starburst Enterprise and EMR SQL engines

Choosing the right Data Analytics solutions in the cloud is an increasingly important area for many applications. While AWS offers a wide range of managed services, it’s also important to explore other solutions that can be deployed in AWS and that can bring benefits to application owners who are looking to analyze large amounts of data. One tool with these characteristics is Starburst Enterprise.

In this article, I’ll compare performance, infrastructure setup, maintenance and cost related to 4 Data Analytics solutions: Starburst Enterprise, EMR Presto, EMR Spark and EMR Hive. As in previous articles, I want to answer the following: “What do I need to do in order to run this workload, how fast will it be and how much will I pay for it?”

Let’s do a refresh of the solutions under test…

  • AWS Elastic Map Reduce (EMR) is a managed service offered by AWS. It supports many data analysis platforms, such as Hadoop, Presto, Hive, Spark, and others. With EMR, developers can launch a compute cluster with pre-installed data analytics software and use default or custom configurations. EMR supports launching compute clusters consisting of EC2 instances or Fargate containers and it allows to configure compute capacity as well as Auto Scaling configurations to adjust compute resources based on system load.
  • Starburst Enterprise is a commercial distribution of Trino (formerly known as PrestoSQL) It can query data stored in multiple data sources (e.g. AWS S3, Hadoop, SQL and NoSQL databases, Google Cloud Storage, etc.). It can be easily launched in AWS through the AWS Marketplace.
  • EMR Presto is the Presto distribution delivered by AWS EMR.
  • Spark is an open source, distributed, in-memory, data analysis framework designed for large scale, that is also available in EMR.
  • Hive is an open source data warehouse solution built on top of Apache Hadoop. HiveQL also allows users to use SQL syntax to analyze data and it’s available in EMR as a managed solution. Of all platforms in this analysis, Hive is the one with the longest time in the market, with its initial release back in 2010.

Data and Infrastructure Setup

Below are the steps I followed, across all solutions, in order to run these tests:

Data

In all tests I leveraged the TPC-DS benchmark, which consists of a data set and queries created by the Transaction Processing Performance Council (TPC). TPC-DS is an industry standard when it comes to measuring performance across data analytics tools and databases in general. Please note, however, that this is not an official audited benchmark as defined by the TPC rules.

I created two 1TB TPC-DS data sets (ORC and Parquet), stored in AWS S3. Data sets contain approximately 6.35 billion records stored in 24 tables. The TPC-DS standard also consists of 99 SQL queries. For this test, the following tables were partitioned as described below:

  • catalog_returns on cr_returned_date_sk
  • catalog_sales on cs_sold_date_sk
  • store_returns on sr_returned_date_sk
  • store_sales on ss_sold_date_sk
  • web_returns on wr_returned_date_sk
  • web_sales on ws_sold_date_sk

Infrastructure

  • In order to maintain a data catalog, I set up an RDS-backed Hive Metastore using the steps described in the AWS documentation. This was powered by a single m4.large EC2 instance managed by EMR and a t3.medium RDS DB instance. All tests used the same Hive Metastore, which points to the S3 location for ORC and Parquet files, respectively.
  • I used Starburst Enterprise version 350-e available in the AWS Marketplace and applied default configurations. I launched the cluster using the CloudFormation template available in the AWS Marketplace page. The CloudFormation template allows users to configure compute and storage capacity (EC2 instance types and count, EBS volume types and size ) as well as custom Presto configurations (i.e. memory allocation, metastore location, etc.). The stack creates an Auto Scaling Group with all the provisioned EC2 instances and a number of custom CloudWatch metrics useful for detailed monitoring and troubleshooting. Support for CloudFormation allows application owners to easily implement automation procedures that launch, update or terminate clusters as needed.
  • I used the following versions for EMR tests: Presto 0.230, Spark 3.0.0 and Hive 3.1.2. I applied default EMR configurations in all tests (i.e. memory allocation per node or maximum allocation per query). Launching an EMR cluster can also be automated by using the AWS SDK or CLI, which simplifies infrastructure operations in general.
  • For EMR Hive, I enabled Live Long and Process (LLAP), which is a feature that is expected to increase performance by introducing in-memory caching. I configured default LLAP settings, which according to AWS documentation result in EMR allocating about 60 percent of cluster YARN resources to Hive LLAP daemons.
  • I executed all tests with clusters consisting of 20 r5.4xlarge workers and 1 master node running also on a r5.4xlarge EC2 instance.

Benchmark Executions

For each platform, I ran the same set of 99 queries that are part of the TPC-DS benchmark. Queries were executed sequentially against the same 1TB dataset stored in S3. For each platform, each query set was executed 4 times in a sequential order and the average of these 4 executions is the number I report in this article. Each test for a particular platform was executed on a fresh cluster, in order to make all conditions equal. I ran ANALYZE statements for each of the 24 tables that are part of the test.

I ran separate tests for both ORC and Parquet data formats and below are the results.

ORC - Performance Comparison

qid Starburst 350-e EMR Presto 0.230 EMR Spark 3.0.0 EMR Hive 3.1.2 LLAP
q01 7 11 39 N/A
q02 22 23 43 N/A
q03 4 23 28 34
q04 33 N/A 127 163
q05 16 70 52 139
q06 4 44 33 35
q07 7 19 36 53
q08 7 11 32 26
q09 24 22 71 144
q10 6 24 38 40
q11 19 N/A 62 115
q12 2 10 30 28
q13 8 64 42 60
q14 70 185 76 N/A
q15 4 11 33 37
q16 15 52 67 N/A
q17 9 40 40 61
q18 6 13 39 62
q19 6 21 32 33
q20 3 11 30 29
q21 6 14 N/A N/A
q22 13 10 33 138
q23 71 116 117 N/A
q24 37 47 76 N/A
q25 7 28 40 60
q26 4 11 32 54
q27 6 19 34 N/A
q28 19 21 70 146
q29 10 27 49 120
q30 5 17 38 N/A
q31 9 29 41 58
q32 3 11 31 N/A
q33 6 17 35 30
q34 4 12 N/A 44
q35 7 11 43 82
q36 4 22 34 N/A
q37 11 10 36 116
q38 10 16 50 108
q39 3 28 36 36
q40 13 61 41 159
q41 1 1 24 18
q42 3 15 28 28
q43 7 24 30 51
q44 11 11 48 147
q45 3 11 33 42
q46 6 25 34 57
q47 36 124 43 76
q48 7 31 38 55
q49 14 81 48 157
q50 11 43 62 177
q51 12 22 48 105
q52 5 15 28 27
q53 5 13 N/A 57
q54 12 31 36 135
q55 3 15 28 27
q56 5 18 34 29
q57 20 64 41 76
q58 7 45 N/A 48
q59 24 31 40 143
q60 7 19 35 31
q61 6 41 30 27
q62 11 28 N/A N/A
q63 5 14 35 53
q64 72 122 75 167
q65 17 30 51 132
q66 6 78 39 69
q67 59 501 402 244
q68 7 34 33 31
q69 5 10 36 32
q70 17 33 41 N/A
q71 6 20 36 33
q72 34 N/A 53 1501
q73 4 16 29 27
q74 15 69 52 98
q75 30 82 N/A 163
q76 12 31 50 153
q77 11 74 40 37
q78 52 104 74 501
q79 5 30 33 35
q80 23 84 54 166
q81 6 15 40 N/A
q82 11 25 41 121
q83 6 26 N/A 34
q84 11 15 35 138
q85 15 23 57 166
q86 4 10 31 N/A
q87 10 24 54 123
q88 23 70 48 118
q89 6 16 35 56
q90 11 10 33 120
q91 4 45 30 20
q92 3 10 31 N/A
q93 27 99 74 155
q94 13 36 60 N/A
q95 34 83 83 N/A
q96 10 10 29 116
q97 12 31 46 119
q98 3 19 31 31
q99 10 61 42 142
Avg 13.38 40.01 48.03 106.38
Max 72.00 501.00 402.00 1501.25
Min 0.89 1.00 23.50 17.50

ORC results

The table below shows the average query execution time relative to each engine, as a percentage.

relative to:
EMR Presto 0.230 EMR Spark 3.0.0 EMR Hive 3.1.2 LLAP Starburst 350-e
EMR Presto 0.230 N/A 83% 38% 299%
EMR Spark 3.0.0 120% N/A 45% 359%
EMR Hive 3.1.2 LLAP 266% 221% N/A 795%
Starburst 350-e 33% 28% 13% N/A

Observations for the ORC dataset:

EMR Presto didn’t return any syntax errors, but it returned 2 query execution failures: q4 and q11 (Insufficient Memory). After 30 minutes, I had to cancel execution for q72, so I’m considering that query as failed as well. Excluding these 3 queries, the average execution time for EMR Presto was 40s.

EMR Spark initially returned 36 errors when parsing queries, which were corrected by applying minor updates (mostly adding ` to some strings). For the following 7 queries, I couldn’t find a simple way to fix errors without possibly altering the query structure (and the validity of the test): q21, q34, q53, q58, q62, q75, q83. Excluding those queries, the average query execution time was 48s.

EMR Hive returned syntax errors for the following 17 queries, which similarly to Spark errors I couldn’t find a simple way to fix without possibly affecting the validity of the test: q14, q16, q23, q24, q27, q30, q32, q36, q62, q70, q81, q86, q92, q94, q95.

There were no failures returned by Starburst Enterprise and its average query execution was 13.4 seconds - the fastest among all 4 engines under analysis and it was the only engine that successfully executed all 99 queries. On average, Starburst Enterprise was 2.5x faster compared to EMR Presto, 3.9x faster compared to EMR Spark and 7.1x faster compared to EMR Hive.

Parquet - Performance Comparison

qid Starburst 350-e EMR Presto 0.230 EMR Spark 3.0.0 EMR Hive 3.1.2 LLAP
q01 7 13 73 N/A
q02 22 22 40 N/A
q03 4 22 28 64
q04 40 N/A 120 194
q05 16 68 49 211
q06 5 42 71 36
q07 9 25 33 89
q08 6 11 30 37
q09 37 34 48 221
q10 6 28 36 73
q11 22 N/A 80 139
q12 3 10 74 32
q13 12 70 74 92
q14 83 205 67 N/A
q15 4 13 32 55
q16 15 28 65 N/A
q17 9 50 43 72
q18 7 18 41 94
q19 8 28 31 53
q20 3 11 76 34
q21 21 30 N/A N/A
q22 25 19 40 118
q23 78 130 82 N/A
q24 44 54 68 N/A
q25 8 35 41 68
q26 5 16 32 85
q27 8 25 34 N/A
q28 43 43 50 212
q29 12 32 48 174
q30 7 41 73 N/A
q31 10 34 37 82
q32 3 11 78 N/A
q33 5 20 34 37
q34 5 17 N/A 74
q35 7 12 41 138
q36 6 20 33 N/A
q37 17 17 39 182
q38 11 43 51 162
q39 19 39 71 53
q40 15 65 67 233
q41 1 1 23 18
q42 4 16 29 36
q43 2 22 30 79
q44 18 17 71 216
q45 4 11 31 50
q46 7 24 34 96
q47 33 138 73 107
q48 9 36 72 90
q49 18 84 44 236
q50 13 57 83 245
q51 12 27 49 131
q52 4 16 29 38
q53 5 16 N/A 81
q54 14 30 37 213
q55 4 15 28 35
q56 5 20 35 37
q57 20 72 77 106
q58 7 49 N/A 55
q59 27 34 39 212
q60 5 20 35 46
q61 2 31 31 44
q62 11 32 N/A 189
q63 6 16 78 82
q64 86 137 71 272
q65 19 35 76 152
q66 7 80 37 95
q67 57 532 396 271
q68 11 27 32 48
q69 5 11 36 53
q70 18 35 37 N/A
q71 8 22 38 52
q72 46 N/A 76 N/A
q73 5 13 29 53
q74 17 N/A 52 126
q75 33 90 N/A 263
q76 15 33 44 219
q77 11 81 44 49
q78 54 131 72 735
q79 8 27 33 81
q80 26 100 52 249
q81 7 16 73 N/A
q82 19 24 45 191
q83 6 27 N/A 39
q84 12 13 69 200
q85 15 25 73 230
q86 5 10 29 N/A
q87 11 25 54 182
q88 50 84 41 198
q89 6 17 78 84
q90 11 10 31 183
q91 2 47 31 26
q92 3 10 76 N/A
q93 32 101 70 237
q94 14 36 75 N/A
q95 35 83 83 N/A
q96 11 11 30 192
q97 13 31 46 149
q98 3 20 76 36
q99 11 62 75 200
Avg 15.95 43.97 55.70 128.58
Max 85.50 531.50 396.00 735.00
Min 0.88 1.00 23.25 18.00

Parquet results

The table below shows the average query execution time relative to each engine, as a percentage:

relative to:
EMR Presto 0.230 EMR Spark 3.0.0 EMR Hive 3.1.2 LLAP Starburst 350-e
EMR Presto 0.230 N/A 79% 34% 276%
EMR Spark 3.0.0 127% N/A 43% 349%
EMR Hive 3.1.2 LLAP 292% 231% N/A 806%
Starburst 350-e 36% 29% 12% N/A

Observations for the Parquet dataset:

EMR Presto experienced the same errors as in the ORC execution, with the addition of q74 (Insufficient Memory), for a total of 4 failures. Excluding these 4 queries, the average execution time for EMR Presto was 44s.

EMR Spark returned the same 36 errors as in the ORC execution and the same 7 queries could not be fixed (q21, q34, q53, q58, q62, q75, q83). Excluding those queries, the average query execution time was 55.7s.

All EMR Hive errors in the Parquet executions are the same as ORC (ParseException). The only additional error in EMR Hive Parquet executions is q72, which I had to cancel after 30mins. Excluding failed queries, the average query execution time was 128.6s.

Starburst Enterprise didn’t return any failures and its average query execution was approximately 16 seconds - the fastest among all 4 engines under analysis. On average, Starburst Enterprise was 2.38x faster compared to EMR Presto, 3.45x faster compared to EMR Spark and 9.1x faster compared to EMR Hive.

ORC vs. Parquet Summary

Starburst Enterprise executed all queries successfully in both ORC and Parquet formats. EMR Presto returned one additional error in Parquet (q74: Insufficient Memory) compared to ORC. For EMR Hive, I had to cancel q72 after 30 minutes in the Parquet test.

The following table shows average execution times for each engine and data format. In general, all four engines performed better when querying ORC data vs. Parquet.

Average execution (seconds):

ORC Parquet ORC vs. Parquet
Starburst 350-e 13.38 15.95
EMR Presto 0.230 40.01 43.97
EMR Spark 3.0.0 48.03 55.70
EMR Hive 3.1.2 LLAP 106.38 128.58

AWS Cost Analysis

The following table compares the cost of running a cluster consisting of 20 executors + 1 coordinator in the N. Virginia region as well as a monthly projection assuming 720 hours per month for an always-on cluster.

Both EMR and Starburst Enterprise share a pricing model where you pay the regular EC2 instance compute fee plus a license fee.

Starburst Enterprise EMR Hive
Component Price Dimension Usage Hourly Cost Monthly Cost Hourly Cost Monthly Cost
Data store 234GB (ORC) 234GB $0.01 $5 $0.01 $5
Executor + Coordinator Nodes EC2 Compute 11 r5.8xlarge instances $22.18 $15,966.72 $22.18 $15,966.72
Starburst Presto Enterprise License AWS Marketplace ($0.799/hour for 1 r5.8xlarge) 11 r5.8xlarge instances $8.79 $6,328.08 N/A N/A
EMR Fee $0.252/hr (r5.4xlarge) 11 r5.8xlarge instances N/A N/A $2.97 $2,138.40
Hive Metastore EC2 Compute + EMR fee 1 m4.large $0.12 $86.40 N/A N/A
RDS MySQL Hive Metastore storage 1 db.t3.medium $0.07 $48.96 $0.07 $48.96
EBS Storage EBS Volume Usage - gp2 50GB x 11 = 550GB $0.08 $55.00 $0.08 $55.00
Data Transfer S3 to EC2 - Intraregional Data Transfer N/A $0.00 $0.00 $0.00 $0.00
$31.24 $22,490.54 $25.30 $18,214.46

License costs are 60% higher for Starburst Enterprise compared to EMR ($0.40/hour vs. $0.25/hour per each r5.4xlarge EC2 instance). The monthly license cost for a 21-node, always-on, cluster would be $3.8K for EMR, compared to $6K for Starburst Enterprise. EMR supports per-second billing, while Starburst Enterprise bills per hour.

However, it’s worth noting that the average execution took Starburst Enterprise a fraction of the time it took other engines running on EMR (between 12% and 36% on average). Results will vary according to your specific needs, but having a faster tool will result in less compute time consumed. With this in mind and applying usage-based scaling, I think it’s fair to expect at least 50% savings when using Starburst Enterprise.

AWS Cost Management Recommendations

If not managed properly, Big Data analytics workloads can easily turn into costly deployments. Therefore it’s important to apply as many optimization strategies as possible. Here are some recommendations:

  • Workloads often result in many TBs of data being transferred across different servers and data sources, which can easily turn into high data transfer costs. Here are two areas to pay special attention to:
    • Inter-regional Data Transfer. In order to avoid intra-regional data transfer fees, it’s important to ensure S3 buckets (or any data sources) are deployed in the same AWS region as EC2 instances or Fargate containers. Otherwise, you should expect inter-regional data transfer fees between S3 and compute nodes. These charges can range between $10 and even close to $150 per TB, depending on the origin and destination regions. Most regions in the US result in $20 per TB for inter-regional data transfer. Also, performance will be much better if all components are deployed in the same region.
    • Intra-regional Data Transfer. Even though in most applications it’s a good practice to deploy compute resources across multiple Availability Zones, for many big data analysis deployments this can turn out to be expensive. Data transfer across AZs in the same region costs $10/TB in each direction. Given that clusters usually transfer large amounts of data across nodes, this cost can add up over time (I’ve seen it reach a few thousand dollars per month in some cases). If your use case can handle a situation without extra AZ redundancy, you could save money by deploying all compute resources in the same AZ.
  • Always use automation tools, such as custom scripts or CloudFormation templates. This way clusters can be terminated or adjusted easily, resulting in cost savings. Starburst Enterprise provides a CloudFormation template, which simplifies the provisioning of clusters. EMR has a number of APIs and CLI commands available that simplify cluster management automation.
  • Both EMR and Starburst Enterprise support Auto Scaling. I recommend applying rules that decrease cluster size to a minimum during idle periods, based on CPU Utilization metrics. You can also configure Scheduled Actions to decrease cluster size outside of high utilization hours.
  • Always consider purchasing EC2 Reserved Instances or Savings Plans. There are a number of variables to consider before purchasing EC2 Reserved Instances, the most important one being the type of compute capacity that you’ll need in the long run. I wrote two articles about this, one for EC2 and one for EMR, which will give you more details. Spot Instances are always an option. You only have to make sure that your use case supports a situation where instances get terminated in the middle of an execution. You can achieve savings of close to 80% compared to EC2 On Demand cost.

Conclusions

  • For the ORC dataset, on average Starburst Enterprise was 2.5x faster compared to EMR Presto, 3.84x faster compared to EMR Spark and 7.14x faster compared to EMR Hive with LLAP enabled.
  • For the Parquet dataset, on average Starburst Enterprise was 2.38x faster compared to EMR Presto, 3.45x faster compared to EMR Spark and 7.14x faster compared to EMR Hive with LLAP enabled.
  • Starburst Enterprise successfully executed all 99 queries in scope. Excluding query syntax errors, EMR Presto returned 3 failures in the ORC dataset and 4 failures in the Parquet dataset due to Insufficient Memory errors. Also, excluding query syntax errors, EMR Hive Parquet execution had to be cancelled for q72 after 30mins of execution. EMR Spark didn’t result in execution times beyond 30 minutes or failed executions due to memory errors, without considering the 7 queries that could not be parsed.
  • Even though license costs are 60% higher for Starburst Enterprise compared to EMR ($0.40/hour vs. $0.25/hour per each r5.4xlarge EC2 instance), the higher performance delivered by Starburst will result in lower compute cost compared to EMR. Most likely 50%-70% lower depending on the type of workload and usage patterns.

Considering cost, performance and infrastructure setup, Starburst Enterprise is definitely an option that should be considered for data analytics workloads in AWS. While EMR provides a solid solution and a wide range of platforms, for this set of tests Starburst Enterprise delivered better results compared to EMR.

Do you need help optimizing your Big Data workloads in the cloud?

I can help you optimize your Big Data workloads in the cloud and make sure they deliver the right balance between performance and cost for your business. Click on the button below to schedule a free consultation or use the contact form.

Ernesto Marquez

ErnestoMarquezProfilePic

I am the Project Director at Concurrency Labs Ltd, ex-Amazon (AWS), Certified AWS Solutions Architect and I want to help you run AWS optimally, so your applications reliably generate revenue for your business.

Running an optimal AWS infrastructure is complicated - that's why I follow a methodology that makes it simpler to run applications that will support your business growth.

Do you want to learn more? Do you have other questions related to AWS? Click on the button below to schedule a free 30-minute consultation.

Do you have any comments or questions about this post, or my services?