Querying 6.35 Billion Records - a TPC-DS Iceberg Performance and Cost Comparison between Starburst Galaxy and Snowflake

Data Analytics continues to be a critical area for most modern applications and there are many solutions available, therefore it’s important for application owners to be aware of their options. Given that I focus on Amazon Web Services, I’m interested in solutions that can be launched on compute infrastructure in AWS and that access data stored in the cloud.

For this article, I focused on Data Analytics platforms Starburst Galaxy and Snowflake with compute and data storage infrastructure deployed on AWS. Similar to other articles, I will focus on areas such as performance, infrastructure setup and cost, with the following angle: “What do I need to do in order to run this workload, how fast will it be and how much will I pay for it?”

  • Starburst Galaxy is a managed data analytics platform built on top of the Trino query engine (formerly known as PrestoSQL). It is available on multiple cloud providers, such as AWS, GCP or Azure. It can also query data stored in a wide range of options, including these cloud providers and other data sources.
  • Snowflake is a managed data analytics platform that also supports launching compute infrastructure in the cloud (e.g. AWS, GCP and Azure) and a number of data sources, including storage in these mentioned cloud providers.

For the tests I executed, I focused on compute infrastructure deployed on Amazon Web Services, in the N. Virginia region and data stored in AWS S3 in the same region.

Data and Infrastructure Setup

Below are the infrastructure configuration details in order to execute these tests on both Starburst Galaxy and Snowflake.

Data

  • The data used in this set of tests corresponds to a 1TB TPC-DS dataset, in Apache Iceberg table format.
  • Files used in the Iceberg tables were stored in an S3 bucket owned by Concurrency Labs, in the N. Virginia region. The dataset resulted in 12,696 files, for a total of 234 GB. File size is expected, given they are stored in compressed format.
  • Both Starburst Galaxy and Snowflake accessed data through a catalog managed by the AWS Glue service.
  • Iceberg tables were created using DDL statements following this pattern: CREATE TABLE glue.<table-name> (<column-definition) WITH (format=PARQUET, location=<s3-location>, type=ICEBERG;)
  • Data was inserted into S3 using as a source data stored in Parquet format available in the tpcds.sf1000 schema managed by Starburst Galaxy. This was done using INSERT SQL statements for each TPC-DS table, with the following pattern: INSERT INTO glue.<schema>.<table-name> SELECT <columns> FROM tpcds.sf1000.<table-name>;
  • Row count for each table was confirmed to comply with the TPC-DS standard (approximately 6.35 billion records in total, stored in 24 tables).
  • The 99 TPC-DS queries used in these tests correspond to the TPC-DS specification.

Compute Infrastructure

Compute infrastructure was launched using the Starburst Galaxy and Snowflake GUI. Both platforms were configured to launch compute resources on Amazon Web Services in the N. Virginia AWS region (us-east-1) and were granted cross-account permissions (IAM Roles) to access data stored in an S3 bucket owned by Concurrency Labs and an AWS Glue catalog in the same AWS account as the S3 bucket.

Starburst Galaxy

  • Size: Large (16 credits/hour).
  • Cloud provider: AWS - N. Virginia
  • Cluster type: Accelerated (Warp Speed)
  • Number of workers: 16
  • Query result caching: off.

Snowflake

  • Size: X-Large Warehouse (16 credits/hour).
  • Cloud provider: AWS - N. Virginia
  • Type: Standard
  • Number of clusters:
    • 1 scenario with multi-cluster disabled (16 credits)
    • 1 scenario with scaling enabled to provision between 1 and 3 clusters (16-48 credits). This warehouse had 3 clusters provisioned throughout the whole query execution, given the load induced into the system.
  • Query acceleration: off
  • Query result caching: off (ran statement ALTER SESSION SET USE_CACHED_RESULT=FALSE)

Both Starburst Galaxy and Snowflake deliver a well documented way to launch compute infrastructure. However, none of these platforms expose more details about the underlying compute capacity available in each deployment type (i.e. vCPUs, memory, etc.), which would allow users to have more context regarding other cloud-based solutions, such as AWS EC2 or AWS EMR.

Test Scenarios

All scenarios were executed with 5 concurrent users sequentially executing the 99 TPC-DS queries, iterating through the whole set of queries a number of 5 times per user (i.e. each user executes queries 1 through 99 and repeats the sequence 5 times). Below are the three platform scenarios tested:

Scenario Description
TPC-DS 1 TB Iceberg - Starburst Galaxy Large 5 concurrent users executed 5 iterations of 99 TPC-DS queries sequentially, using Starburst Galaxy on a Large cluster of type Warp Speed.
TPC-DS 1 TB Iceberg - Snowflake X-Large - 1 cluster Same as above, but using a Snowflake X-Large Standard Warehouse with multi-cluster disabled.
TPC-DS 1 TB Iceberg - Snowflake X-Large - 1-3 clusters Same as above, but using a Snowflake X-Large Standard Warehouse with multi-cluster scaling enabled between 1 and 3 clusters.

Tests were executed using Apache JMeter, a load testing tool designed to trigger and measure performance of concurrent requests. Tests were triggered from an EC2 instance launched in the N. Virginia region and the results below were calculated using the elapsed time reported for each transaction by JMeter and recorded in the EC2 instance.

Test Results

TPC-DS 1 TB Iceberg

The table below shows the average, maximum and minimum execution time in seconds, for the 25 executions triggered for each query (5 concurrent users executing 5 sequential sets of TPC-DS queries).

Starburst Galaxy - Large - WarpSpeed Snowflake - X-Large Standard Warehouse - 1 cluster Snowflake - X-Large Standard Warehouse - 3 clusters
avgmaxminavgmaxminavgmaxmin
q017183q0112477q015113
q02372q0216405q02483
q03251q0311644q03361
q04294712q046814339q04425132
q0512186q0525635q055113
q06473q0616566q06463
q076103q0713693q07462
q08574q0814623q08342
q097134q09386022q09212815
q10594q1022625q10483
q1117238q11427521q11253119
q12352q126512573q12241
q1310216q136612585q13463
q14355014q146980826q14222917
q15483q15477954q15594
q169155q1612433q16482
q177134q1719486q17584
q189165q1815445q18584
q19493q19385326q19583
q20362q2024813053q20241
q21342q211108383q21241
q2210146q221228423q22463
q23466920q23220266027q23294021
q24212910q24265143622132q24222343442123
q256164q2519511227q25473
q26473q2614312975q26352
q27483q271297823q27362
q288133q288887213q28141812
q298255q2914512997q29574
q30584q3037325233q305103
q31694q311008185q316124
q32232q329502q32131
q338174q337812933q334102
q34343q3411341q34251
q356124q35488417q358125
q36473q366412914q36241
q37242q379303q37241
q388174q385885016q38142110
q39785q398193q39352
q40342q4010283q40352
q41111q4110601q41130
q42241q426381q42141
q43352q4310392q43221
q44352q446312247q447115
q45584q4513755q45473
q465113q46447605q46574
q47324815q4716319q478147
q489345q4811204q48363
q496104q4911225q496123
q5017319q5011186q50574
q517143q5112265q51574
q52231q527391q52131
q53372q537312q53241
q54584q549273q54362
q55221q558212q55131
q566123q569243q56382
q5719309q5712325q57593
q589264q58181183q58362
q596142q59202166q59694
q607134q60211635q60573
q616124q61181234q61574
q625133q62374912q62241
q63372q63261492q63241
q64172313q6410256922q64242920
q6511165q65228510q658116
q66583q66271692q66482
q67699236q67183324149q67158176141
q685103q68261546q68463
q69574q69361536q69483
q707145q7023554q70473
q71483q71201453q71251
q72284615q724918419q72192414
q73353q73423062q73231
q749185q744516920q74202715
q7516299q75326216q75162311
q769214q76372976q76794
q77694q7713533q77462
q78294312q7812115193q787310653
q795114q796212414q795104
q807105q8075130610q808126
q816144q81557934q81463
q82362q82528382q82242
q836194q8312292q83371
q845153q84201093q84362
q8510207q85191015q856114
q86583q869221q86241
q877153q87335218q87223017
q888135q8878128614q88162113
q89362q8910282q89251
q90362q90418151q90381
q915164q916312931q91241
q92292q928192q92352
q93223911q93448355q93695
q946154q9416752q94372
q9513316q9512303q95472
q96382q967112q96252
q9710166q9718487q979146
q983102q9817772q98231
q997144q9910532q99362
Avg
(sec)
9165Avg
(sec)
7344830Avg
(sec)
315628
Total
(sec)
8611,549482Total
(sec)
7,26144,3582,967Total
(sec)
3,0435,5552,741


Average execution time (sec)Comparisons
Starburst GalaxySnowflake - 1 clusterSnowflake - 3 clustersGalaxy vs Snowflake 1 clusterGalaxy vs Snowflake 3 clustersSnowflake 3 clusters vs. 1 cluster
q0171250.551.330.42
q0231640.220.810.27
q0321130.220.920.24
q042968420.430.690.61
q05122550.472.560.18
q0641640.240.950.25
q0761340.451.660.27
q0851430.361.930.19
q09738210.180.320.55
q1052240.251.320.19
q111742250.40.660.6
q1236520.041.450.03
q13106640.152.260.07
q143569220.51.60.31
q1544750.080.80.11
q1691240.742.490.3
q1771950.361.260.29
q1891550.591.830.32
q1943850.110.980.12
q20324820.011.420.01
q21311020.021.80.01
q221012240.082.40.03
q2346220290.211.620.13
q24N/AN/AN/AN/AN/AN/A
q25619540.031.30.02
q26414330.031.520.02
q27412930.031.540.02
q28888140.090.580.16
q29814550.061.750.03
q30537350.011.170.01
q31610060.061.050.06
q322910.241.620.15
q3387840.11.830.05
q3431120.281.480.19
q3564880.120.770.16
q3646420.061.90.03
q372920.241.210.2
q38858140.130.560.24
q397830.862.10.41
q4031030.311.110.28
q4111010.091.290.07
q422610.271.70.16
q4331020.281.870.15
q4436370.050.410.11
q4551340.391.290.3
q4654450.111.060.11
q47321681.993.880.51
q4891130.812.560.31
q4961160.551.050.52
q50171151.633.870.42
q5171250.551.350.41
q522710.261.890.14
q533720.371.630.23
q545930.591.580.38
q552810.21.60.13
q566930.651.850.35
q57191251.564.040.39
q5891830.52.620.19
q5962060.31.070.28
q6072150.321.470.22
q6161850.331.240.26
q6253720.142.430.06
q6332620.111.750.06
q6417102240.170.730.23
q65112280.491.30.38
q6652740.191.190.16
q67691831580.380.440.86
q6852640.171.070.16
q6953640.141.30.11
q7072340.311.890.16
q7142020.181.590.11
q722849190.571.510.38
q7334220.0820.04
q74945200.190.430.45
q751632160.51.030.49
q7693770.231.30.18
q7761340.461.610.28
q7829121730.240.40.61
q7956250.0810.08
q8077580.10.940.1
q8165540.11.280.08
q8235220.061.320.04
q8361230.462.070.22
q8452030.221.50.15
q85101960.511.670.3
q865920.522.760.19
q87733220.20.30.65
q88878160.110.510.21
q8931020.311.630.19
q9034130.071.070.07
q9156320.072.090.04
q922830.290.810.35
q93224460.513.660.14
q9461630.372.190.17
q95131241.123.50.32
q963720.421.220.34
q97101890.571.110.51
q9831720.161.560.11
q9971030.692.40.29
Avg94780.181.020.18
Median62240.241.460.19
Max693731581.994.040.86
Min1610.010.30.01
Total (avg)8404,610820

q24 took significantly longer to execute in Snowflake scenarios compared to Galaxy (>100x on average). Therefore this query was removed from the comparison tables, in order to avoid data distortion and to make the total comparisons more accurate. Even though this would require more troubleshooting, data from similar tests using Snowflake interacting with other catalogs show better performance for this query, therefore this could be related to the Snowflake integration with AWS Glue as a data catalog.

TPC-DS results - 1 cluster

TPC-DS results - 1 vs 3 clusters

The table below shows the average query execution time relative to each scenario, as a percentage (excluding q24):

relative to:
Starburst Galaxy
Large - WarpSpeed
Snowflake X-Large
Standard Warehouse
1 cluster
Snowflake X-Large
Standard Warehouse
3 clusters
Starburst Galaxy
Large - WarpSpeed
N/A 18% 102%
Snowflake X-Large
Standard Warehouse
1 cluster
555% N/A 562%
Snowflake X-Large
Standard Warehouse
3 clusters
98% 18% N/A

Observations:

  • q24 took > 100x to complete in Snowflake scenarios compared to Starburst Galaxy.
  • All queries executed successfully for both Starburst Galaxy and Snowflake.
  • Starburst Galaxy was on average 5.55x faster compared to Snowflake X-Large running 1 cluster (excluding q24 from this comparison).
  • On average, Starburst Galaxy performance was virtually equal compared to Snowflake X-Large running 3 clusters (excluding q24 from this comparison).


Cost Analysis

The following table compares the cost of running a Large Starburst Galaxy cluster (16 credits/hour) and Snowflake Warehouses (1 and 3 clusters) in the AWS N. Virginia region. The cost per credit below is based on the publicly available documentation for both platforms.

Starburst Galaxy Large

Component Price Dimension Usage Hourly Cost Monthly Cost
Data store S3 Standard Storage 234 GB (Iceberg format) $0.01 $5
Starburst Galaxy Large Cluster Credits per hour ($2.80) Large cluster: 16 credits per hour $44.80 $32,256
Data Transfer S3 to EC2 - Intra-Regional Data Transfer N/A $0.00 $0.00
Total: $44.81 $32,261

Snowflake X-Large Warehouse - 1 cluster

Component Price Dimension Usage Hourly Cost Monthly Cost
Data store S3 Standard Storage 234 GB (Iceberg format) $0.01 $5
Snowflake X-Large Warehouse (1 cluster) Credits per hour ($3.00) X-Large warehouse (1 cluster): 16 credits per hour $48.00 $34,560
Data Transfer S3 to EC2 - Intra-Regional Data Transfer N/A $0.00 $0.00
Total: $48.01 $34,565

Snowflake X-Large Warehouse - 3 clusters

Component Price Dimension Usage Hourly Cost Monthly Cost
Data store S3 Standard Storage 234 GB (Iceberg format) $0.01 $5
Snowflake X-Large Warehouse (3 clusters) Credits per hour ($3.00) X-Large warehouse (3 clusters): 48 credits per hour $144.00 $103,680
Data Transfer S3 to EC2 - Intra-Regional Data Transfer N/A $0.00 $0.00
Total: $144.01 $103,685

It is highly recommended to configure the Auto Suspend feature - for both Galaxy and Snowflake, within a range of 3-5 minutes in order to avoid the possibility of accidentally leaving compute resources running when not being utilized. For AWS deployments, it’s also highly recommended to deploy compute resources and storage (in this case, S3) in the same AWS region, in order to avoid Intra-Regional Data Transfer cost and reduce latency as much as possible.

Considering all required components (storage, compute, etc.), owning an always-on Starburst Galaxy Large cluster would cost approximately $32.3K per month, while an equivalent Snowflake X-Large warehouse (1 cluster, 16 credits) would cost $34.6K. The Snowflake warehouse that had equivalent performance to Starburst Galaxy (Large Warehouse with 3 clusters - 48 credits) would cost approximately $104K per month, or 3.2x compared to Starburst Galaxy.

Conclusions

  • Both platforms are relatively simple to launch using their respective GUIs. When using data stored in AWS, it is required to configure cross-account IAM Roles in order to grant these platforms access to the data that will be analyzed.
  • Neither of these two platforms exposes more details about the underlying compute capacity available in each deployment type (i.e. vCPUs, memory, etc.). This information would be useful, since it would allow application owners to have more context regarding other cloud-based solutions, such as AWS Redshift, EC2 or EMR. The main comparison that can be made against other cloud-based platforms is based on cost relative to performance.
  • All TPC-DS queries executed successfully for Starburst Galaxy and Snowflake.
    • However, q24 took approximately >100x to complete in Snowflake scenarios compared to Starburst Galaxy, therefore it was removed from comparison calculations.
  • Excluding q24, Starburst Galaxy Large (16 workers) was on average 5.55x faster compared to Snowflake X-Large running 1 cluster (16 credits).
  • Excluding q24, Starburst Galaxy Large (16 workers) had on average equal performance compared to Snowflake X-Large running 3 clusters (48 credits).
  • Considering all required components (storage, compute, etc.), running an always-on Starburst Galaxy Large cluster would cost approximately $32.3K per month, while an equivalent Snowflake X-Large warehouse (1 cluster, 16 credits) would cost $34.6K.
  • ​​The Snowflake warehouse that had equivalent performance to Starburst Galaxy (Snowflake Large Warehouse with 3 clusters - 48 credits) would cost $104K per month, or approximately 3.2x compared to Starburst Galaxy.

Do you need help evaluating the right Data Analytics platform or optimizing your Big Data workloads in the cloud?

I will help you find the right Data Analytics platform for your business needs and optimize your Big Data workloads in the cloud to make sure they deliver the right balance between performance and cost for your business. Click on the button below to schedule a free consultation or use the contact form.

Ernesto Marquez

ErnestoMarquezProfilePic

I am the Project Director at Concurrency Labs Ltd, ex-Amazon (AWS), Certified AWS Solutions Architect and I want to help you run AWS optimally, so your applications reliably generate revenue for your business.

Running an optimal AWS infrastructure is complicated - that's why I follow a methodology that makes it simpler to run applications that will support your business growth.

Do you want to learn more? Do you have other questions related to AWS? Click on the button below to schedule a free 30-minute consultation.

Do you have any comments or questions about this post, or my services?