Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

Amazon Redshift continues to lead in data warehouse price-performance (for examples, see Amazon Redshift continues its price-performance leadership, Amazon Redshift: Lower price, higher performance, and Get up to 3x better price performance with Amazon Redshift than other cloud data warehouses). Amazon Redshift’s advanced Query Optimizer is a crucial part of that leading performance. The Query Optimizer is responsible for finding the fastest way (or plan) to execute a query. It does this by using statistics about the data together with the query to calculate a cost of executing the query for many different plans.

Amazon Redshift has built-in autonomics to collect statistics called automatic analyze (or auto analyze). Auto analyze is a background operation that runs automatically on Redshift tables to keep statistics up-to-date. Statistics collection, however, can be computationally expensive, making it a challenge to keep statistics up-to-date particularly when data is continuously being ingested. As data is ingested into the Redshift data warehouse over time, statistics could become stale, which in turn causes inaccurate selectivity estimations, leading to sub-optimal query plans that impact query performance.

Challenges with stale statistics

Based on Redshift fleet analysis of customer workloads, we found that the staleness of statistics is an especially important factor in the selectivity estimation of predicates with temporal columns such as those with DATE and TIMESTAMP data types. This is due to the following reasons: 1) DATE and TIMESTAMP represent about 11% of predicate columns in the queries in the Amazon Redshift fleet (see Figure 1); 2) More than 40% of query scan volume in the Amazon Redshift fleet have predicates on DATE or TIMESTAMP columns; and 3) Not surprisingly, customer workloads tend to query recent (hot) data more often than historical (cold) data. One such query pattern representative of these customer workloads, derived from the industry standard TPC-H analytics benchmark, is as follows:

SELECT ...
FROM   lineitem
       JOIN orders ON l_orderkey = o_orderkey
       JOIN customer ON ...
WHERE l_shipdate >= current_date - $1
  AND ...

Figure 1: Amazon Redshift fleet metrics on temporal vs non-temporal data types

Solution overview

Amazon Redshift introduced a new selectivity estimation technique in Amazon Redshift patch release P183 (v1.0.75379) to address the situation — having up-to-date statistics on temporal columns improving query plans and thereby performance. The new technique captures real-time statistical metadata gathered during data ingestion without incurring additional computational overhead. For queries with range predicates on temporal columns, the query optimizer uses this additional metadata fetched at runtime to complement the existing statistics, elastically adjusting the histogram boundaries, leading to improved selectivity estimations for temporal predicates. See Figures 2 & 3 for the performance improvements that elastic histograms for selectivity estimation delivers. This query processing optimization is enabled by default requiring no configuration changes or user intervention from users to realize the benefits of automatic optimization and improved query performance.

Benchmark evaluation

We evaluated the new selectivity estimation technique on variations of TPC-H queries. In one variation, the query performs an n-way join between lineitem, orders, and other tables with multiple predicates, including on l_shipdate.

When histogram statistics were stale, the selectivity estimations of predicates on l_shipdate were incorrectly predicted. This led to a sub-optimal query plan with a join order involving large network-heavy data redistributions among the compute resources of the Amazon Redshift provisioned cluster or serverless workgroup. With the new selectivity estimation technique, the prediction became much more accurate, leading to an optimal query plan with a join order that minimized the redistribution of results between join steps, resulting in a performance improvement shown in Figure 2.

Figure 2: Relative performance of TPC-H query variant (lower is better)

Figure 3: Query Plan comparison: Before enhancement (left), After enhancement (right)

Conclusion

In this post, we covered new performance optimizations in Redshift data warehouse query processing and how elastic histogram statistics help enhance selectivity estimation and the overall quality of query plans for Amazon Redshift data warehouse queries in the absence of fresh table statistics.

In summary, Amazon Redshift now offers enhanced query performance with optimizations such as Enhanced Histograms for Selectivity Estimation in the absence of fresh statistics by relying on metadata statistics gathered during ingestion. These optimizations are enabled by default and Amazon Redshift users will benefit with better query response times for their workloads. Amazon Redshift is on a mission to continuously improve performance and therefore overall price-performance. The new selectivity estimation enhancement has already improved the performance of hundreds of thousands of customer queries in the Amazon Redshift fleet since its introduction in the patch release P183. It’s worth noting that this is one of the many behind-the-scenes improvements we continually make to keep Redshift the industry leader in price-performance.

We invite you to try the numerous new features introduced in Amazon Redshift together with the new performance enhancements. For more information, reach out to your AWS account team to request a free consultation or a demo of Amazon Redshift. They will be happy to provide additional guidance and support on choosing the right analytics solution that meets your business needs.

About the authors

Roger Kim is a Software Development Engineer on the Amazon Redshift team focusing on query performance and optimization. He holds a BA in Computer Science and Mathematics from Cornell University.

Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an Individual Contributor and Engineering Manager. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Mengchu Cai is a principal engineer on the Amazon Redshift team. Mengchu currently works on query optimization and data lake query performance. He also led the development of SQL language features. Mengchu received his PhD in Computer Science and Engineering from the University of Nebraska Lincoln.

Ravi Animi is a Senior Product Leader on the Amazon Redshift team and manages several functional areas of Amazon Redshift analytics, data, and AI, including spatial analytics, streaming analytics, query performance, Spark integration, and analytics business strategy. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services, and more recently, as a startup founder in the areas of AI and deep learning. Ravi holds dual Bachelors degrees in Physics and Electrical Engineering from Washington University, St. Louis, a Masters in Engineering from Stanford, and an MBA from Chicago Booth.

AWS Big Data Blog