AWS Big Data Blog
Building high-quality benchmark tests for Amazon Redshift using SQLWorkbench and psql
In the introductory post of this series, we discussed benchmarking benefits and best practices common across different open-source benchmarking tools. In this post, we discuss benchmarking Amazon Redshift with the SQLWorkbench and psql open-source tools. Let’s start with a quick review of the introductory installment.
When you use Amazon Redshift to scale compute and storage independently, a need arises to profile the compute requirements of various production workloads so that your Amazon Redshift cluster configuration reflects an appropriately balanced compute layer. You also need an approach to scale up with eyes wide open, and benchmarking different Amazon Redshift cluster configurations against various production workloads can help you appropriately accommodate workload expansion. In addition, you may also use benchmark tests to proactively monitor a production cluster’s performance in real time.
For prospective Amazon Redshift customers, benchmarking Amazon Redshift is often one of the main components of their evaluation and a key source of insight into the price-to-performance ratio of different Amazon Redshift configurations.
Open-source tools, with their cost-efficiency and vendor neutrality, are often the preferred choice for profiling production workloads and benchmark tests. However, best practices for using these tools are scarce, possibly resulting in flawed compute profiles, flawed benchmark results, customer frustration, and bloated timelines.
One final point before we get started: there is a lot that could be said about benchmarking—more than can be accommodated in a single post. Analytics Specialists Solutions Architects such as myself frequently and happily engage with current and prospective customers to help you evaluate your benchmarking strategy and approach at no charge. I highly recommend you take advantage of that benefit by reaching out to your AWS account Solutions Architect.
SQLWorkbench
SQLWorkbench, also referred to as SQLWorkbench/J, is an open-source SQL query tool that you can freely download as a .zip file. It’s written in Java so it runs on Windows, Linux/UNIX, and macOS, and naturally requires a supported Java runtime environment (JRE). SQLWorkbench also requires a JDBC driver for the database (to download the latest Amazon Redshift JDBC driver, see Configuring a JDBC driver version 1.0 connection).
SQLWorkbench can run in GUI or console mode. I discuss both in this section, but in my experience, customers typically default to the GUI mode, so we explore that version first. Also, I have found that customers who use SQLWorkbench often use it in a Windows environment (something to keep in mind if your operating system influences which open-source tool you choose).
Typically, you stand up a Windows EC2 instance to serve as your benchmark host, and install SQLWorkbench on that machine. When you have SQLWorkbench running, setting up a connection to your Amazon Redshift cluster is quite easy. For this post, I assume you’re familiar with the basics of JDBC connections. The following screenshot shows what the SQLWorkbench connection dialog box might look like when populated with connection information.
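The connection URL follows the standard Amazon Redshift JDBC format; the cluster endpoint and database name below are placeholders for illustration:

    jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev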
After establishing a successful connection to your Amazon Redshift cluster, a query tab opens, in which you can write and run SQL queries similar to that shown in the following screenshot.
For benchmark tests, it’s highly recommended to set the maxrows field to a relatively low number to avoid noise from long transmission times of large result sets.
Unlike the LIMIT clause in a SQL SELECT statement, which can alter (short-circuit) Amazon Redshift query processing, setting the maxrows field (whether to a value as low as 1 or something much higher) has no impact on query processing in Amazon Redshift; maxrows affects only SQLWorkbench’s rendering workload and overhead. You can easily verify this by running the same query multiple times with different maxrows settings and observing that the number of rows returned for each query on the Amazon Redshift console query history page doesn’t change. The resulting query times should still be treated as client-observed runtimes, but they bring you considerably closer to a query’s actual execution time. Setting the maxrows field to a relatively low number also reduces the risk of SQLWorkbench running into an out-of-memory error on very large result sets.
This straightforward GUI interface is appealing because it has a minimal learning curve and lets you start submitting benchmark tests against your Amazon Redshift cluster quickly. SQLWorkbench is a very useful tool, and it may be a good fit for informal or simple benchmark tests that involve a handful of benchmark queries and relatively small tables (such as under 50 million rows in a fact table), and that focus more on determining the general directionality of query runtimes (for example, cluster A was faster than cluster B at running business query 123) than on capturing accurate query runtimes. The GUI interface can also be helpful for quickly and easily tweaking test queries to be more or less intense, or for correcting SQL syntax if the query originated on a different platform.
However, for more formal and complex benchmark tests that deal with large tables and must capture accurate query runtimes, SQLWorkbench’s straightforward GUI interface faces a scalability challenge: inputting potentially hundreds or thousands of benchmark queries, running them sequentially or simultaneously, and capturing their runtimes in a practical manner can prove to be a huge challenge.
In addition, SQLWorkbench’s rendering and processing times for query result sets are added to a query’s runtime, so even moderately sized result sets can introduce significant noise into query runtimes. For example, I recently observed a customer reduce their query runtimes by several orders of magnitude by switching to a command line tool while keeping all other aspects of their benchmark tests and environment constant. Some of the queries were straightforward filter queries with no joins, returning 400,000 rows from a 2 billion-row fact table with approximately 30 mostly integer columns.
Using console mode
One way to minimize the scale problem and rendering noise is to switch to SQLWorkbench console mode (the command line interface), which comes bundled with the GUI version of SQLWorkbench in the same downloadable .zip file.
In this section, we show one way to enter console mode from the Windows command line prompt (note the -showTiming=true flag, which prints query execution times on the screen) and connect to an Amazon Redshift cluster.
The following code starts SQLWorkbench in console mode:
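(The invocation below is representative; it assumes the SQLWorkbench .zip was extracted to c:\sqlworkbench and uses the 64-bit Windows console launcher bundled in the distribution. On Linux/UNIX or macOS, the equivalent launcher script is sqlwbconsole.sh.)

    c:\sqlworkbench> sqlwbconsole64.exe -showTiming=true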
When you’re in console mode, use the following command to connect to an Amazon Redshift cluster:
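(WbConnect is SQLWorkbench’s console-mode connection command; the endpoint, database, and credentials below are placeholders, and the driver class shown is for the Amazon Redshift JDBC 4.2 driver referenced earlier.)

    SQL> WbConnect -url=jdbc:redshift://<cluster-endpoint>:<port>/<database> -username=<user> -password=<password> -driver=com.amazon.redshift.jdbc42.Driver;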
Again, it’s recommended to set the maximum rows for result sets to a relatively low number, using the following command:
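(In console mode, the number of rows displayed is controlled with the set maxrows command; 100 is just an illustrative value.)

    SQL> set maxrows 100;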
Although console mode may have a slightly higher learning curve, it can significantly reduce potential rendering noise in a query’s runtime. In addition, SQLWorkbench’s console mode lends itself to scripting, which opens the door to many more sophisticated benchmarking scenarios, particularly when simulating concurrent users and capturing sophisticated metrics.
Comparing performance of SQLWorkbench modes
Let’s use an example use case to demonstrate the potential performance differences of both modes of SQLWorkbench. Although Example Corp is a hypothetical company, the use case is quite typical and realistic, and the benchmark results presented are based on actual customer experiences.
Example Corp has onboarded terabytes of data, over 100 ETL jobs, and thousands of business users to our Amazon Redshift deployment over the past quarter. Data architects and engineers have observed the Amazon Redshift cluster’s average CPU utilization steadily increase, and now wish to scale up the cluster before onboarding additional data, ETL jobs, and users waiting in the project pipeline.
To determine the optimal cluster size, we perform a few simple benchmark tests on different cluster configurations. We first identify five or so sufficiently complex production queries for benchmarking clusters of different sizes and instance types. We decide query runtime is a sufficient measure of the optimal cluster size, because we’re mainly interested in directional guidance (for example, query runtimes improved significantly with 1.5x cluster size, but only marginally with larger than 1.5x cluster sizes).
We can use an Amazon Redshift snapshot from our production cluster to quickly stand up a few differently configured clusters varying in node size or node type (such as ra3.4xl vs. ra3.16xl). We use a production snapshot to create the benchmark clusters so we can keep the cluster data identical.
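For example, with the AWS CLI, you can restore a benchmark cluster from the production snapshot while overriding the node type and node count (all identifiers below are placeholders; depending on the source and target configurations, you may instead need to restore with the original configuration and then resize):

    aws redshift restore-from-cluster-snapshot \
        --cluster-identifier benchmark-ra3-4xl \
        --snapshot-identifier prod-benchmark-snapshot \
        --node-type ra3.4xlarge \
        --number-of-nodes 4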
However, manually running the benchmark queries individually using the SQLWorkbench GUI shows query runtimes actually increased in most cases (compared to the original production cluster) despite the more powerful clusters! Upon a closer look, we realize internet transport noise has not been isolated from the query runtimes. We stand up a dedicated test EC2 machine in the same VPC and Availability Zone as our benchmark Amazon Redshift clusters and install a SQLWorkbench GUI client.
Running the benchmark queries using the SQLWorkbench GUI provides similar query runtimes as the original cluster configuration. Again, not what was expected. Upon switching to SQLWorkbench console mode, however, we observe an improvement in query runtimes by several orders of magnitude.
psql
In my experience, psql is the preferred open-source command line query tool for customers running in a Linux/UNIX environment, so in this post, I assume a Linux EC2 instance is being used to run psql. If the standard Amazon Linux AMI was chosen (usually the first one in the list) during EC2 creation, you can use the following commands to update and verify psql v9.2 on the Linux EC2 instance:
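(On the standard Amazon Linux AMI, the psql client is provided by the postgresql package; the exact version you get may differ slightly depending on the AMI’s package repositories.)

    sudo yum update -y
    sudo yum install -y postgresql
    psql --version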
Feel free to also search the freely available community AMIs, which might have newer versions of the PostgreSQL server and psql client preinstalled.
After psql is installed, connecting to an Amazon Redshift cluster is pretty straightforward: you specify just a few command line parameters. The standard Amazon Redshift port is 5439, but I use port 8192 in the following code because of certain firewall requirements in my environment:
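(The endpoint, database, and user below are placeholders; psql prompts for the password interactively, or you can supply it through the PGPASSWORD environment variable or a .pgpass file.)

    psql -h examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com -p 8192 -d dev -U awsuser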
The following screenshot shows our output.
After you connect to the Amazon Redshift cluster, be sure to run the \timing on command to enable query timing.
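With timing enabled, psql appends the elapsed time to each statement’s output. Here’s an illustrative session (the table name and figures are made up):

    dev=# \timing on
    Timing is on.
    dev=# select count(*) from sales;
      count
    ---------
     172456
    (1 row)

    Time: 438.104 ms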
It’s also highly recommended to set the FETCH_COUNT variable to a relatively low number on the psql console to avoid long transmission times for large result sets:
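(FETCH_COUNT is a standard psql variable; 100 is just an illustrative value.)

    \set FETCH_COUNT 100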
When this variable is set, psql uses database cursors and the FETCH command to retrieve query results. Setting it has no impact on query processing in Amazon Redshift; it only controls how many rows the client application retrieves at a time from the fully materialized result set.
Although the command line nature of psql may have a slightly higher learning curve than similar GUI applications, it also helps keep it lightweight and introduces minimal processing noise into a query’s runtime. For example, I observed a customer’s query runtime improve by several orders of magnitude by simply switching from a GUI tool to command line psql, while keeping all other aspects of the benchmark test and environment constant.
In addition, psql’s command line interface lends itself to scripting, which opens the door to many more sophisticated benchmarking scenarios, particularly when simulating concurrent users and capturing sophisticated concurrency metrics. In fact, a number of customizable, parameter-driven scripts have already been written by AWS Analytics Specialists such as myself for sophisticated compute and concurrency benchmarking scenarios, and are freely available to current and prospective customers.
Another utility that you can use in combination with such scripts is Simple Replay, a utility that is freely available on the Amazon Redshift Utilities GitHub repo. Simple Replay can extract workload histories from a source Amazon Redshift cluster and replay those workloads with high fidelity (using the psql command line client) on a different Amazon Redshift cluster, such as a benchmark test cluster.
For Simple Replay to extract workload details from an Amazon Redshift cluster, audit logging must be enabled for the cluster, and it may take about an hour for the most recent workloads to propagate to the audit logs.
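If audit logging isn’t already enabled, one way to turn it on is with the AWS CLI (the cluster identifier, bucket, and prefix below are placeholders):

    aws redshift enable-logging \
        --cluster-identifier examplecluster \
        --bucket-name example-audit-logs \
        --s3-key-prefix redshift-audit/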
After we run the extract command, Simple Replay extracts workload information such as the connection patterns (for example, number of users and their connection timing), COPY and UNLOAD commands, and other SQL queries so that they can be replayed on a different Amazon Redshift cluster with high fidelity (and, in our case, using psql command line as the SQL client). The following screenshot shows our output.
The workload details are typically stored in an Amazon Simple Storage Service (Amazon S3) bucket, which is specified in the Simple Replay configuration file, among other properties. See the following screenshot.
After running the python3 Extraction.py extraction.yaml command on the command line, we can review the workload details in our target S3 bucket to verify that the expected complexity was captured. The following screenshot shows the workload details on the Amazon S3 console.
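You can also spot-check the extracted artifacts from the command line (the bucket and prefix below are placeholders for whatever you specified in the configuration file):

    aws s3 ls s3://example-simple-replay-bucket/extracts/ --recursive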
The next step is to replay the extracted workload on a baseline cluster that mirrors our production cluster configuration (to establish a baseline runtime profile) and one or more target clusters using Simple Replay’s replay capability, as shown in the following screenshot.
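Assuming the replay step mirrors the extraction step’s usage (the script and configuration file names below simply follow that pattern; consult the Amazon Redshift Utilities GitHub repo for the exact invocation), the command looks like this:

    python3 Replay.py replay.yaml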
Now let’s take another look at the example scenario presented in the previous section to demonstrate using the psql command line client with Simple Replay. Again, Example Corp has onboarded terabytes of data, over 100 ETL jobs, and thousands of business users to our Amazon Redshift deployment over the past quarter. Data architects and engineers have observed the Amazon Redshift cluster’s average CPU utilization steadily increase, and now wish to scale the cluster up (again) before onboarding additional data, ETL jobs, and users waiting in the project pipeline.
To determine the optimal cluster size, we first use the Simple Replay utility to extract information on all concurrent workloads that occurred in the past 48 hours, from one-time user queries to BI reporting queries to ETL transformations. After we extract the information from the logs of the source Amazon Redshift cluster, we replay the same workloads on various benchmark cluster configurations. We may repeat this process for other timeframes in the past, such as month-end reporting or timeframes that exhibited unexpected workload spikes. The Example Corp team then observes the CPU utilization of each benchmark cluster configuration and chooses the cluster offering the best price-to-performance ratio.
For other capabilities and functionality in psql scripts, I recommend you reach out to your AWS account Solutions Architect to evaluate available benchmarking scripts in relation to your needs and perhaps avoid “reinventing the wheel.”
Conclusion
In this series of posts, we discussed a number of recommended best practices for conducting high-quality benchmark tests. Some of the best practices represented core principles that span all the open-source tools discussed (such as consistency in testing methodology). In this post, we reviewed the strengths and appropriateness of SQLWorkbench and psql for conducting benchmark tests. I hope this series has been helpful, and strongly encourage current and prospective customers to reach out to me or other AWS colleagues if you wish to delve deeper.
About the Author
Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.