AWS Big Data Blog
Catalog and analyze Application Load Balancer logs more efficiently with AWS Glue custom classifiers and Amazon Athena
You can query Application Load Balancer (ALB) access logs for various purposes, such as analyzing traffic distribution and patterns. You can also easily use Amazon Athena to create a table and query the ALB access logs on Amazon Simple Storage Service (Amazon S3). (For more information, see How do I analyze my Application Load Balancer access logs using Amazon Athena? and Querying Application Load Balancer Logs.) Because that table doesn’t define any partitions, every query scans the whole table. If you have several years of ALB logs, you may want to use a partitioned table instead for better query performance and cost control. In fact, partitioning data is one of the Top 10 performance tuning tips for Athena.
However, because ALB log files aren’t stored under Hive-style prefixes (such as /year=2021/), creating thousands of partitions with ALTER TABLE ADD PARTITION in Athena is cumbersome. This post shows how to create and schedule an AWS Glue crawler with a Grok custom classifier that infers the schema of all ALB log files under the specified Amazon S3 prefix and automatically populates the partition metadata (year, month, and day) in the AWS Glue Data Catalog.
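For reference, manually registering a single day of logs in a partitioned table would look something like the following (the table, bucket, and prefix names are placeholders), and you would need to repeat or script a statement like this for every day of logs:

```
ALTER TABLE alb_logs ADD IF NOT EXISTS
  PARTITION (year = '2021', month = '07', day = '28')
  LOCATION 's3://alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/2021/07/28/';
```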
Prerequisites
To follow along with this post, complete the following prerequisites:
- Enable access logging for your ALBs, and confirm the log files are already delivered to the specified S3 bucket.
- Set up the Athena query result location. For more information, see Working with Query Results, Output Files, and Query History.
Solution overview
The following diagram illustrates the solution architecture.
To implement this solution, we complete the following steps:
- Prepare the Grok pattern for our ALB logs, and cross-check with a Grok debugger.
- Create an AWS Glue crawler with a Grok custom classifier.
- Run the crawler to prepare a table with partitions in the Data Catalog.
- Analyze the partitioned data using Athena and compare query speed vs. a non-partitioned table.
Prepare the Grok pattern for our ALB logs
As a preliminary step, locate the access log files on the Amazon S3 console and manually inspect a few of them to observe the format and syntax. For an AWS Glue crawler to recognize the format, we need a Grok pattern that matches each log line and maps specific parts into the corresponding fields. Approximately 100 sample Grok patterns are available in the Logstash Plugins GitHub repository, and we can write our own custom pattern if the one we need isn’t listed.
The basic syntax of a Grok pattern is %{PATTERN:FieldName}.
The following is an example of an ALB access log:
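(This HTTP entry is adapted from the Elastic Load Balancing documentation; your own entries will differ in values and may include additional trailing fields.)

```
http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354" "-" "-" 0 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.1:80" "200" "-" "-"
```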
To map the first field, the Grok pattern might look like the following code:
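(The first field in the sample entry above is the request type, such as http or https:)

```
%{DATA:type}\s
```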
The pattern includes the following components:
- DATA maps to .*?
- type is the column name
- \s is the whitespace character
To map the second field, the Grok pattern might look like the following:
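(The second field is the request timestamp:)

```
%{TIMESTAMP_ISO8601:time}\s
```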
This pattern has the following elements:
- TIMESTAMP_ISO8601 maps to %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
- time is the column name
- \s is the whitespace character
When writing Grok patterns, we should also consider corner cases. For example, the following code is a normal case:
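(As an illustration, using the ALB target_status_code field, which is normally numeric:)

```
%{INT:target_status_code}\s
```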
But when considering the possibility of a null value, we should replace the pattern with the following:
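(Continuing the same illustration: DATA also matches the literal - that ALB writes when no target handled the request, at the cost of typing the column as a string.)

```
%{DATA:target_status_code}\s
```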
When our Grok pattern is ready, we can test the Grok pattern with sample input using a third-party Grok debugger. The following pattern is a good start, but always remember to test it with the actual ALB logs.
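One possible sketch uses permissive DATA matches for the leading fields and collects everything after the user agent into a single column; the field names follow the ALB log format, and you should extend the pattern to split out any trailing fields you need:

```
%{DATA:type}\s+%{TIMESTAMP_ISO8601:time}\s+%{DATA:elb}\s+%{DATA:client}\s+%{DATA:target}\s+%{DATA:request_processing_time}\s+%{DATA:target_processing_time}\s+%{DATA:response_processing_time}\s+%{DATA:elb_status_code}\s+%{DATA:target_status_code}\s+%{DATA:received_bytes}\s+%{DATA:sent_bytes}\s+"%{DATA:request}"\s+"%{DATA:user_agent}"\s+%{GREEDYDATA:others}
```

Using DATA instead of INT or NUMBER for the numeric fields sidesteps the null-value issue described above, at the cost of string-typed columns.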
Keep in mind that when you copy the Grok pattern from your browser, there may be extra spaces at the end of the lines. Make sure to remove them.
Create an AWS Glue crawler with a Grok custom classifier
Before you create your crawler, you first create a custom classifier. Complete the following steps:
- On the AWS Glue console, under Crawlers, choose Classifiers.
- Choose Add classifier.
- For Classifier name, enter alb-logs-classifier.
- For Classifier type, select Grok.
- For Classification, enter alb-logs.
- For Grok pattern, enter the pattern from the previous section.
- Choose Create.
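If you prefer the AWS CLI, a roughly equivalent call looks like the following (replace the placeholder with your actual Grok pattern):

```
aws glue create-classifier \
  --grok-classifier '{
    "Name": "alb-logs-classifier",
    "Classification": "alb-logs",
    "GrokPattern": "<your Grok pattern from the previous section>"
  }'
```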
Now you can create your crawler.
- Choose Crawlers in the navigation pane.
- Choose Add crawler.
- For Crawler name, enter alb-access-log-crawler.
- For Selected classifiers, enter alb-logs-classifier.
- Choose Next.
- For Crawler source type, select Data stores.
- For Repeat crawls of S3 data stores, select Crawl new folders only.
- Choose Next.
- For Choose a data store, choose S3.
- For Crawl data in, select Specified path in my account.
- For Include path, enter the path to your ALB logs (for example, s3://alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/).
- Choose Next.
- When prompted to add another data store, select No and choose Next.
- Select Create an IAM role, and give it a name such as AWSGlueServiceRole-alb-logs-crawler.
- For Frequency, choose Daily.
- Indicate your start hour and minute.
- Choose Next.
- For Database, enter elb-access-log-db.
- For Prefix added to tables, enter alb_logs_.
- Expand Configuration options.
- Select Update all new and existing partitions with metadata from the table.
- Keep the other options at their defaults.
- Choose Next.
- Review your settings and choose Finish.
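The same crawler can also be sketched with the AWS CLI; the role, include path, and schedule below are placeholders to adapt, and the InheritFromTable setting corresponds to the partition update option selected above:

```
aws glue create-crawler \
  --name alb-access-log-crawler \
  --role AWSGlueServiceRole-alb-logs-crawler \
  --database-name elb-access-log-db \
  --table-prefix alb_logs_ \
  --classifiers alb-logs-classifier \
  --targets '{"S3Targets":[{"Path":"s3://alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/"}]}' \
  --recrawl-policy '{"RecrawlBehavior":"CRAWL_NEW_FOLDERS_ONLY"}' \
  --schedule 'cron(0 2 * * ? *)' \
  --configuration '{"Version":1.0,"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
```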
Run your AWS Glue crawler
Next, we run our crawler to prepare a table with partitions in the Data Catalog.
- On the AWS Glue console, choose Crawlers.
- Select the crawler we just created.
- Choose Run crawler.
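Alternatively, you can start it from the AWS CLI:

```
aws glue start-crawler --name alb-access-log-crawler
```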
When the crawler run is complete, you receive a notification indicating that a table has been created.
Next, we review and edit the schema.
- Under Databases, choose Tables.
- Choose the table alb_logs_<region>.
- Cross-check the column names and corresponding data types.
The table has three partition columns: partition_0, partition_1, and partition_2, which correspond to year, month, and day. Rename them accordingly so the partition keys are easier to query.
Analyze the data using Athena
Next, we analyze our data by querying the access logs. We compare the query speed between the following tables:
- Non-partitioned table – All data is treated as a single table
- Partitioned table – Data is partitioned by year, month, and day
Query the non-partitioned table
With the non-partitioned table, if we want to query access logs on a specific date, we have to write the WHERE clause using the LIKE operator because the time column was interpreted as a string. See the following code:
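(The table name and date are illustrative; this assumes the non-partitioned table created from the Athena documentation linked earlier.)

```
SELECT count(*) AS request_count
FROM alb_logs
WHERE time LIKE '2021-07-28%';
```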
The query takes 5.25 seconds to complete, with 3.15 MB data scanned.
Query the partitioned table
With the year, month, and day columns as partitions, we can use the following statement to query access logs on the same day:
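(The database name, table name, and date are illustrative; the crawled partition columns are strings, so the values are quoted.)

```
SELECT count(*) AS request_count
FROM "elb-access-log-db".alb_logs_us_east_1
WHERE year = '2021'
  AND month = '07'
  AND day = '28';
```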
This time the query takes only 1.89 seconds to complete, with 25.72 KB data scanned.
This query is faster and costs less (because less data is scanned) due to partition pruning.
Clean up
To avoid incurring future charges, delete the resources created in the Data Catalog, and delete the AWS Glue crawler.
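For example, with the AWS CLI (the resource names match the ones used earlier in this post):

```
aws glue delete-crawler --name alb-access-log-crawler
aws glue delete-classifier --name alb-logs-classifier
aws glue delete-database --name elb-access-log-db
```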
Summary
In this post, we illustrated how to create an AWS Glue crawler that automatically populates ALB log metadata in the AWS Glue Data Catalog with partitions by year, month, and day. With partition pruning, we can improve query performance and reduce cost in Athena.
If you have questions or suggestions, please leave a comment.
About the Authors
Ray Wang is a Solutions Architect at AWS. With 8 years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in big data and machine learning. As a hungry go-getter, he has earned all 12 AWS certifications, making his technical expertise both broad and deep. He loves to read and watch sci-fi movies in his spare time.
Corvus Lee is a Data Lab Solutions Architect at AWS. He enjoys all kinds of data-related discussions with customers, from high-level topics like whiteboarding a data lake architecture to the details of data modeling, writing Python/Spark code for data processing, and more.