AWS Database Blog
Monitoring Amazon DynamoDB for operational awareness
Amazon DynamoDB is a serverless database that takes care of the undifferentiated heavy lifting associated with operating and maintaining the infrastructure behind this distributed system. As a customer, you use APIs to capture operational data that you can use to monitor and operate your tables. This post describes a set of metrics to consider when building out your dashboards and alarms to operationalize DynamoDB.
You can use Amazon CloudWatch metrics published by DynamoDB to help you understand the interaction of your evolving workload with DynamoDB in the context of your data model. The metrics in this post fall into the following three categories, based on how you obtain them:
- Metrics that are provided out of the box with DynamoDB (noted as “Out of the Box”).
- Metrics that require computation via metric math (noted as “Requires metric math”).
- Metrics that must be self-published to Amazon CloudWatch using a custom AWS Lambda function.
As you move toward production, you can also get recommendations on achieving operational excellence with DynamoDB.
To download the code to publish the custom metrics that you need in this example, see the GitHub repo. The Lambda function for publishing the custom CloudWatch metrics accepts a number of environment variables for overriding default settings; check the README for details. At the time of publication of this post, these are:
- CLOUDWATCH_CUSTOM_NAMESPACE – By default, the AWS Lambda function publishes metrics to the “Custom_DynamoDB” namespace. If you’d like to change it, set the CLOUDWATCH_CUSTOM_NAMESPACE environment variable.
- DYNAMODB_ACCOUNT_TABLE_LIMIT – By default, the AWS Lambda function assumes your DynamoDB account table limit is 256. There is no API call to determine your account table limit, so if you’ve asked AWS to increase this limit for your account, you must set DYNAMODB_ACCOUNT_TABLE_LIMIT to that value for the AWS Lambda function to calculate the AccountTableLimitPct custom metric properly, as shown in the example following this list.
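For example, if you deploy the function with AWS CloudFormation, you can override these settings in the function’s Environment block. The following fragment is a minimal sketch (the resource name and the limit of 512 are only illustrations, and the function’s Code, Handler, Role, and Runtime properties come from the repo’s template, so they are omitted here):

```yaml
CustomMetricsPublisherFunction:
  Type: AWS::Lambda::Function
  Properties:
    # Code, Handler, Role, and Runtime omitted; see the template in the GitHub repo
    Environment:
      Variables:
        CLOUDWATCH_CUSTOM_NAMESPACE: Custom_DynamoDB   # default namespace; change if desired
        DYNAMODB_ACCOUNT_TABLE_LIMIT: "512"            # set to your raised account table limit
```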
The AWS CloudFormation examples in this post assume that an SNS topic, referred to as `DynamoDBMonitoringSNSTopic`, exists for alarms to send notifications to. They also assume that the template contains parameters such as `DynamoDBProvisionedTableName`, `DynamoDBOnDemandTableName`, `DynamoDBGlobalTableName`, and `DynamoDBGlobalTableReceivingRegion`. Additionally, the global secondary indexes (GSIs) are named the same as the table, but with `-gsi1` added; for example, `dynamodb-monitoring-gsi1`.
The alarm thresholds provided in each section are recommended starting points, which you can adjust based on your requirements and workload patterns.
Metrics for each account and Region
There are a few account-level metrics for each AWS Region within an account that you should monitor. These are particularly important if you have multiple teams deploying DynamoDB tables into the same account: one team’s change can affect another team’s tables’ ability to auto scale, for example, and the account administrator might need to take action to raise the account’s limits. The following table summarizes the DynamoDB metrics and recommended alarm configurations for each Region in your AWS account.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Percentage of account-limit read provisioned capacity allocated | `AccountProvisionedReadCapacityUtilization` | MAX > 80% | Out of the box |
| Percentage of account-limit write provisioned capacity allocated | `AccountProvisionedWriteCapacityUtilization` | MAX > 80% | Out of the box |
| Percentage of read provisioned capacity used by the highest read provisioned table of an account | `MaxProvisionedTableReadCapacityUtilization` | MAX > 80% | Out of the box |
| Percentage of write provisioned capacity used by the highest write provisioned table of an account | `MaxProvisionedTableWriteCapacityUtilization` | MAX > 80% | Out of the box |
| Percentage of table count limit in use | `AccountTableLimitPct` | > 80% | Requires custom AWS Lambda function |
The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metrics:
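The snippet below is a representative sketch of such a template resource; the 5-minute period and two evaluation periods are illustrative choices, so tune them to your workload:

```yaml
AccountProvisionedReadCapacityAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert when allocated provisioned read capacity approaches the
      account limit for this Region
    Namespace: AWS/DynamoDB
    MetricName: AccountProvisionedReadCapacityUtilization
    # This is an account-level metric, so no dimensions are needed
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 2
    Threshold: 80
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```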
To use the DynamoDB console to create the alarm, complete the following steps:
- On the DynamoDB console, choose Tables.
- Within a table, choose Metrics.
- Choose Create Alarm.
The following screenshot shows the Create Alarm section. For more information, see Creating CloudWatch Alarms to Monitor DynamoDB.
Metrics for each table and GSI
Some metrics need monitoring and alerts for every table and GSI. For example, sustained heavy throttling might indicate a schema design issue, a table misconfigured without auto scaling, or auto scaling limits set too low. Such issues might need intervention, and resolving them can require AWS configuration changes or application code changes. Amazon CloudWatch Contributor Insights for DynamoDB can help you explore whether frequently accessed items are causing sustained throttling.
The following table summarizes the DynamoDB metrics and recommended alarm configurations for each DynamoDB table and GSI, regardless of billing mode.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Sustained read throttling | Sample Count of `ReadThrottleEvents` / Sample Count of `ConsumedReadCapacityUnits` | > 2% | Requires metric math |
| Sustained write throttling | Sample Count of `WriteThrottleEvents` / Sample Count of `ConsumedWriteCapacityUnits` | > 2% | Requires metric math |
| Sustained significant elevation of system errors | Sample Count of `SystemErrors` / (Sample Count of `ConsumedReadCapacityUnits` + Sample Count of `ConsumedWriteCapacityUnits`) | > 2% | Requires metric math |
| Sustained significant elevation of user errors | Sample Count of `UserErrors` / (Sample Count of `ConsumedReadCapacityUnits` + Sample Count of `ConsumedWriteCapacityUnits`) | > 2% | Requires metric math |
| Sustained significant elevation of condition check errors (optional) | `ConditionalCheckFailedRequests` | SUM > 100 | Out of the box |
| Sustained significant elevation of transaction conflicts (optional) | `TransactionConflict` | SUM > 100 | Out of the box |
The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metrics. This example uses metric math and alarms on read throttling for a GSI instead of the base table, to show how GSI dimensions work. To scale the ratio of throttled events to total read events to the range [0, 100], the code multiplies it by 100.
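The sketch below shows the general shape of such a metric math alarm; the one-minute period and two evaluation periods are illustrative:

```yaml
GSIReadThrottlePctAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert on sustained read throttling for the GSI (throttled reads as a
      percentage of total read requests)
    Metrics:
      # Scale the throttle ratio to the range [0, 100]
      - Id: readThrottlePct
        Expression: (throttles / requests) * 100
        Label: ReadThrottleEventsPct
      - Id: throttles
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/DynamoDB
            MetricName: ReadThrottleEvents
            Dimensions:
              - Name: TableName
                Value: !Ref DynamoDBProvisionedTableName
              - Name: GlobalSecondaryIndexName
                Value: !Sub '${DynamoDBProvisionedTableName}-gsi1'
          Period: 60
          Stat: SampleCount
      - Id: requests
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/DynamoDB
            MetricName: ConsumedReadCapacityUnits
            Dimensions:
              - Name: TableName
                Value: !Ref DynamoDBProvisionedTableName
              - Name: GlobalSecondaryIndexName
                Value: !Sub '${DynamoDBProvisionedTableName}-gsi1'
          Period: 60
          Stat: SampleCount
    EvaluationPeriods: 2
    Threshold: 2
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```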
Metrics for each provisioned throughput table and GSI
As a best practice, you should enable DynamoDB auto scaling on any table using provisioned throughput (the `PROVISIONED` billing mode), for both the base table and all GSIs. Doing so can reduce costs by scaling down during times of low usage, and it minimizes throttling due to under-provisioning during unanticipated load peaks.
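For reference, registering a table’s read capacity with auto scaling in AWS CloudFormation looks roughly like the following sketch; the capacity bounds and the 70 percent target utilization are placeholders to adjust for your workload:

```yaml
TableReadCapacityScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: dynamodb
    ResourceId: !Sub 'table/${DynamoDBProvisionedTableName}'
    ScalableDimension: dynamodb:table:ReadCapacityUnits
    MinCapacity: 5          # placeholder lower bound
    MaxCapacity: 1000       # placeholder ceiling; the alarms below watch this maximum
    # Service-linked role that Application Auto Scaling uses for DynamoDB
    RoleARN: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/aws-service-role/dynamodb.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_DynamoDBTable'

TableReadCapacityScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: TableReadCapacityScalingPolicy
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref TableReadCapacityScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: DynamoDBReadCapacityUtilization
      TargetValue: 70.0     # placeholder target utilization percentage
```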
The following table shows table and GSI metrics that are scaled either as a percentage of the table’s provisioned throughput settings, or as a percentage of the auto scaling maximums. When a table approaches the configured maximum, you receive an alert so you can increase the maximum or investigate the unusual level of application load.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Percentage utilization of auto scaling read maximum | `ProvisionedReadCapacityAutoScalingPct` | > 90% | Requires custom AWS Lambda function |
| Percentage utilization of auto scaling write maximum | `ProvisionedWriteCapacityAutoScalingPct` | > 90% | Requires custom AWS Lambda function |
The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metric. This metric is based on custom metrics published from an AWS Lambda function.
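The sketch below assumes the function publishes the metric to the default `Custom_DynamoDB` namespace with a `TableName` dimension; check the repo’s README for the exact namespace and dimensions it uses:

```yaml
ReadAutoScalingPctAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert when consumed read capacity approaches the auto scaling maximum
    Namespace: Custom_DynamoDB           # default namespace from the Lambda function
    MetricName: ProvisionedReadCapacityAutoScalingPct
    Dimensions:
      - Name: TableName                  # assumed dimension; confirm against the repo
        Value: !Ref DynamoDBProvisionedTableName
    Statistic: Maximum
    Period: 300
    EvaluationPeriods: 2
    Threshold: 90
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```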
Metrics for each on-demand capacity table and GSI
Tables using on-demand capacity mode (the `PAY_PER_REQUEST` billing mode) have less to monitor because there are no capacity settings for you to increase or decrease. The primary concern is whether the table is approaching the account’s maximum limits for table-level reads and writes. The following table summarizes the DynamoDB metrics and recommended alarm configurations for each DynamoDB table and GSI using the `PAY_PER_REQUEST` billing mode.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Read consumption as a percentage of the table limit | SUM of `ConsumedReadCapacityUnits` / MAXIMUM of `AccountMaxTableLevelReads` | > 90% | Requires metric math |
| Write consumption as a percentage of the table limit | SUM of `ConsumedWriteCapacityUnits` / MAXIMUM of `AccountMaxTableLevelWrites` | > 90% | Requires metric math |
The following code is an example AWS CloudFormation template for the first metric in the preceding table, which you can modify for the other metric. The `ConsumedCapacity` metrics represent requests sent per second, accumulated over a minute. Because the `AccountMaxTableLevelReads` and `AccountMaxTableLevelWrites` metrics represent requests per second, you must scale them in the metric math expression to keep the value in the range [0, 100].
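The sketch below shows one way to express that scaling in metric math; the one-minute period and two evaluation periods are illustrative:

```yaml
TableReadLimitPctAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert when read consumption approaches the table-level read limit
    Metrics:
      # ConsumedReadCapacityUnits is summed over 60 seconds, while
      # AccountMaxTableLevelReads is a per-second limit, so multiply the
      # limit by the period to compare like with like
      - Id: readLimitPct
        Expression: (consumed / (limit * 60)) * 100
        Label: TableReadLimitPct
      - Id: consumed
        ReturnData: false
        MetricStat:
          Metric:
            Namespace: AWS/DynamoDB
            MetricName: ConsumedReadCapacityUnits
            Dimensions:
              - Name: TableName
                Value: !Ref DynamoDBOnDemandTableName
          Period: 60
          Stat: Sum
      - Id: limit
        ReturnData: false
        MetricStat:
          Metric:
            # Account-level metric, so no dimensions are needed
            Namespace: AWS/DynamoDB
            MetricName: AccountMaxTableLevelReads
          Period: 60
          Stat: Maximum
    EvaluationPeriods: 2
    Threshold: 90
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```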
Monitoring for DynamoDB global tables
DynamoDB global tables replicate data between tables in different Regions in a fully managed, multi-master fashion. With global tables using provisioned throughput, you must provision the same WCU settings across all the table replicas. Not doing so may result in a replica in one Region falling behind in replicating changes from another Region, which could cause that replica to diverge from the others. If your tables use auto scaling, all the replica tables should have the same auto scaling settings for a consistent experience. Tables using on-demand throughput don’t have this concern.
It’s useful to know the replication latency for each AWS Region and to alert if that latency increases continually. A continual increase might indicate an accidental misconfiguration in which the global table has different WCU settings in different Regions, which leads to failed replicated requests and increased latencies. It could also indicate a Regional disruption. The actual latency depends on which Regions are involved (how geographically dispersed they are) and is subject to some amount of Regional fluctuation. Replication latencies longer than 3 minutes are generally cause for investigation; however, you should pick a number that makes sense for your use case and requirements.
The following table summarizes the DynamoDB global tables metrics and recommended alarm configurations for each of your global tables. You want to configure the alarm and dashboard in each Region participating in the global table.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Elevated replication latency between two Regions | `ReplicationLatency` | AVERAGE > 180,000 milliseconds (3 minutes) | Out of the box |
The following code is an example AWS CloudFormation template for the metric in the preceding table. This alarm requires you to specify the receiving Region (referred to by the `DynamoDBGlobalTableReceivingRegion` parameter) for which you want to measure the replication latency. If your global table has more than two participating Regions, you must set up multiple alarms in each Region.
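A minimal sketch, with an illustrative one-minute period evaluated over three periods:

```yaml
ReplicationLatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert on elevated replication latency to the receiving Region
    Namespace: AWS/DynamoDB
    MetricName: ReplicationLatency
    Dimensions:
      - Name: TableName
        Value: !Ref DynamoDBGlobalTableName
      - Name: ReceivingRegion
        Value: !Ref DynamoDBGlobalTableReceivingRegion
    Statistic: Average
    Period: 60
    EvaluationPeriods: 3
    Threshold: 180000        # milliseconds (3 minutes)
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```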
AWS Lambda users of DynamoDB Streams
Users who create Lambda functions triggered by changes on a DynamoDB table should generate alerts when records sit too long on the DynamoDB stream without being processed by the Lambda function. This could be evidence of a defect in the Lambda function (such as an unhandled exception), or of a function that can’t handle events quickly enough and therefore causes an ever-deepening queue. In a system that is optimized and performing well, DynamoDB Streams events are handled within a few seconds. The following table summarizes the Lambda metrics and recommended alarm configurations for each of your Lambda functions that are triggered by DynamoDB Streams events.
| Description | Metric | Alarm config | Notes |
| --- | --- | --- | --- |
| Elevated age of events on the DynamoDB stream | `IteratorAge` | > 30,000 milliseconds (30 seconds) | Out of the box |
The following code is an example AWS CloudFormation template for the preceding metric. This alarm is based on your Lambda function name (referred to by the `DynamoDBStreamLambdaFunctionName` parameter).
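A minimal sketch, again with illustrative period and evaluation settings:

```yaml
StreamIteratorAgeAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: >-
      Alert when DynamoDB Streams records wait too long to be processed
    Namespace: AWS/Lambda
    MetricName: IteratorAge
    Dimensions:
      - Name: FunctionName
        Value: !Ref DynamoDBStreamLambdaFunctionName
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 3
    Threshold: 30000         # milliseconds (30 seconds)
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref DynamoDBMonitoringSNSTopic
```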
Conclusion
Though there are many more metrics that you could monitor and receive alerts on, this post gives you a good starting point on your path to operationalizing DynamoDB.
Want more Amazon DynamoDB how-to content, news, and feature announcements? Follow us on Twitter.
About the Authors
Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.
Pete Naylor is a DynamoDB Specialist Solutions Architect based in Seattle. Prior to this, he was a Technical Account Manager supporting Amazon as a customer of AWS, with a focus on database migrations and operational excellence at scale. His career background is systems engineering for high availability in geographically diverse tier 1 workloads.
Pratik Agarwal is a Software Development Engineer for Amazon DynamoDB who works on the resource governance team. He focuses primarily on IOPS management, which includes DynamoDB auto scaling, adaptive capacity, and on-demand capacity mode.
Ankur Kasliwal is a Technical Program Manager for Amazon DynamoDB. He helps innovate and simplify project development and delivers results effectively and efficiently for customers. He also provides architectural guidance on AWS services to internal and external customers, with a deep focus on solutions using Amazon DynamoDB.