AWS Compute Blog
Using Amazon SQS dead-letter queues to replay messages
This post is courtesy of Alexandre Pinhel, Specialist SA Manager, in collaboration with Guillaume Marchand and Luke Hargreaves, Solutions Architects.
Amazon Simple Queue Service (Amazon SQS) is a fully managed message queuing service. It enables you to decouple and scale microservices, distributed systems, and serverless applications. A commonly used feature of Amazon SQS is the dead-letter queue (DLQ), which stores messages that can’t be processed (consumed) successfully.
This post describes how to add automated resilience to an existing SQS queue. The solution monitors the dead-letter queue and moves each message back to the main queue to see if it can be processed again. It uses an exponential backoff algorithm and a retry limit to make sure this is not repeated forever. Each time it attempts to reprocess the message, the replay delay increases, until the message is finally considered dead.
I use Amazon SQS dead-letter queues, AWS Lambda, and an exponential backoff algorithm to decrease the rate of retries for failed messages. I then package and publish this serverless solution in the AWS Serverless Application Repository.
Dead-letter queues and message replay
The main task of a dead-letter queue (DLQ) is to handle message failure. It allows you to set aside and isolate non-processed messages to determine why processing failed. Often these failed messages are caused by application errors. For example, a consumer application fails to parse a message correctly and throws an unhandled exception. This exception then triggers an error response that sends the message to the DLQ. The AWS documentation contains a tutorial detailing the configuration of an Amazon SQS dead-letter queue.
To process the failed messages, I build a retry mechanism by implementing an exponential backoff algorithm. The idea behind exponential backoff is to use progressively longer waits between retries for consecutive error responses. Most exponential backoff algorithms use jitter (randomized delay) to prevent successive collisions. This spreads the message retries more evenly across time, allowing them to be processed more efficiently.
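As a rough illustration of exponential backoff with full jitter, the sketch below draws a random delay between zero and an exponentially growing ceiling. The function names, the formula `base * 2**attempt`, and the default values are assumptions made for this example, not the code shipped with the application. The 900-second cap reflects the maximum delay that an SQS message timer accepts.

```python
import random

# SQS message timers (DelaySeconds) accept at most 900 seconds (15 minutes).
SQS_MAX_DELAY_SECONDS = 900


def backoff_ceiling(attempt, base=2.0, cap=SQS_MAX_DELAY_SECONDS):
    """Upper bound of the wait for an attempt: min(cap, base * 2**attempt).

    The formula is an assumption for this sketch.
    """
    return min(cap, base * (2 ** attempt))


def full_jitter_delay(attempt, base=2.0, cap=SQS_MAX_DELAY_SECONDS, rng=random):
    """Full jitter: pick a uniform random delay between 0 and the ceiling."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))


if __name__ == "__main__":
    for attempt in range(1, 6):
        ceiling = backoff_ceiling(attempt)
        delay = full_jitter_delay(attempt)
        print(f"attempt {attempt}: ceiling {ceiling:.0f}s, delay {delay:.1f}s")
```

Because each delay is drawn at random, messages that failed together are unlikely to be replayed at the same moment, while the growing ceiling keeps later attempts further apart on average.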
Solution overview
The flow of the message sent by the producer to SQS is as follows:
- The producer application sends a message to an SQS queue.
- The consumer application fails to process the message from that SQS queue.
- The message is moved from the main SQS queue to the first dead-letter queue, according to the main queue's redrive policy.
- A Lambda function is configured with this first dead-letter queue as an event source. It receives the message and sends it back to the original queue, adding a message timer.
- The message timer is defined by the exponential backoff and jitter algorithm.
- You can limit the number of retries. If the message exceeds this limit, the message is moved to a second DLQ where an operator processes it manually.
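The two hops in this flow (main queue to the first DLQ, then first DLQ to the operator's DLQ) can each be expressed as an SQS redrive policy. The sketch below only builds the `RedrivePolicy` attribute values. The helper name, the queue ARNs, and the receive counts are hypothetical, and wiring the second hop through a redrive policy on the first DLQ is an assumption about how the application is set up.

```python
import json


def redrive_policy(dead_letter_queue_arn, max_receive_count):
    """Build the RedrivePolicy queue attribute as the JSON string SQS expects."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return json.dumps({
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": str(max_receive_count),
    })


# Hypothetical ARNs, for illustration only.
REPLAY_DLQ_ARN = "arn:aws:sqs:eu-west-1:123456789012:replay-dlq"
OPERATOR_DLQ_ARN = "arn:aws:sqs:eu-west-1:123456789012:operator-dlq"

# Main queue: after 3 failed receives, SQS moves the message to the replay DLQ.
main_queue_attributes = {"RedrivePolicy": redrive_policy(REPLAY_DLQ_ARN, 3)}

# Replay DLQ (assumption): when the replay function fails, SQS moves the
# message on to the operator's DLQ.
replay_dlq_attributes = {"RedrivePolicy": redrive_policy(OPERATOR_DLQ_ARN, 1)}
```

These dictionaries have the shape of the `Attributes` argument that the boto3 SQS client's `create_queue` and `set_queue_attributes` calls accept.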
How the replay function works
Each time the SQS dead-letter queue receives a message, it triggers Lambda to run the replay function. The replay code uses an SQS message attribute, `sqs-dlq-replay-nb`, as a persistent counter for the number of retries attempted so far. The function compares this counter to the maximum number of attempts (defined in the application configuration file). If the counter exceeds the maximum, the function raises an error and the message is moved to the human-operated queue. If not, the function uses the AWS Lambda event data to build a new message for the Amazon SQS main queue. Finally, it updates the retry counter, adds a new message timer to the message, and sends (replays) the message back to the main queue.
def handler(event, context):
    """Lambda function handler."""
    for record in event['Records']:
        nbReplay = 0
        # number of replay
        if 'sqs-dlq-replay-nb' in record['messageAttributes']:
            nbReplay = int(record['messageAttributes']['sqs-dlq-replay-nb']["stringValue"])
        nbReplay += 1
        if nbReplay > config.MAX_ATTEMPS:
            raise MaxAttempsError(replay=nbReplay, max=config.MAX_ATTEMPS)
        # SQS attributes
        attributes = record['messageAttributes']
        attributes.update({'sqs-dlq-replay-nb': {'StringValue': str(nbReplay), 'DataType': 'Number'}})
        _sqs_attributes_cleaner(attributes)
        # Backoff
        b = backoff.ExpoBackoffFullJitter(base=config.BACKOFF_RATE, cap=config.MESSAGE_RETENTION_PERIOD)
        delaySeconds = b.backoff(n=int(nbReplay))
        # SQS
        SQS.send_message(
            QueueUrl=config.SQS_MAIN_URL,
            MessageBody=record['body'],
            DelaySeconds=int(delaySeconds),
            MessageAttributes=record['messageAttributes']
        )
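The handler relies on a helper, `_sqs_attributes_cleaner`, whose body is not shown here. One thing such a step has to deal with is that the Lambda event delivers message attributes with keys such as `stringValue` and `dataType`, while `send_message` expects `StringValue` and `DataType`. The following is a minimal sketch of that conversion together with the counter update. The function names are hypothetical, and binary attributes are deliberately left out for brevity.

```python
COUNTER_NAME = "sqs-dlq-replay-nb"


def next_replay_count(record, counter_name=COUNTER_NAME):
    """Read the retry counter from a Lambda SQS record and add one."""
    attribute = record.get("messageAttributes", {}).get(counter_name)
    previous = int(attribute["stringValue"]) if attribute else 0
    return previous + 1


def to_send_message_attributes(event_attributes):
    """Convert Lambda-event attributes to the shape send_message expects.

    Only attributes carrying a string value are copied; binary attributes
    are skipped in this sketch.
    """
    converted = {}
    for name, attribute in event_attributes.items():
        if attribute.get("stringValue") is None:
            continue
        converted[name] = {
            "DataType": attribute["dataType"],
            "StringValue": attribute["stringValue"],
        }
    return converted


# Example record, trimmed to the fields used here.
record = {
    "body": "order-42",
    "messageAttributes": {
        "sqs-dlq-replay-nb": {
            "stringValue": "1",
            "stringListValues": [],
            "binaryListValues": [],
            "dataType": "Number",
        },
        "source": {"stringValue": "checkout", "dataType": "String"},
    },
}

count = next_replay_count(record)
attributes = to_send_message_attributes(record["messageAttributes"])
attributes[COUNTER_NAME] = {"StringValue": str(count), "DataType": "Number"}
```

Building a new dictionary, rather than mutating the event record in place, keeps the incoming event untouched, which makes the function easier to test.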
How to use the application
You can use this serverless application via:
- The Lambda console: choose the “Browse serverless app repository” option to create a function. Select the “amazon-sqs-dlq-replay-backoff” application in the public applications repository. Then, configure the application with the default SQS parameters and the replay feature parameters.
- The Serverless Framework, as described by Yan Cui in this blog post.
- An AWS CloudFormation template, by using the `AWS::Serverless::Application` resource, as described in the documentation.
Here is an example of a CloudFormation template using the AWS Serverless Application Repository application:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  ReplaySqsQueue:
    Type: AWS::Serverless::Application
    Properties:
      Location:
        ApplicationId: arn:aws:serverlessrepo:eu-west-1:1234123412:applications~sqs-dlq-replay
        SemanticVersion: 1.0.0
      Parameters:
        BackoffRate: "2"
        MaxAttempts: "3"
Conclusion
I describe how an exponential backoff algorithm (with jitter) enhances the message processing capabilities of an Amazon SQS queue. You can now find the amazon-sqs-dlq-replay-backoff application in the AWS Serverless Application Repository. Download the code from this GitHub repository.
To get started with dead-letter queues in Amazon SQS, read the dead-letter queue documentation in the Amazon SQS Developer Guide.
To implement replay mechanisms, see:
- Learn more about the backoff algorithm in this blog post by Marc Brooker.
- Use the SQS message timers feature to delay when a message becomes visible in the queue.
For more serverless learning resources, visit https://serverlessland.com.