AWS Storage Blog
Monitoring AWS Storage Gateway health and performance using Amazon CloudWatch
When managing a hybrid-cloud infrastructure, monitoring system health is essential for maintaining business continuity. Setting up comprehensive monitoring provides visibility into performance and availability of infrastructure components. By establishing alert thresholds and promptly responding to alarms, administrators can identify degraded performance or outages early. Quickly diagnosing and fixing the issues maximizes uptime.
AWS Storage Gateway, a hybrid cloud storage service that enables on-premises workloads to use AWS Storage services, provides native AWS monitoring metrics using Amazon CloudWatch. Although Storage Gateway provides metrics on health and performance, users need comprehensive observability of these attributes. Getting notified of gateway disruptions through CloudWatch, ideally before users notice, is key to maintaining 24/7 data access. Alerting rules of this kind act like a radar for availability.
In this post, we discuss a two-pronged approach to achieve comprehensive observability of Storage Gateway health and performance. First, we enable recommended CloudWatch alarms, which focus on crucial performance metrics and system functionality. Second, we implement a custom monitoring solution that leverages CloudWatch custom metrics and an AWS Lambda function. This custom monitoring solution regularly checks the status of the Storage Gateway and can promptly notify you via Amazon Simple Notification Service (SNS) whenever the Storage Gateway enters an offline state. Together, these two monitoring solutions provide comprehensive oversight of your Storage Gateway’s performance, operational state, and overall health. This solution offers an early detection and notification mechanism, provides flexibility to tailor the monitoring to your specific needs, and reduces the risk of undetected issues or downtime.
How to monitor recommended Storage Gateway CloudWatch metrics
During the activation of Storage Gateway, there is an option to create recommended CloudWatch alarms. Each type of Storage Gateway has default CloudWatch alarms created without any CloudWatch actions:
- For Amazon S3 File Gateway, the recommended CloudWatch alarms are created for Storage Gateway’s CachePercentDirty, IOWaitPercent, and FileSharesUnavailable with default data points for an alarm.
- For Amazon FSx File Gateway, alarms are created for CachePercentDirty, IOWaitPercent, FilesFailingUpload, and FileSystem-ERROR.
- For both Volume and Tape Gateways, CloudWatch alarms are created for the metrics CachePercentDirty and IOWaitPercent.
The following image shows the recommended CloudWatch alarms for S3 File Gateway. We can add CloudWatch actions, such as adding a notification using SNS or triggering a Lambda function for these alarms.
Definitions for terminology
The following terminology explains the CloudWatch metrics that are included in recommended CloudWatch alarms for the Storage Gateway.
- CachePercentDirty refers to the percentage of data in cache that is yet to be uploaded to AWS. This often happens when application throughput is much more than the network bandwidth available to upload data to the Cloud.
- IOWaitPercent refers to the percentage of time that the CPU waits for input/output (I/O) on the disk to complete. This metric is the same as the Linux metric. A high value of this metric indicates a disk that is too slow for the workload.
- FileSharesUnavailable refers to the number of file shares that are in the unavailable status. An S3 file share is in the unavailable status if the Storage Gateway virtual machine (VM) has issues reaching Amazon S3 endpoints or lacks permissions to access the S3 bucket.
- FilesFailingUpload refers to the number of files that are failing to upload to the FSx file system associated with Storage Gateway. Enable CloudWatch logs to identify the list of files that failed to upload.
- FileSystem-ERROR indicates that the FSx file system associated with Storage Gateway cannot be mounted. This could be due either to a lack of permissions to mount the file system or to the unavailability of a network path between Amazon FSx File Gateway and the FSx file system.
After successfully setting the recommended CloudWatch alarms, CloudWatch actions can be added to notify through email or other means using Amazon SNS.
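As a rough sketch of adding such an action programmatically, the helper below copies an existing alarm definition and appends an SNS topic to its alarm actions. It assumes a boto3 CloudWatch client (for example `boto3.client('cloudwatch')`) is passed in. The function names and the `PUT_ALARM_FIELDS` tuple are illustrative and not part of the original solution; the alarm name and topic ARN are placeholders you supply.

```python
# Fields returned by describe_alarms that put_metric_alarm also accepts.
PUT_ALARM_FIELDS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
)


def with_sns_action(alarm, topic_arn):
    """Return put_metric_alarm arguments for `alarm` with the topic added."""
    kwargs = {key: alarm[key] for key in PUT_ALARM_FIELDS if key in alarm}
    actions = list(kwargs.get("AlarmActions", []))
    if topic_arn not in actions:
        actions.append(topic_arn)
    kwargs["AlarmActions"] = actions
    return kwargs


def add_sns_action(cloudwatch_client, alarm_name, topic_arn):
    """Re-create the named alarm with the SNS topic as an alarm action."""
    response = cloudwatch_client.describe_alarms(AlarmNames=[alarm_name])
    for alarm in response["MetricAlarms"]:
        cloudwatch_client.put_metric_alarm(**with_sns_action(alarm, topic_arn))
```

Because `put_metric_alarm` overwrites the whole alarm definition, the sketch reads the current definition first and changes only the action list.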
In addition to these default recommended CloudWatch alarms, Storage Gateway also provides CloudWatch metrics at the gateway, file share, and volume levels.
- CacheHitPercent refers to the percentage of read requests that were served from the gateway cache. Monitoring this metric shows how often data is read from the local cache, as opposed to being fetched from the AWS Cloud.
- ReadBytes and WriteBytes metrics can be used to monitor the overall read/write throughput on Storage Gateway from local client machines. These metrics can also be monitored for individual volumes associated with Volume Gateway.
- CloudBytesUploaded and CloudBytesDownloaded metrics are used to monitor the overall network throughput between the Storage Gateway VM and AWS. The combination of these metrics shows the overall use of network throughput to upload or download data from the AWS Cloud.
You can use CloudWatch alarms to monitor these CloudWatch metrics.
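To make the throughput reading concrete, here is a minimal sketch that converts a per-period Sum of CloudBytesUploaded into megabits per second and builds the arguments for a `get_metric_statistics` query. The helper names are hypothetical, and the `GatewayId` and `GatewayName` dimensions are an assumption about how your gateway metrics are published; verify them in the CloudWatch console.

```python
def throughput_mbps(byte_sum, period_seconds):
    """Convert a Sum of bytes over one period to megabits per second."""
    return byte_sum * 8 / period_seconds / 1_000_000


def upload_query(gateway_id, gateway_name, start, end, period_seconds=300):
    """Arguments for cloudwatch_client.get_metric_statistics(**kwargs)."""
    return {
        "Namespace": "AWS/StorageGateway",
        "MetricName": "CloudBytesUploaded",
        # Assumed dimension names; check the metric in the console.
        "Dimensions": [
            {"Name": "GatewayId", "Value": gateway_id},
            {"Name": "GatewayName", "Value": gateway_name},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }
```

For example, 75,000,000 bytes uploaded in a 60-second period corresponds to 10 Mbps, which you can compare against the bandwidth available to the gateway.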
Custom monitoring solution overview
The solution includes CloudWatch custom metrics with a scheduled Lambda function that checks the Storage Gateway status and notifies the email address subscribed to an Amazon SNS topic. This solution is applicable for all types of Storage Gateways in both AWS and on-premises locations.
The workflow of this solution, as shown in the preceding diagram, is as follows:
- An Amazon EventBridge schedule triggers a Lambda function at a recurring interval, and the function describes the Storage Gateway status.
- The Lambda function updates a CloudWatch custom metric value. A CloudWatch alarm is created to monitor this custom metric.
- The CloudWatch alarm notifies Amazon SNS when the metric breaches the set threshold value.
Prerequisites
For this walkthrough, you should have the following:
- An AWS account with Storage Gateway activated.
- Access to create AWS resources such as IAM role, Lambda function, EventBridge rule, CloudWatch metric, and Amazon SNS.
- IAM permissions to deploy the AWS CloudFormation template.
- The IAM role associated with the CloudFormation template should have access to AWS resources such as a Lambda function, EventBridge rule, CloudWatch metric, Amazon SNS, and permissions to create another IAM Role.
- Storage Gateway ID and Amazon SNS Amazon Resource Name (ARN)
Walkthrough: Monitor Storage Gateway status using Amazon CloudWatch custom metrics
In this section, we demonstrate setting up CloudWatch custom metrics to monitor the status of Storage Gateway.
We complete the following steps to deploy a solution for monitoring the status of Storage Gateway so that we can act immediately if it goes offline:
Step 1: Create an AWS IAM policy and role as well as a Lambda function (in the same AWS Region as your gateway) to check Storage Gateway status.
Step 2: Add Python code to the Lambda function.
Step 3: Create a test event to test for a successful execution of the Lambda function.
Step 4: Set up an Amazon EventBridge rule to schedule the regular execution of the Lambda function at a specified frequency.
Step 5: Establish an Amazon SNS notification to alert the email address subscribed to the SNS topic.
Step 6: Create a CloudWatch metric alarm to notify SNS whenever Storage Gateway goes offline.
Let’s get started.
Step 1: Create an AWS IAM policy and role as well as a Lambda function (in the same AWS Region as your gateway) to check Storage Gateway status
1. Open the IAM console, select Policies and Create policy, select JSON, and add the following statement. Replace region and account-id with your Region and account ID, and then create the policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:region:account-id:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:region:account-id:log-group:/aws/lambda/*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "storagegateway:DescribeGatewayInformation"
            ],
            "Resource": "arn:aws:storagegateway:region:account-id:gateway/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "StorageGatewayCustom"
                }
            }
        }
    ]
}
2. Navigate to IAM Roles, select Create role, select AWS Service, choose Lambda as Use case, select Next, and choose the newly created IAM policy from Step 1.1. Name the role “SGW-State-Checker-Role”, and select Create role to create the IAM role.
3. Open the AWS Lambda console, navigate to the Region where Storage Gateway resides, and create a Lambda function.
a) Select Create function.
b) Select Author from scratch.
c) Set Function name as “SGW-State-Checker”.
d) Select Runtime: Python 3.12 and x86_64 as Architecture.
e) Permissions: Select Use an existing role and select the IAM role SGW-State-Checker-Role created earlier.
f) Select Create function to complete the creation of the Lambda function.
This Lambda function creates a CloudWatch log group during its first execution. It is recommended to set up an encryption and retention period for this log group.
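One way to apply the retention recommendation is sketched below. It assumes a boto3 CloudWatch Logs client (for example `boto3.client('logs')`) run by an administrator with the needed permissions, and it relies on the `/aws/lambda/<function name>` log group naming that the IAM policy above also uses. The helper names and the 30-day default are illustrative choices, not values from the original solution.

```python
def log_group_name(function_name):
    """Log group that Lambda writes to for the given function."""
    return f"/aws/lambda/{function_name}"


def set_retention(logs_client, function_name="SGW-State-Checker", days=30):
    """Set a retention period on the function's log group."""
    logs_client.put_retention_policy(
        logGroupName=log_group_name(function_name),
        retentionInDays=days,  # must be one of the values CloudWatch Logs allows
    )


def set_encryption(logs_client, kms_key_arn, function_name="SGW-State-Checker"):
    """Associate a KMS key (placeholder ARN supplied by you) with the log group."""
    logs_client.associate_kms_key(
        logGroupName=log_group_name(function_name),
        kmsKeyId=kms_key_arn,
    )
```

The log group must already exist, so run this after the first execution of the function.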
Step 2: Add Python code to the Lambda function
This code executes the Storage Gateway DescribeGatewayInformation API call for a given Storage Gateway and polls to check the Storage Gateway status. If the Storage Gateway status is running, then it creates a custom CloudWatch metric with a value of 1.0, otherwise it sets a value of 0.0.
1. Review the following code:
a) You can obtain the Storage Gateway ID from the Storage Gateway console in the AWS Management Console. The Storage Gateway ID looks like: sgw-2321C24A.
b) You can also obtain this ID by entering the following command in an AWS Command Line Interface (AWS CLI) session (command returns gateway information for the Region specified):
$ aws storagegateway list-gateways --region us-east-1
2. Paste this code in the code tab of the Lambda function and Deploy.
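import os

import boto3

# Clients are created once per execution environment and reused across invocations.
sgw_client = boto3.client('storagegateway')
cloudwatch_client = boto3.client('cloudwatch')


def lambda_handler(event, context):
    # The gateway ID and account number come from the event (test event or EventBridge input).
    gateway_id = event['gateway_id']
    account_number = event['account_number']
    region = os.environ['AWS_REGION']
    gateway_arn = f"arn:aws:storagegateway:{region}:{account_number}:gateway/{gateway_id}"

    # Query the state once: 1.0 means RUNNING, 0.0 means any other state.
    gateway_state = get_gateway_state(gateway_arn)
    metric_value = 1.0 if gateway_state == 'RUNNING' else 0.0

    # The GatewayState dimension is a fixed label; the metric value carries the actual state.
    cw_dimensions = [
        {'Name': 'GatewayState', 'Value': 'RUNNING'},
        {'Name': 'GatewayID', 'Value': gateway_id},
    ]
    cloudwatch_client.put_metric_data(
        Namespace='StorageGatewayCustom',
        MetricData=[{'MetricName': 'GatewayInfo', 'Value': metric_value, 'Dimensions': cw_dimensions}]
    )


def get_gateway_state(gateway_arn):
    try:
        sgw_info = sgw_client.describe_gateway_information(GatewayARN=gateway_arn)
        return sgw_info['GatewayState']
    except sgw_client.exceptions.InvalidGatewayRequestException as err:
        # A disconnected gateway raises this exception instead of returning a state.
        # Fall back to the standard error message field if 'message' is absent.
        message = err.response.get('message') or err.response.get('Error', {}).get('Message')
        if message == "The specified gateway is not connected.":
            return 'OFFLINE'
        raise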
3. Navigate to the Configuration tab, select General configuration, select Edit and change the Timeout to 10 seconds.
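If you want to check the function's logic without calling AWS, the pure parts can be isolated as in the sketch below. These helper names are hypothetical and do not appear in the Lambda code above; they only mirror how it builds the gateway ARN and maps a state to a metric value.

```python
def build_gateway_arn(region, account_number, gateway_id):
    """Build the gateway ARN the same way the Lambda function does."""
    return f"arn:aws:storagegateway:{region}:{account_number}:gateway/{gateway_id}"


def metric_value_for_state(state):
    """1.0 when the gateway is RUNNING, 0.0 for any other state."""
    return 1.0 if state == "RUNNING" else 0.0


def build_metric_data(gateway_id, state):
    """MetricData entry in the shape passed to put_metric_data."""
    return {
        "MetricName": "GatewayInfo",
        "Value": metric_value_for_state(state),
        "Dimensions": [
            {"Name": "GatewayState", "Value": "RUNNING"},
            {"Name": "GatewayID", "Value": gateway_id},
        ],
    }
```

Any state other than RUNNING, including the OFFLINE value the function returns for a disconnected gateway, maps to 0.0, which is what the alarm later watches for.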
Step 3: Create a test event to test for a successful execution of the Lambda function
To test that the Lambda execution is successful and the CloudWatch custom metrics are set as expected, create a test event for Lambda:
1. Navigate to the Code tab, select Test, and choose Configure test event from the dropdown.
2. Select Create a new event.
3. Event name: SGWStateChecker.
4. Provide the following value in the Event JSON section and select Format JSON. Provide your gateway_id and account_number as follows:
{
    "gateway_id": "sgw-2321C24A",
    "account_number": "123456789101"
}
5. Select Save and make sure your newly created SGWStateChecker test is selected from the drop down.
6. Select Test. If the test is successful, you should observe a response similar to the following. A null response indicates that the function ran without errors; the test reports an error if there are any issues with the code.
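The same test can be run outside the console. The sketch below assumes a boto3 Lambda client (for example `boto3.client('lambda')`) is passed in; `build_payload` and `invoke_checker` are hypothetical helper names, and the gateway ID and account number are the placeholder values used above.

```python
import json


def build_payload(gateway_id, account_number):
    """Encode the test event as the bytes payload that invoke expects."""
    event = {"gateway_id": gateway_id, "account_number": account_number}
    return json.dumps(event).encode("utf-8")


def invoke_checker(lambda_client, gateway_id, account_number,
                   function_name="SGW-State-Checker"):
    """Invoke the function synchronously; return (function_error, body)."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=build_payload(gateway_id, account_number),
    )
    # The handler returns nothing, so a successful run yields JSON null.
    body = json.loads(response["Payload"].read())
    return response.get("FunctionError"), body
```

A result of `(None, None)` corresponds to the null response described in the step above; a non-empty first element means the function raised an error.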
Step 4: Set up an Amazon EventBridge rule to schedule the regular execution of the Lambda function at a specified frequency
Create an EventBridge rule to schedule the execution of the Lambda function created in the previous step. In the following example, we set the rule to execute the Lambda function every minute.
1. Open the EventBridge console, select Rules, and select Create rule.
2. Name: “SGWStateChecker”.
3. Select Rule type: Schedule.
4. Select Continue to create rule.
5. Select the option A schedule that runs at a regular rate, such as every 10 minutes, and select the Rate expression with a fixed rate of one minute. You can set the frequency of schedule as per your needs.
6. From Target 1, select AWS service, select Lambda function as target and choose the function named SGW-State-Checker.
7. Expand Additional settings, choose Constant (JSON text) from the drop down, provide the gateway ID and account number in the following format. Afterward, select Next and then Create rule to complete the creation of the rule.
{
    "gateway_id": "sgw-2321C24A",
    "account_number": "123456789101"
}
Select Constant to view the input of the Lambda function.
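For reference, the console steps above correspond roughly to the following sketch. It assumes boto3 EventBridge and Lambda clients (for example `boto3.client('events')` and `boto3.client('lambda')`) are passed in and that you supply the function ARN. The helper names, the statement ID, and the target ID are hypothetical.

```python
import json


def rate_expression(minutes):
    """EventBridge rate expression; the unit is singular only for a value of 1."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"


def schedule_checker(events_client, lambda_client, function_arn,
                     gateway_id, account_number, minutes=1,
                     rule_name="SGWStateChecker"):
    """Create the schedule rule and point it at the Lambda function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function (the console adds this for you).
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # The constant JSON input becomes the event passed to the handler.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "sgw-state-checker",  # hypothetical target ID
            "Arn": function_arn,
            "Input": json.dumps({"gateway_id": gateway_id,
                                 "account_number": account_number}),
        }],
    )
```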
Step 5: Establish an Amazon SNS notification to alert the email address subscribed to the SNS topic
The CloudWatch alarm continuously compares the custom metric with the threshold, enters the alarm state when the value is lower than the threshold, and sends a notification to the Amazon SNS topic. The SNS topic contains a list of subscribers, such as user email addresses, that need to be notified when the gateway goes offline.
1. Open the Amazon SNS console.
2. Select Create topic.
3. Select Type as Standard.
4. Provide the Name and Display name for the topic.
5. Create topic.
6. Select Access policy, choose Advanced, and add the following statement to the existing policy after updating the region and account ID. Provide the ARN of your SNS topic in this resource statement. Select Create topic to complete the creation of SNS topic.
{
    "Sid": "SGWStateCheckerAlert",
    "Effect": "Allow",
    "Principal": {
        "Service": [
            "events.amazonaws.com",
            "cloudwatch.amazonaws.com"
        ]
    },
    "Action": "sns:Publish",
    "Resource": "arn:aws:sns:region:account-id:SGWStateChecker"
}
7. Navigate to the Subscriptions tab, select Create subscription, select the Topic ARN, and set Protocol to Email-JSON to send an email notification when the CloudWatch metric is in the alarm state. Provide the email address in the endpoint section and select Create subscription. Amazon SNS sends a confirmation email to that address, and the recipient must confirm the subscription before notifications are delivered. The following screenshot shows an email address that is confirmed and another one that is pending confirmation. Encrypting the SNS topic is outside the scope of this post. It is recommended to set up AWS Key Management Service (AWS KMS) encryption with a customer managed key (CMK) for EventBridge to send messages to the SNS topic.
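The topic, access policy statement, and subscription can also be created with a script. The sketch below assumes a boto3 SNS client (for example `boto3.client('sns')`) is passed in; the helper names are hypothetical, and the statement mirrors the one shown above.

```python
import copy
import json


def alarm_publish_statement(topic_arn):
    """Statement that lets CloudWatch and EventBridge publish to the topic."""
    return {
        "Sid": "SGWStateCheckerAlert",
        "Effect": "Allow",
        "Principal": {"Service": ["events.amazonaws.com",
                                  "cloudwatch.amazonaws.com"]},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }


def add_statement(policy, statement):
    """Return a copy of the policy with the statement added or replaced by Sid."""
    merged = copy.deepcopy(policy)
    kept = [s for s in merged.get("Statement", [])
            if s.get("Sid") != statement["Sid"]]
    merged["Statement"] = kept + [statement]
    return merged


def create_topic_with_subscription(sns_client, email, name="SGWStateChecker"):
    """Create the topic, extend its access policy, and subscribe an email address."""
    topic_arn = sns_client.create_topic(Name=name)["TopicArn"]
    attributes = sns_client.get_topic_attributes(TopicArn=topic_arn)["Attributes"]
    policy = add_statement(json.loads(attributes["Policy"]),
                           alarm_publish_statement(topic_arn))
    sns_client.set_topic_attributes(TopicArn=topic_arn, AttributeName="Policy",
                                    AttributeValue=json.dumps(policy))
    sns_client.subscribe(TopicArn=topic_arn, Protocol="email-json", Endpoint=email)
    return topic_arn
```

The recipient still has to confirm the subscription from the confirmation email before any alarm notification is delivered.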
Step 6: Create a CloudWatch metric alarm to notify SNS whenever Storage Gateway goes offline
The Lambda function creates the CloudWatch custom metric with values of either 1.0 or 0.0. The CloudWatch alarm is used to monitor this metric and notify the user through Amazon SNS or to take other action as needed. The alarm can be created only after the EventBridge rule has invoked the Lambda function at least once, because the custom metric does not exist before then. The CloudWatch alarm is set to trigger an SNS notification if the value is below 1 for 5 out of 5 data points.
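To illustrate what "5 out of 5 data points" means, here is a deliberately simplified sketch of the evaluation. It is not how CloudWatch is implemented (missing-data handling in particular is more involved); `alarm_state` is a hypothetical helper that looks only at the most recent values.

```python
def alarm_state(values, threshold=1.0, datapoints_to_alarm=5,
                evaluation_periods=5):
    """Simplified M-out-of-N check over the most recent data points."""
    window = values[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value < threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

With one data point per minute, a gateway that goes offline produces five consecutive 0.0 values before the alarm fires, so under these settings a notification arrives roughly five minutes after the outage starts. A single 1.0 within the window keeps the state at OK.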
1. Navigate to the CloudWatch console.
2. Select All alarms.
3. Select Create alarm and choose Select metric:
a) Choose StorageGatewayCustom from Custom namespaces.
b) Select the GatewayID, GatewayState dimension and then select the gateway ID in the next step.
c) Choose Select metric to list information, as in the following image, and set the Period to 1 minute.
- This metric is available after the first execution of the Lambda function.
4. Select the threshold type as static and set the alarm condition to a value lower than one. Expand the Additional configuration and choose 5 data points out of 5 for alarm configuration. In the Missing data treatment field, select Treat missing data as missing.
5. Select Next and Configure actions for alarm notification. Select the SNS topic configured in Step 5. Upon selecting the SNS topic, it lists out the email endpoints associated with the topic. Also, add another notification, set its Alarm state trigger to OK, and choose the same SNS topic to send notifications when the alarm transitions to the OK state.
6. Select Next and provide a name SGWStateCheckerAlarm for CloudWatch alarm. Select Next and Create alarm.
The configuration of the CloudWatch alarm makes sure that Storage Gateway is monitored and a notification is sent to the email address when it goes offline.
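The alarm settings from this step can be expressed as arguments to `put_metric_alarm`, as in the sketch below. It assumes a boto3 CloudWatch client is passed in. The helper names are hypothetical, and the `Average` statistic is an assumption because the walkthrough does not name one; with a single data point per one-minute period, the common statistics give the same value.

```python
def state_alarm_kwargs(gateway_id, topic_arn,
                       alarm_name="SGWStateCheckerAlarm"):
    """Arguments for put_metric_alarm that mirror the console settings."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "StorageGatewayCustom",
        "MetricName": "GatewayInfo",
        "Dimensions": [
            {"Name": "GatewayState", "Value": "RUNNING"},
            {"Name": "GatewayID", "Value": gateway_id},
        ],
        "Statistic": "Average",   # assumption: not specified in the walkthrough
        "Period": 60,             # 1 minute
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,   # 5 out of 5 data points
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # notify when entering ALARM
        "OKActions": [topic_arn],     # notify when returning to OK
    }


def create_state_alarm(cloudwatch_client, gateway_id, topic_arn):
    """Create or update the gateway state alarm."""
    cloudwatch_client.put_metric_alarm(**state_alarm_kwargs(gateway_id, topic_arn))
```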
Deployment using CloudFormation template
CloudFormation is a service that allows you to define and manage AWS resources using a template. The steps described in the second section can be automated using this CloudFormation template. This template needs a subscription email ID and Storage Gateway ID as input parameters. The email ID provided in the subscription should be confirmed for an alarm notification to work. Deployment of this template should be done in the same Region as Storage Gateway. This template creates an IAM Role and associates it with a Lambda function, which creates a custom CloudWatch metric to monitor the Storage Gateway provided. A CloudWatch alarm is set for the default values provided in the preceding section.
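If you prefer to deploy the template from a script rather than the console, a sketch along the following lines may help. It assumes a boto3 CloudFormation client (for example `boto3.client('cloudformation')`) created in the gateway's Region. The parameter keys `GatewayId` and `SubscriptionEmail` are hypothetical placeholders, so replace them with the parameter names defined in the actual template.

```python
def stack_kwargs(stack_name, template_body, gateway_id, email):
    """Arguments for cloudformation_client.create_stack(**kwargs)."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            # Hypothetical parameter keys; use the names from the template.
            {"ParameterKey": "GatewayId", "ParameterValue": gateway_id},
            {"ParameterKey": "SubscriptionEmail", "ParameterValue": email},
        ],
        # Required acknowledgement because the template creates an IAM role.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(cloudformation_client, stack_name, template_path, gateway_id, email):
    """Read the template file and start stack creation; returns the stack ID."""
    with open(template_path, encoding="utf-8") as handle:
        template_body = handle.read()
    response = cloudformation_client.create_stack(
        **stack_kwargs(stack_name, template_body, gateway_id, email)
    )
    return response["StackId"]
```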
Cleaning up
To remove all the components created by this solution and avoid future charges, complete the following steps:
- Sign in to the AWS Management Console.
- On the CloudWatch console, choose All alarms, select SGWStateCheckerAlarm from Alarms, select Actions and Delete from dropdown.
- In the SNS console, select SGWStateChecker and select Delete to delete the SNS topic.
- On the EventBridge console, choose Rules, select rule SGWStateChecker and select Delete to delete the rule.
- Navigate to Lambda console, select the function SGW-State-Checker, select Actions and choose Delete to delete the Lambda function.
- If the CloudFormation template was used to create this custom solution, navigate to the CloudFormation console, select the stack created, and select Delete to delete the resources created by this template.
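For the manually created resources, the cleanup steps above could be scripted roughly as follows. The sketch assumes boto3 clients for CloudWatch, SNS, EventBridge, and Lambda are passed in and that you supply the topic ARN; `cleanup` is a hypothetical helper, and the default names are the ones used in this walkthrough.

```python
def cleanup(cloudwatch_client, sns_client, events_client, lambda_client,
            topic_arn, alarm_name="SGWStateCheckerAlarm",
            rule_name="SGWStateChecker", function_name="SGW-State-Checker"):
    """Delete the alarm, topic, rule, and function created in the walkthrough."""
    cloudwatch_client.delete_alarms(AlarmNames=[alarm_name])
    sns_client.delete_topic(TopicArn=topic_arn)
    # A rule cannot be deleted while it still has targets, so remove them first.
    targets = events_client.list_targets_by_rule(Rule=rule_name)["Targets"]
    if targets:
        events_client.remove_targets(Rule=rule_name,
                                     Ids=[target["Id"] for target in targets])
    events_client.delete_rule(Name=rule_name)
    lambda_client.delete_function(FunctionName=function_name)
```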
Conclusion
In this post, we discussed a monitoring solution available for Storage Gateway using the default recommended CloudWatch alarms and set up a customized solution to monitor the availability of Storage Gateway by configuring CloudWatch custom metrics. Furthermore, we showed key CloudWatch metrics available for each type of Storage Gateway.
This solution provides an observability mechanism to continuously track the Storage Gateway status with automated alarms and anomaly detection. This automated monitoring and alerting mechanism provides a way to promptly notify users through channels such as emails, chatbots, text messages, or paging if the Storage Gateway goes offline unexpectedly. Rapid notifications facilitate responding to these outages faster to diagnose root causes and restore availability before business operations are disrupted significantly.
For further information about monitoring AWS Storage Gateway, visit the Storage Gateway page.