Automating connectivity assessments with VPC Reachability Analyzer
If your network architecture is complex and you want to quickly identify application connectivity issues caused by infrastructure changes, the new Amazon Virtual Private Cloud (VPC) Reachability Analyzer can help. It is often unclear whether changes to VPC infrastructure affect connectivity to applications and other AWS services. By implementing automated reachability assessment with Reachability Analyzer, you can quickly detect application issues caused by connectivity problems and mitigate them before major application outages occur. Reachability Analyzer evaluates reachability, or network connectivity, between two endpoints in a VPC (for example, an Elastic Compute Cloud (EC2) instance and an Internet Gateway (IGW)) or across multiple VPCs.
In this post, we will demonstrate an automated method to verify network connectivity between VPC elements after an infrastructure change is made, and alert administrators in the event reachability has been affected. By implementing this automated solution, prolonged outages related to connectivity issues as a result of infrastructure changes can be mitigated.
Solution overview
Scenario
In this blog post, the automated reachability assessment solution is designed around an EC2 instance acting as a web server. This web server must always be reachable on ports 80 and 443 from the public internet. To achieve this desired state of connectivity, the following conditions must be met:
- The web server must be placed in a public subnet in a VPC with an IGW attached.
- A default route must be created in the route table assigned to the public subnet, with the IGW of the VPC as its target.
- Network Access Control Lists (NACL) and security groups must allow access from the public internet to the instance on TCP ports 80 and 443.
If any of these VPC elements are misconfigured, the webserver will not have connectivity from the internet on ports 80 and 443. Identifying the VPC connectivity properties that are required in order to achieve connectivity is the starting point for an automated reachability assessment solution. For this blog post, we will focus on detecting security group changes that cause connectivity to the webserver to fail.
Solution
When a security group change is made, the change event is logged in AWS CloudTrail. CloudTrail then forwards the change event to Amazon EventBridge, which evaluates the change against a series of rules to determine if any actions must be taken. Within EventBridge, a rule will be created to forward all security group change events from CloudTrail to an AWS Lambda function. The Lambda function is responsible for determining if any EC2 instances are impacted by the security group change, and if any Reachability Analyzer paths assessing the connectivity from the internet to the instance exist. If so, the analyses will be restarted. If any of the analyses fail, a message is published from the Lambda function to an Amazon Simple Notification Service (SNS) topic. By implementing this architecture, AWS administrators will quickly be notified if a change in the network infrastructure causes connectivity to fail.
Prerequisite infrastructure
For the purposes of this blog post scenario, there are several infrastructure items that must be in place:
- A VPC with an IGW attached. Record the IGW ID for use later.
- A public subnet with a default route to the IGW in the applicable VPC route table.
- A security group that allows all HTTP and HTTPS traffic from any source IP address. Record the security group ID for use later (for example, sg-xxxxxxxx).
- A single t2.micro EC2 instance in the public subnet with a public IP address assigned. Record the instance ID for use later (for example, i-1234567890abcdef0).
Once these resources are created, the infrastructure for automated reachability assessment must be created. The CloudFormation template linked at the end of this post can be used to create the entire infrastructure of this blog post, including the prerequisite infrastructure.
Overview of steps
- Create Reachability Analyzer paths
- Create the SNS topic and subscription for automated notifications
- Create reachability assessment Lambda code used to restart reachability assessment
- Create Lambda function
- Create an EventBridge rule to trigger Lambda function
- Trigger the automated reachability assessment
1. Create Reachability Analyzer paths
To perform automated reachability assessment, network paths must be manually created using Reachability Analyzer. As the instance in this blog post is acting as a webserver, paths must be created on ports 80 and 443, sourced from the IGW of the VPC, and destined to the webserver instance.
aws ec2 create-network-insights-path \
--source-ip "0.0.0.0" \
--source <IGW ID> \
--destination <Instance ID> \
--protocol tcp \
--destination-port 80
Step 1.1: Create a Reachability Analyzer path from the AWS Command Line Interface (CLI). This path verifies the webserver instance is reachable on port 80 from the public internet.
aws ec2 create-network-insights-path \
--source-ip "0.0.0.0" \
--source <IGW ID> \
--destination <Instance ID> \
--protocol tcp \
--destination-port 443
Step 1.2: Create a second Reachability Analyzer path from the AWS CLI. This path is identical to the first except it verifies connectivity on port 443.
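The two commands above differ only in the destination port. If the same paths are created from Python instead of the CLI, the request parameters can be generated in a loop. The following is a minimal sketch, assuming the boto3 EC2 client method create_network_insights_path is used; the helper name build_path_requests and the resource IDs are hypothetical, and the sketch omits the source IP passed in the CLI commands.

```python
def build_path_requests(igw_id, instance_id, ports, protocol="tcp"):
    """Return one create_network_insights_path keyword dict per port.

    Hypothetical helper: mirrors the CLI flags --source, --destination,
    --protocol and --destination-port used in Steps 1.1 and 1.2.
    """
    return [
        {
            "Source": igw_id,
            "Destination": instance_id,
            "Protocol": protocol,
            "DestinationPort": port,
        }
        for port in ports
    ]


# Placeholder IDs; substitute the IGW and instance IDs recorded earlier.
path_requests = build_path_requests(
    "igw-0123456789abcdef0", "i-1234567890abcdef0", [80, 443]
)

# With a real client (assumption: credentials and region are configured):
#   import boto3
#   ec2 = boto3.client("ec2")
#   for kwargs in path_requests:
#       ec2.create_network_insights_path(**kwargs)
```

Keeping the parameter construction separate from the API call makes it possible to review the requests before any resources are created.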
2. Create the SNS topic and subscription for automated notifications
When an instance fails reachability assessment, a notification will be published to an SNS topic. Any subscribers to the topic will be notified in turn. In this section, a new SNS topic is created along with an email subscription. Any subscribers to the topic receive an email when a message is published.
aws sns create-topic \
--name <TOPIC NAME>
Step 2.1: Create a new SNS Topic. Record the Amazon Resource Name (ARN) of the new topic from the response to the command.
aws sns subscribe \
--topic-arn <TOPIC ARN> \
--protocol email \
--notification-endpoint <EMAIL ADDRESS>
Step 2.2: Create a subscription to the SNS topic. For email subscriptions, the user must confirm subscription to the topic.
3. Create reachability assessment Lambda code used to restart reachability assessment
The Lambda function for automated reachability assessment contains several pieces:
- The security group and event type are extracted from the event forwarded to the Lambda function by EventBridge.
- EC2 instances that have the affected security group attached are discovered.
- Reachability Analyzer paths originating from an IGW and terminating at an affected EC2 instance are discovered.
- Reachability Analyzer paths are re-analyzed considering the new change. Any instances that fail to pass reachability assessment are published to the SNS topic.
These Python code snippets outline the core pieces of the reachability assessment Lambda code. The full Lambda code can be found here. The code makes use of the boto3 package provided by AWS.
def get_security_group_id(event):
    return event.get("detail").get("requestParameters").get("groupId")


def check_security_group_event_name(event):
    return event.get("detail").get("eventName") in security_group_events
The get_security_group_id and check_security_group_event_name functions extract the impacted security group from the EventBridge event and verify that the event is applicable to the Lambda.
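The snippet above relies on a security_group_events collection that is not shown, and it assumes every key is present in the event. The following sketch illustrates one way the two checks might be combined defensively. The SECURITY_GROUP_EVENTS set, the extract_security_group_change helper, and the abbreviated sample event are assumptions for illustration and are not taken from the full Lambda code.

```python
# Event names the function reacts to (assumed to mirror the EventBridge rule).
SECURITY_GROUP_EVENTS = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
}


def extract_security_group_change(event):
    """Return the security group ID for a relevant change event, else None."""
    detail = event.get("detail") or {}
    if detail.get("eventName") not in SECURITY_GROUP_EVENTS:
        return None
    # Returns None as well when requestParameters or groupId is absent.
    return (detail.get("requestParameters") or {}).get("groupId")


# Abbreviated, illustrative event; real events carry many more fields.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "ec2.amazonaws.com",
        "eventName": "RevokeSecurityGroupIngress",
        "requestParameters": {"groupId": "sg-0123456789abcdef0"},
    },
}

print(extract_security_group_change(sample_event))
```

Returning None for unrelated or incomplete events lets the handler exit early instead of raising an AttributeError on a missing key.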
def get_affected_ec2_instances(ec2_session, security_group_id):
    # Find instances with a network interface that uses the security group
    instances = ec2_session.describe_instances(
        Filters=[
            {
                'Name': 'network-interface.group-id',
                'Values': [
                    security_group_id
                ]
            }
        ],
        MaxResults=100
    )
    affected_instances = []
    for reservation in instances.get('Reservations'):
        for instance in reservation.get('Instances'):
            affected_instances.append(instance.get("InstanceId"))
    return affected_instances
Once the security group has been determined, the get_affected_ec2_instances function is called. All instances in the AWS account that have the security group attached are retrieved through the boto3 describe_instances call.
Note: only the first 100 matching instances are returned in this example as a result of the ‘MaxResults’ parameter in the describe_instances call. To consider all instances, pagination must be implemented using the ‘NextToken’ parameter returned in the response to the API call.
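The pagination described in the note can be sketched as a loop that follows NextToken until it is no longer returned. This is a minimal sketch, not the code from the full Lambda function; the function name get_all_affected_instance_ids and the FakeEc2Client stand-in (used only so the example runs without AWS access) are hypothetical.

```python
def get_all_affected_instance_ids(ec2_client, security_group_id, page_size=100):
    """Collect every instance ID with the group attached, following NextToken."""
    instance_ids = []
    kwargs = {
        "Filters": [
            {"Name": "network-interface.group-id", "Values": [security_group_id]}
        ],
        "MaxResults": page_size,
    }
    while True:
        page = ec2_client.describe_instances(**kwargs)
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                instance_ids.append(instance["InstanceId"])
        next_token = page.get("NextToken")
        if not next_token:
            return instance_ids
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = next_token


class FakeEc2Client:
    """Stand-in for a boto3 EC2 client that serves two canned pages."""

    def describe_instances(self, **kwargs):
        if kwargs.get("NextToken") is None:
            return {
                "Reservations": [{"Instances": [{"InstanceId": "i-aaaa"}]}],
                "NextToken": "page-2",
            }
        return {"Reservations": [{"Instances": [{"InstanceId": "i-bbbb"}]}]}


all_ids = get_all_affected_instance_ids(FakeEc2Client(), "sg-0123456789abcdef0")
print(all_ids)
```

With a real client, the paginator returned by the boto3 get_paginator("describe_instances") call is an alternative that handles the token bookkeeping automatically.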
This logic can be altered to fit different scenarios. For example, if the Lambda function was invoked because of a routing table change, the get_affected_ec2_instances function would instead search for EC2 instances residing in subnets associated with the affected routing table.
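As an illustration of the routing table variant, the sketch below looks up the subnets explicitly associated with a route table and then lists the instances in those subnets. It is an assumption-laden sketch rather than part of the solution: the function name get_instances_for_route_table and the FakeEc2Client stand-in are hypothetical, pagination is omitted, and subnets that use the main route table implicitly are not covered.

```python
def get_instances_for_route_table(ec2_client, route_table_id):
    """Return IDs of instances in subnets explicitly associated with a route table."""
    tables = ec2_client.describe_route_tables(RouteTableIds=[route_table_id])
    subnet_ids = [
        association["SubnetId"]
        for table in tables.get("RouteTables", [])
        for association in table.get("Associations", [])
        if "SubnetId" in association  # skip associations that name no subnet
    ]
    if not subnet_ids:
        return []
    instances = ec2_client.describe_instances(
        Filters=[{"Name": "subnet-id", "Values": subnet_ids}]
    )
    return [
        instance["InstanceId"]
        for reservation in instances.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


class FakeEc2Client:
    """Stand-in for a boto3 EC2 client with one route table and one instance."""

    def describe_route_tables(self, RouteTableIds):
        return {
            "RouteTables": [
                {
                    "Associations": [
                        {"Main": True},
                        {"Main": False, "SubnetId": "subnet-aaaa"},
                    ]
                }
            ]
        }

    def describe_instances(self, Filters):
        if Filters == [{"Name": "subnet-id", "Values": ["subnet-aaaa"]}]:
            return {"Reservations": [{"Instances": [{"InstanceId": "i-aaaa"}]}]}
        return {"Reservations": []}


route_table_instances = get_instances_for_route_table(FakeEc2Client(), "rtb-0123")
print(route_table_instances)
```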
def get_affected_reachability_analyzer_paths(ec2_session, affected_instances):
    impacted_network_insights_paths = []
    network_insights_paths = ec2_session.describe_network_insights_paths()
    for instance_id in affected_instances:
        for network_insight_path in network_insights_paths.get(
                'NetworkInsightsPaths'):
            if (
                network_insight_path.get('Destination') == instance_id and
                network_insight_path.get('Source').startswith('igw-')
            ):
                impacted_network_insights_paths.append({
                    'instance_id': instance_id,
                    'network_insights_path_id': network_insight_path.get('NetworkInsightsPathId'),
                })
    return impacted_network_insights_paths
If there are any affected EC2 instances, the get_affected_reachability_analyzer_paths function is called. This function retrieves all Reachability Analyzer paths through the boto3 describe_network_insights_paths API call. The if-statement within the for loop provides the logic for the automated reachability assessment. The conditions in this statement determine the Reachability Analyzer paths that need to be re-analyzed given the change which triggered the Lambda function. In the case of a webserver, any paths sourced from an IGW and destined to one of the affected EC2 instances must be re-analyzed; in this post, those are the paths created earlier for ports 80 and 443. Note that the snippet matches on source and destination only and does not filter by port.
The logic in the get_affected_reachability_analyzer_paths function can be adapted to suit different scenarios, such as when an instance must be accessible using Remote Desktop Protocol from a bastion host in a different subnet. For this connectivity to succeed, routing tables must be present and configured to route traffic between the two subnets, and security groups and network ACLs must be configured to allow access to the instance from the bastion host only on TCP/UDP port 3389. In this case, the conditions described in the preceding function would be changed to search for Reachability Analyzer paths sourced from the bastion host and destined to EC2 instances that should always be accessible by the bastion host.
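One way to support both the webserver and bastion host scenarios is to pass the matching conditions in as parameters. The sketch below is illustrative only: select_paths is a hypothetical helper, the sample path dictionaries are abbreviated, and it assumes each path returned by describe_network_insights_paths carries the Source, Destination, DestinationPort, and NetworkInsightsPathId fields.

```python
def select_paths(paths, destination_ids, source_matches, ports=None):
    """Return tracking dicts for paths whose source, destination and port match."""
    selected = []
    for path in paths:
        if path.get("Destination") not in destination_ids:
            continue
        if not source_matches(path.get("Source", "")):
            continue
        # When ports is None, paths are matched regardless of port.
        if ports is not None and path.get("DestinationPort") not in ports:
            continue
        selected.append(
            {
                "instance_id": path["Destination"],
                "network_insights_path_id": path["NetworkInsightsPathId"],
            }
        )
    return selected


# Abbreviated sample paths with placeholder IDs.
sample_paths = [
    {"NetworkInsightsPathId": "nip-web", "Source": "igw-0a",
     "Destination": "i-web", "DestinationPort": 443},
    {"NetworkInsightsPathId": "nip-rdp", "Source": "i-bastion",
     "Destination": "i-app", "DestinationPort": 3389},
]

# Webserver scenario: sourced from any IGW, ports 80 or 443.
web_paths = select_paths(
    sample_paths, {"i-web"}, lambda source: source.startswith("igw-"), ports={80, 443}
)

# Bastion scenario: sourced from one specific host, port 3389.
rdp_paths = select_paths(
    sample_paths, {"i-app"}, lambda source: source == "i-bastion", ports={3389}
)
```

The returned dictionaries use the same keys as the preceding function, so the result could be handed to the analysis step unchanged.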
def start_network_insights_analysis(ec2_session, network_insights_paths):
    for index, network_insights_path in enumerate(network_insights_paths):
        response = ec2_session.start_network_insights_analysis(
            NetworkInsightsPathId=network_insights_path['network_insights_path_id']
        )
        if response.get('NetworkInsightsAnalysis').get('Status') == 'running':
            network_insights_paths[index].update(
                {
                    'status': response.get('NetworkInsightsAnalysis').get('Status'),
                    'network_insights_analysis_id': response.get('NetworkInsightsAnalysis').get('NetworkInsightsAnalysisId')
                }
            )
    if not any(network_insights_path for network_insights_path in network_insights_paths if network_insights_path.get('status') == 'running'):
        instance_ids = list(map(lambda path: path.get('instance_id'), network_insights_paths))
        raise RuntimeError(
            f'Failed to start Network Insights analysis for any affected instances: {instance_ids}'
        )
    return list(
        filter(lambda path: (path.get('status') is not None and path.get('status') == 'running'),
               network_insights_paths)
    )
After the Reachability Analyzer paths that need to be re-analyzed have been determined, the start_network_insights_analysis function is called. This function starts a new analysis for each of the paths by calling the boto3 start_network_insights_analysis API. The function also adds the status property to the array of objects used to track which instances must have reachability re-assessed.
import time

import botocore.exceptions


def get_network_insights_results(ec2_session, network_insights_paths, context):
    completed_analyses = 0
    # Poll until every analysis is done or fewer than two seconds remain
    while (
        completed_analyses < len(network_insights_paths) and
        context.get_remaining_time_in_millis() / 1000 >= 2
    ):
        for network_insights_path in network_insights_paths:
            # Skip paths that already have a final status
            if (
                network_insights_path.get('status') == 'succeeded' or
                network_insights_path.get('status') == 'skip'
            ):
                continue
            if context.get_remaining_time_in_millis() / 1000 < 2:
                break
            try:
                analysis = ec2_session.describe_network_insights_analyses(
                    NetworkInsightsAnalysisIds=[network_insights_path.get(
                        'network_insights_analysis_id')]
                )
                if not len(analysis.get('NetworkInsightsAnalyses')) > 0:
                    network_insights_path.update({
                        'status': 'skip'
                    })
                    completed_analyses += 1
                    continue
                if analysis.get('NetworkInsightsAnalyses')[0].get('Status') == 'succeeded':
                    completed_analyses += 1
                    network_insights_path.update({
                        'status': analysis.get('NetworkInsightsAnalyses')[0].get('Status'),
                        'analysis_result': analysis
                        .get('NetworkInsightsAnalyses')[0]
                        .get('NetworkPathFound')
                    })
            except botocore.exceptions.ClientError:
                network_insights_path.update({
                    'status': 'skip'
                })
                completed_analyses += 1
                continue
        if context.get_remaining_time_in_millis() / 1000 >= 2:
            # Wait before the next pass if any analysis is still running
            if not (
                all(network_insights_path.get('status') == 'succeeded' or network_insights_path.get(
                    'status') == 'skip' for network_insights_path in network_insights_paths)
            ):
                time.sleep(3)
        else:
            break
    return network_insights_paths
After reachability analysis has been started for each network path, the get_network_insights_results function is called to check the results. Because the analysis is asynchronous, it may not be complete on the first check, so the boto3 describe_network_insights_analyses API call is placed inside a polling loop. Each pass checks every analysis that has not yet completed; when an analysis completes, the status property of the corresponding object is updated along with the result of the analysis. If incomplete analyses remain after a pass, the function sleeps for three seconds before checking again. Polling continues until all analyses complete or fewer than two seconds remain before the Lambda timeout. The array of objects is then returned from the function.
Note: A Reachability Analyzer assessment may take longer than the Lambda function timeout to complete. This blog post does not address this scenario. AWS Step Functions could be added to the architecture to verify all analyses completed; if not, Step Functions could be configured to restart the function and continue the polling.
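The polling loop depends on the get_remaining_time_in_millis method of the Lambda context object. To exercise that logic outside of Lambda, a small stand-in can be used. The following is a testing aid sketched under assumptions; the FakeLambdaContext class and has_time_left helper are hypothetical names and are not part of the solution.

```python
import time


class FakeLambdaContext:
    """Local stand-in for the Lambda context object (testing aid only)."""

    def __init__(self, timeout_seconds):
        self._deadline = time.monotonic() + timeout_seconds

    def get_remaining_time_in_millis(self):
        # Never report a negative value once the deadline has passed.
        return max(0, int((self._deadline - time.monotonic()) * 1000))


def has_time_left(context, reserve_seconds=2):
    """Same check the polling loop uses: keep a small reserve before timeout."""
    return context.get_remaining_time_in_millis() / 1000 >= reserve_seconds


# A 60 second budget, matching the timeout used later when deploying.
context = FakeLambdaContext(timeout_seconds=60)
print(has_time_left(context))
```

Passing an instance of this class as the context argument allows the polling function to be run locally against a stubbed EC2 client.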
def send_sns_notification(failed_paths, unknown_status_paths,
                          sns_session, sns_topic_arn, security_group_id):
    message = ""
    if len(failed_paths) > 0:
        message += "The following instances: "
        message += f"{', '.join(list({failed_path.get('instance_id') for failed_path in failed_paths}))} "
        message += "did not pass reachability assessment after security group "
        message += f"{security_group_id} was updated.\n\n"
    if len(unknown_status_paths) > 0:
        message += "The following instances: "
        message += f"{', '.join(list({unknown_status_path.get('instance_id') for unknown_status_path in unknown_status_paths}))} "
        message += "did not complete reachability assessment after security group "
        message += f"{security_group_id} was updated."
    if message != "":
        sns_session.publish(TopicArn=sns_topic_arn, Message=message)
    return
For any instances that fail reachability assessment, or could not have their reachability status determined, the send_sns_notification function is called. This function generates a message containing the instance IDs that failed or did not complete reachability assessment. A message is then published to the SNS topic for distribution to all subscribers.
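The snippets do not show how the polled results are divided into the failed_paths and unknown_status_paths arguments. One plausible way to derive them is sketched below. The partition_results helper is hypothetical, and it assumes the status and analysis_result keys set by get_network_insights_results.

```python
def partition_results(network_insights_paths):
    """Split polled paths into (failed, unknown) lists for notification."""
    failed, unknown = [], []
    for path in network_insights_paths:
        if path.get("status") != "succeeded":
            # Analysis was skipped or did not finish before the timeout.
            unknown.append(path)
        elif not path.get("analysis_result"):
            # Analysis finished and no network path was found.
            failed.append(path)
    return failed, unknown


# Illustrative results with placeholder instance IDs.
polled_paths = [
    {"instance_id": "i-ok", "status": "succeeded", "analysis_result": True},
    {"instance_id": "i-blocked", "status": "succeeded", "analysis_result": False},
    {"instance_id": "i-slow", "status": "running"},
    {"instance_id": "i-error", "status": "skip"},
]

failed_paths, unknown_status_paths = partition_results(polled_paths)
print([path["instance_id"] for path in failed_paths])
print([path["instance_id"] for path in unknown_status_paths])
```

Paths whose analysis found a network path fall into neither list, so no notification is generated for them.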
4. Create Lambda function
Once the code has been completed and saved in a Python file, a deployment package is created. The deployment package contains the application code and any dependencies. Once created, the package is uploaded and deployed as a Lambda function. In order to create the package, Python, pip, and a compression utility are necessary on the development machine.
pip install --target ./package boto3
Step 4.1: As Reachability Analyzer is a new AWS service, the commands have not yet been added to the boto3 package provided by the Lambda runtime. As a result, the latest boto3 package is included in the deployment package as a dependency along with the function code. This command installs the boto3 package in a local folder called ‘package’.
cd package/
zip -r ../reachability_assessment.zip .
cd ..
zip -g reachability_assessment.zip app.py
Step 4.2: Zip the package directory and add the Python file containing the function code to the archive.
Before deploying the code to AWS Lambda as a Lambda function, a role must be created that grants the function necessary permissions. This function needs full access to Reachability Analyzer and the SNS publish permission. To grant these permissions to a Lambda function, an Identity and Access Management (IAM) role will be created, and then policies attached to the role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"lambda.amazonaws.com"
]
},
"Action": "sts:AssumeRole"
}
]
}
Step 4.3: Place the JSON block in a file and save it as trust-policy.json. This trust policy allows the Lambda service to assume the role.
aws iam create-role \
--role-name ReachabilityAssessmentLambdaRole \
--assume-role-policy-document file://trust-policy.json
Step 4.4: Create the IAM role for the Lambda function using the AWS CLI. Be sure to note the ARN from the output returned by the command.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:GetTransitGatewayRouteTablePropagations",
"ec2:DescribeTransitGatewayPeeringAttachments",
"ec2:SearchTransitGatewayRoutes",
"ec2:DescribeTransitGatewayRouteTables",
"ec2:DescribeTransitGatewayVpcAttachments",
"ec2:DescribeTransitGatewayAttachments",
"ec2:DescribeTransitGateways",
"ec2:GetManagedPrefixListEntries",
"ec2:DescribeManagedPrefixLists",
"ec2:DescribeAvailabilityZones",
"ec2:DescribeCustomerGateways",
"ec2:DescribeInstances",
"ec2:DescribeInternetGateways",
"ec2:DescribeNatGateways",
"ec2:DescribeNetworkAcls",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribePrefixLists",
"ec2:DescribeRegions",
"ec2:DescribeRouteTables",
"ec2:DescribeSecurityGroups",
"ec2:DescribeSubnets",
"ec2:DescribeVpcEndpoints",
"ec2:DescribeVpcPeeringConnections",
"ec2:DescribeVpcs",
"ec2:DescribeVpnConnections",
"ec2:DescribeVpnGateways",
"ec2:DescribeVpcEndpointServiceConfigurations",
"elasticloadbalancing:DescribeListeners",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeLoadBalancerAttributes",
"elasticloadbalancing:DescribeRules",
"elasticloadbalancing:DescribeTags",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeTargetHealth",
"tiros:CreateQuery",
"tiros:GetQueryAnswer",
"tiros:GetQueryExplanation",
"ec2:CreateTags",
"ec2:DeleteTags",
"ec2:StartNetworkInsightsAnalysis",
"ec2:DescribeNetworkInsightsAnalyses",
"ec2:DescribeNetworkInsightsPaths"
],
"Resource": "*"
}
]
}
Step 4.5: Place the JSON block in a file and save it as ec2-permissions.json. This policy will grant Reachability Analyzer access to the VPC elements necessary to perform path analysis. Additional information about Reachability Analyzer permissions can be found in the Required API permissions for VPC Reachability Analyzer documentation entry.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sns:Publish"
],
"Resource": "<SNS_TOPIC_ARN>"
}
]
}
Step 4.6: Place the JSON block in a file and save it as sns-permissions.json. This policy will grant the Lambda function permission to publish to the SNS topic created earlier. Replace <SNS_TOPIC_ARN> with the ARN of the SNS topic created earlier.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:CreateLogGroup",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
Step 4.7: Place the JSON block in a file and save it as cloudwatch-permissions.json. This policy will grant the Lambda function permission to publish logs to Amazon CloudWatch Logs.
aws iam put-role-policy \
--role-name ReachabilityAssessmentLambdaRole \
--policy-name ReachabilityAssessmentEC2Permission \
--policy-document file://ec2-permissions.json
aws iam put-role-policy \
--role-name ReachabilityAssessmentLambdaRole \
--policy-name ReachabilityAssessmentSNSPermission \
--policy-document file://sns-permissions.json
aws iam put-role-policy \
--role-name ReachabilityAssessmentLambdaRole \
--policy-name ReachabilityAssessmentCloudWatchPermission \
--policy-document file://cloudwatch-permissions.json
Step 4.8: Attach the policies to the role using the AWS CLI.
aws lambda create-function \
--function-name ReachabilityAssessment \
--runtime python3.7 \
--zip-file fileb://reachability_assessment.zip \
--handler app.lambda_handler \
--role <ROLE_ARN> \
--description "Lambda function for automated reachability assessment." \
--timeout 60 \
--environment "Variables={SNS_TOPIC_ARN=<SNS_TOPIC_ARN>}"
Step 4.9: Deploy the Lambda function using the AWS CLI. Replace the <SNS_TOPIC_ARN> placeholder with the ARN of the SNS topic created earlier. Replace the <ROLE_ARN> placeholder with the ARN of the Lambda role created earlier. This command will deploy the code to AWS Lambda with the role created in this section, a timeout of 60 seconds, and a single environment variable. Note the FunctionArn in the returned output for use later (for example, arn:aws:lambda:<region>:<account>:function:ReachabilityAssessment).
5. Create EventBridge rule that triggers the Lambda function
Amazon EventBridge is a rules-based engine that triggers actions based on events received from AWS services. One of the supported AWS services that interacts with EventBridge is CloudTrail. As users perform management activities, CloudTrail delivers the resulting events to EventBridge, where actions are taken based on a set of rules. In this case, any security group change will trigger the automated reachability assessment Lambda function. To accomplish this, an EventBridge rule must be created which matches all security group change events. Once matched, the rule will forward the event to the reachability assessment Lambda function.
{
"detail": {
"eventName": [
"AuthorizeSecurityGroupIngress",
"AuthorizeSecurityGroupEgress",
"RevokeSecurityGroupIngress",
"RevokeSecurityGroupEgress"
]
}
}
This event pattern will be used for the rule created within EventBridge. It matches any security group change event from CloudTrail. The rule is considered matched when the eventName of the event delivered from CloudTrail equals any one of the values in the 'eventName' array.
aws events put-rule \
--name "SecurityGroupChangeRule" \
--event-pattern "{\"detail\": {\"eventName\": [\"AuthorizeSecurityGroupIngress\",\"AuthorizeSecurityGroupEgress\",\"RevokeSecurityGroupIngress\",\"RevokeSecurityGroupEgress\"]}}" \
--state ENABLED \
--description "Matches any security group change event from Cloudtrail"
Step 5.1: Create the EventBridge rule using the AWS CLI. Record the ARN returned in the response to this command for use later.
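Hand-escaping the JSON pattern for the shell is error-prone, so the pattern string can instead be generated with a JSON serializer. The sketch below is one possible approach rather than part of the solution. The build_event_pattern helper is hypothetical, and the optional narrowing to the aws.ec2 source and the "AWS API Call via CloudTrail" detail type is an addition that the rule in this post does not use.

```python
import json

SECURITY_GROUP_EVENT_NAMES = [
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
]


def build_event_pattern(event_names, narrow_to_ec2=False):
    """Return the EventBridge event pattern as a JSON string."""
    pattern = {"detail": {"eventName": list(event_names)}}
    if narrow_to_ec2:
        # Optional narrowing (an assumption, not in the original rule).
        pattern["source"] = ["aws.ec2"]
        pattern["detail-type"] = ["AWS API Call via CloudTrail"]
    return json.dumps(pattern)


pattern_json = build_event_pattern(SECURITY_GROUP_EVENT_NAMES)
print(pattern_json)

# With a real client (assumption: credentials and region are configured):
#   import boto3
#   boto3.client("events").put_rule(
#       Name="SecurityGroupChangeRule",
#       EventPattern=pattern_json,
#       State="ENABLED",
#   )
```

The printed string can also be saved to a file and supplied to the CLI, which avoids the backslash escaping entirely.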
aws events put-targets \
--rule "SecurityGroupChangeRule" \
--targets "Id"="1","Arn"="<LAMBDA FUNCTION ARN>"
Step 5.2: Add the Lambda function as a target for the rule created in Step 5.1. Replace <LAMBDA FUNCTION ARN> with the ARN of the function created in the 'Create Lambda function' section.
{
"Effect":"Allow",
"Action":"lambda:InvokeFunction",
"Resource":"<LAMBDA_FUNCTION_ARN>",
"Principal":{
"Service":"events.amazonaws.com"
},
"Condition":{
"ArnLike":{
"AWS:SourceArn":"<EVENT_BRIDGE_RULE_ARN>"
}
}
}
Finally, EventBridge requires permission to invoke the Lambda function. This resource-based policy must be applied to the Lambda function to grant EventBridge invoke access.
aws lambda add-permission --statement-id "InvokeLambdaFunction" \
--action "lambda:InvokeFunction" \
--principal "events.amazonaws.com" \
--function-name "<LAMBDA_FUNCTION_ARN>" \
--source-arn "<EVENT_BRIDGE_RULE_ARN>"
Step 5.3: To apply the resource-based policy to the Lambda function, run the add-permission command from the AWS CLI. Replace <LAMBDA_FUNCTION_ARN> with the FunctionArn recorded earlier and <EVENT_BRIDGE_RULE_ARN> with the EventBridge rule ARN recorded earlier.
6. Trigger the automated reachability assessment
Once the infrastructure is in place, all security group changes will trigger the reachability assessment Lambda function. Instances that have the impacted security group attached will have their connectivity re-assessed using the already defined Reachability Analyzer paths. In this case, Reachability Analyzer verifies that the webserver instance is reachable from the public internet on TCP ports 80 and 443 after a security group change. If the instance fails either assessment, a message will be published to SNS.
aws ec2 revoke-security-group-ingress \
--group-id <SECURITY GROUP ID> \
--protocol tcp \
--port 443 \
--cidr 0.0.0.0/0
Step 6.1: To begin, the security group created in the prerequisite infrastructure section will be modified to revoke access on port 443 from the internet. This change will cause one of the Reachability Analyzer path analyses to fail, and a message will be published to the SNS topic. Shortly after, an email will arrive in the subscriber’s inbox describing the instances that have failed reachability assessment.
aws ec2 authorize-security-group-ingress \
--group-id <SECURITY GROUP ID> \
--protocol tcp \
--port 443 \
--cidr 192.168.1.0/24
Step 6.2: Next, the security group is modified to restore inbound access on port 443, but only for hosts with an IP address in the 192.168.1.0/24 range. As a result, the reachability assessment fails because the instance is still not reachable from the internet, and another email is sent to the subscriber's inbox.
aws ec2 authorize-security-group-ingress \
--group-id <SECURITY GROUP ID> \
--protocol tcp \
--port 443 \
--cidr 0.0.0.0/0
Step 6.3: Finally, the security group is modified to restore inbound access on port 443 for all traffic. The reachability assessment Lambda function still runs as a result of this change, but no email is sent because the reachability requirements are satisfied.
Cleanup
Remove the EventBridge rule, Lambda function, and SNS topic to avoid incurring extra costs. If you used the sample VPC infrastructure, remove the EC2 instance, IGW, subnet, and VPC as well.
Conclusion
As shown by this example, automated reachability assessment is a tool that you can use to verify that AWS resources retain their desired connectivity after infrastructure changes. By implementing an automated reachability assessment solution powered by Reachability Analyzer, you can quickly detect and mitigate connectivity issues caused by network infrastructure changes before they lead to prolonged outages.
Try it yourself with the CloudFormation template:
https://github.com/aws-samples/amazon-vpc-reachability-analyzer-automated-analysis
Get Started with VPC Reachability Analyzer:
https://docs.thinkwithwp.com/vpc/latest/reachability/getting-started.html