Network observability for modern applications

In today’s highly distributed and cloud-based IT environments, network monitoring has become crucial for organizations to maintain the health, performance, and security of their applications and infrastructure. However, as modern application architectures evolve, with multiple layers of abstraction and cloud-native services, many teams look for better ways to collect and use the high-quality network data required to inform critical business insights and decision-making.

This post explores how you can design effective network monitoring in the cloud, using AWS services to collect, monitor, analyze, and act on network-level data to deliver valuable metrics and key performance indicators (KPIs). To help illustrate the concepts in this post, we describe a sample scenario.

Sample scenario

Figure 1 shows an enterprise-scale organization with multiple applications that require guaranteed access to a central database and an on-premises mainframe, each managed by different teams across separate AWS accounts. A central networking team oversees the connectivity between these accounts.

The database and mainframe need to support a 99.9 percent service level agreement (SLA) to the application teams, with a strict 50 millisecond (ms) latency requirement. The sensitive data in these systems also requires comprehensive access tracking and logging for 1 year, as mandated by the InfoSec team.
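To make the availability target concrete, the following minimal sketch converts the 99.9 percent SLA into a downtime budget. The measurement windows (30 and 365 days) and the function name `downtime_budget_minutes` are assumptions for illustration; the scenario does not state how the SLA is measured.

```python
def downtime_budget_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime allowed in the period at the given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Assumed measurement windows: a 30-day month and a 365-day year.
budget_30d = downtime_budget_minutes(99.9, 30)
budget_365d = downtime_budget_minutes(99.9, 365)

print(f"{budget_30d:.1f} minutes per 30 days")       # 43.2 minutes per 30 days
print(f"{budget_365d / 60:.2f} hours per 365 days")  # 8.76 hours per 365 days
```

A budget this small is one reason the later phases emphasize early detection and automated response.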

With this scenario in place, let us explore the concepts that drive network observability and a model that is helpful in designing a solution. Later sections in this post illustrate how all the concepts and techniques can be applied.

Figure 1: Architecture diagram of the sample scenario

The role of network observability

As cloud computing technologies continue to evolve, the role of observability has expanded beyond traditional network monitoring. Monitoring focuses on collecting and reporting data points about the health of a system, which is a passive process. Observability expands on this. According to the AWS observability guide, observability is the “capability to continuously generate and discover actionable insights based on signals from the system under observation.” The shift to actionable insights moves observability into an active part of any architecture. For these insights to be valuable, they need to provide information that meets the needs of an organization. It is a best practice for organizations to build any observability goal around a KPI, SLA, or key business metric.

In cloud-based applications, there are often many logical layers that separate the application or business need from the underlying network infrastructure. Network observability paints a full picture of the performance, security, and cost-efficiency of these complex, distributed systems.

Concretely, a robust observability solution has the following beneficial impact on these aspects:

  • Troubleshooting and resiliency: Well-designed observability allows for fast issue resolution and self-healing of applications.
  • Performance tuning: Network metrics are valuable for understanding performance bottlenecks and optimizing workloads.
  • Security and governance: Comprehensive network controls are often required to meet compliance and security requirements. These controls must be monitored.
  • Cost management: Observing network data transfer costs is important for optimizing cloud spend.

It’s common for well-architected network observability designs to meet more than one of these goals. Crucially, you need to maintain a clear focus on the business purpose and desired outcomes when implementing any observability solution. This ensures that a solution will correctly prioritize how these points get addressed. The metrics and signals you prioritize should be directly tied to supporting your customers and delivering value, not just monitoring the infrastructure.

AWS design principles for network observability

Consider the following design principles when planning a network observability solution on AWS.

End-to-end visibility – Achieve comprehensive visibility into your cloud network to enable effective monitoring and troubleshooting. AWS provides services that capture network-level telemetry across your entire cloud environment, such as VPC Flow Logs and Amazon CloudWatch. These services give you a holistic view of your network’s health and performance.

Correlated insights – Network data is most valuable when analyzed in the context of broader system performance and business metrics. AWS makes it easy to correlate network telemetry with other observability data sources using services such as Amazon Managed Grafana and Amazon OpenSearch Service. As a result, your teams can quickly identify root causes, optimize resource utilization, and make data-driven decisions.

Seamless scalability – As your cloud environment grows, your network observability solutions must scale seamlessly. AWS Lambda and Amazon Kinesis provide serverless, event-driven capabilities that automatically scale to meet your increasing data processing and analytics demands, allowing you to focus on deriving insights from your network data.

Unified observability – Effective network monitoring in the cloud requires a holistic view that combines network-level data with application, security, and business intelligence. With AWS services such as OpsCenter, a capability of AWS Systems Manager; AWS X-Ray; and Amazon Athena, you can unify observability across your entire environment, helping your teams make data-driven decisions that optimize network operations and business outcomes.

Event-driven insight – In the fast-paced world of cloud computing, network issues and optimization opportunities require rapid response. Amazon EventBridge allows you to create rules that automatically trigger actions based on network-related events, enabling quick detection, investigation, and resolution of problems, as well as proactive optimization of your cloud environment.
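As a rough illustration of the event-driven principle, the sketch below shows an EventBridge-style event pattern that selects CloudWatch alarm state changes entering the `ALARM` state, together with a deliberately simplified matcher so you can see which events the pattern would select. The matcher only handles exact-value lists and nested objects; it is an assumption-laden teaching aid, not how EventBridge is implemented, and the alarm name in the sample event is hypothetical.

```python
# Event pattern for CloudWatch alarm state changes that enter ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matching: every pattern key must exist in the event, and the
    event value must equal one of the listed values (or match a nested pattern)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, abbreviated event for illustration only.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "db-latency-above-50ms",  # hypothetical alarm name
        "state": {"value": "ALARM"},
    },
}

print(matches(ALARM_PATTERN, sample_event))  # True
```

In a real deployment, a rule with a pattern like this would route matching events to a target such as a notification topic or a remediation function.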

The collect, monitor, and analyze and act model

Figure 2 shows how the components of collect, monitor, and analyze and act fit together with many AWS services. When designing a solution, it is helpful to think about your design in these phases.

Figure 2: The components of the collect, monitor, and analyze and act model

Collect

Collecting metrics and logs is the first phase of network observability. Metrics and logs are the raw data sources you use to build an observability system. Without reliable and robust data, any observability system is unlikely to succeed.

Monitor

The data observed through monitoring allows you to understand the current state of your network and application, identify performance bottlenecks or security vulnerabilities, and detect issues early before they impact your end users. CloudWatch dashboards at the figure’s center provide this monitoring and data aggregation capability. These dashboards are built from the data collected in the collect phase.

Effective monitoring is essential for network observability. The alarms and triggers you create during the monitoring phase feed into the next stage of the model. With these tools in place, you can identify and resolve problems in your network.

Analyze and act

Analysis and diagnosis are where customers spend the most time during an operational event or root cause analysis, and that time is the largest contributor to extended downtime. Knowing what to focus on is critical but remains difficult for many customers.

As shown in the diagram, AWS provides multiple tools for the analysis phase that help you focus on the right information for diagnosis and reduce mean time to repair (MTTR). For example, features such as Network Access Analyzer and Reachability Analyzer can assist in determining the impact of changes on your workload before deploying to production.

When an issue is detected, focusing on the right metrics and logs as quickly as possible enables quicker response to failures. AWS services like CloudWatch can be used for detecting functionality problems.

Once you identify the cause of a failure, you need to act, which may involve a short-term fix or patch, a rollback, or an architectural change. It’s best practice to automate your deployments and changes as much as possible to test them upfront and reduce configuration errors.

Performing post-event analysis for shared learning, identifying design gaps, and determining how to prevent the failure from recurring is also a best practice. Your goal should be to ensure the same issue does not re-occur, and if it does, to identify and remediate it automatically.

By incorporating the key elements illustrated in the collect, monitor, and analyze and act model, you can establish a comprehensive approach to collecting and monitoring your network data. Then, you can analyze or act on any events, using AWS services to optimize visibility and reduce MTTR.

Bringing it all together

As described in the sample scenario at the beginning of this post, the key observability goals are:

  1. Provide end-to-end visibility to deliver near real-time data on network availability and latency to all teams.
  2. Use event-driven insights to quickly report when network performance drops below acceptable levels and trigger automated remediation.
  3. Collect network-level data that can be correlated with application logs to satisfy the InfoSec security requirements.

Collect

Figure 3 shows where collection takes place in the sample scenario. Implement Amazon Virtual Private Cloud (Amazon VPC) Flow Logs at key points to capture network traffic data, along with CloudWatch Logs and metrics on critical network components.
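To show what the collected data looks like, here is a minimal sketch that parses a single VPC Flow Logs record in the default (version 2) format, where fields are space-separated. The helper name `parse_flow_log_record` and the sample record are illustrative assumptions; custom log formats use a different field order, so treat the field list as valid only for the default format.

```python
# Field order of the default (version 2) VPC Flow Logs format.
FLOW_LOG_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)
INT_FIELDS = ("version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end")


def parse_flow_log_record(line: str) -> dict:
    """Split one default-format record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FLOW_LOG_FIELDS):
        raise ValueError(f"expected {len(FLOW_LOG_FIELDS)} fields, got {len(values)}")
    record = dict(zip(FLOW_LOG_FIELDS, values))
    for name in INT_FIELDS:
        # Records without traffic data carry "-" placeholders, so convert only digits.
        if record[name].isdigit():
            record[name] = int(record[name])
    return record


# Illustrative sample record (SSH traffic accepted on port 22).
sample = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
record = parse_flow_log_record(sample)
print(record["action"], record["dstport"], record["end"] - record["start"])
# ACCEPT 22 60
```

Once records are structured like this, fields such as `srcaddr`, `dstaddr`, and `action` can be correlated with application logs for access tracking.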

Figure 3: Architecture diagram showing where collection takes place

Monitor

In Figure 4, we use both reactive data sources (logs, metrics) and active monitoring tools (CloudWatch Synthetics, CloudWatch Internet Monitor, CloudWatch Network Monitor) to create dashboards, alerts, and events that track network performance against the defined SLAs and thresholds.
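The idea of tracking performance against a threshold can be sketched as a small "M out of N datapoints" evaluator for the scenario's 50 ms latency requirement. This is a simplified model written for illustration. The class name `LatencyAlarm` and the 3-of-5 setting are assumptions, and a managed alarm also handles details this sketch ignores, such as missing data.

```python
from collections import deque


class LatencyAlarm:
    """Report ALARM when at least `datapoints_to_alarm` of the last
    `evaluation_periods` samples exceed the latency threshold."""

    def __init__(self, threshold_ms: float = 50.0,
                 datapoints_to_alarm: int = 3, evaluation_periods: int = 5):
        self.threshold_ms = threshold_ms
        self.datapoints_to_alarm = datapoints_to_alarm
        # Sliding window of booleans: True means the sample breached the threshold.
        self._window = deque(maxlen=evaluation_periods)

    def observe(self, latency_ms: float) -> str:
        self._window.append(latency_ms > self.threshold_ms)
        breaches = sum(self._window)
        return "ALARM" if breaches >= self.datapoints_to_alarm else "OK"


alarm = LatencyAlarm()
samples_ms = [12, 48, 55, 61, 47, 72]  # hypothetical latency probe results
states = [alarm.observe(s) for s in samples_ms]
print(states)
# ['OK', 'OK', 'OK', 'OK', 'OK', 'ALARM']
```

Requiring several breaching datapoints before alarming trades a little detection speed for fewer false alarms from a single slow probe.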

Figure 4: Architecture diagram of the monitor phase

Analyze and act

Analyze and act on the collected data and monitoring capabilities to meet organizational goals. Here are some examples:

  1. For information on providing real-time visibility to all teams on network availability and latency, see Monitor hybrid connectivity with Amazon CloudWatch Network Monitor.
  2. To trigger automated notifications and remediations when performance issues are detected, see Using Amazon CloudWatch with Amazon EventBridge for cross-account event monitoring.
  3. To archive network data for 1 year to support the InfoSec compliance requirements, see Publish flow logs to Amazon S3.
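For the 1-year archival goal, one possible approach is an Amazon S3 lifecycle rule on the bucket that receives the flow logs. The sketch below only builds the rule document. The function name, the rule ID, and the choice to expire objects at exactly 365 days are assumptions; confirm with your InfoSec team whether data should be kept longer or transitioned to another storage class instead of deleted.

```python
import json


def flow_log_retention_rule(prefix: str = "AWSLogs/", retention_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that expires objects under `prefix`
    once they are `retention_days` old."""
    return {
        "Rules": [
            {
                "ID": f"retain-flow-logs-{retention_days}-days",  # hypothetical rule ID
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": retention_days},
            }
        ]
    }


lifecycle = flow_log_retention_rule()
print(json.dumps(lifecycle, indent=2))

# Assumption: with boto3, a document shaped like this could be applied through the
# S3 client's put_bucket_lifecycle_configuration call as LifecycleConfiguration.
```

Keeping the retention period as a parameter makes it straightforward to adjust if the compliance mandate changes.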

By aligning the network observability solution to your specific use case, your organization can implement comprehensive visibility, rapid issue detection and resolution, and compliance with security mandates—all while optimizing the performance and reliability of your cloud-based applications.

Aligning to your use cases

When establishing your network observability strategy, it’s important to consider the specific use cases that will shape your implementation. The architecture of your network will impact the tools and methods you use to achieve your observability goals.

Observability for connectivity within your AWS environment

This use case focuses on observing connectivity within your AWS environment. Because you have a greater degree of administrative control over the entire network, you can place monitoring closer to the source and destination of traffic, enabling complete visibility. Key areas to monitor include:

Observability for hybrid and internet connectivity

For hybrid cloud environments, you need to observe connectivity between your AWS resources and on-premises data centers and branch offices. Observing your internet-facing workloads is crucial for understanding end user experience. These use cases often include network paths you cannot directly monitor or manage. Later sections in this post show how to resolve this. Key areas to monitor include:

Observability for application networking

Newer application patterns like serverless or containerized architectures have created a new use case for your network observability strategy. Because much of this network traffic is being generated by short-lived devices, you must integrate your network monitoring with the control plane that manages these workloads. Key areas to monitor include:

Observability for network security

Observing your network security posture is essential for protecting your workloads. The need to monitor network security is often an additional use case with its own requirements. It may be helpful to meet this need on a parallel path. Key areas to monitor include:

  • Workload segmentation, ingress, egress, and east-west traffic using AWS Network Firewall, AWS WAF, and Gateway Load Balancer
  • Security control points, including both AWS services and partner firewall appliances

Conclusion

Effective network observability is critical for cloud-based applications. Using AWS observability services, organizations can gain comprehensive visibility, correlate network-level data, and use scalable, unified observability. By aligning the observability strategy to use cases like connectivity, networking, and security, organizations can achieve faster troubleshooting, optimize performance, enhance security, and make data-driven decisions to support their evolving cloud architectures.

When you are ready to implement network observability in your environment and want to explore how your use case can be addressed, reach out to your AWS account team.

About the authors

Wayne Geils

Wayne is a Senior Solutions Architect at Amazon Web Services (AWS), where he works with education technology providers. He has 25 years of experience in technology and infrastructure, and has helped customers manage their AWS journeys since 2016. With a diverse background ranging from Systems Administrator to CTO, Wayne remains passionate about the role of technology in bridging the present with the possibilities of the future.

Sohaib Tahir

Sohaib is a Principal Solutions Architect and a technical leader at Amazon Web Services (AWS) for the US state and local government finance and administration team. He has more than 14 years of experience in the technology and engineering space and has helped customers deliver AWS powered solutions since 2015. Sohaib specializes in designing mission critical systems in the cloud such as tax, unemployment insurance, retirement systems and others. He works with government agencies globally to help deliver on their mission using cloud technologies.