AWS for M&E Blog

Improve your viewers’ live streaming experience with Media Quality-Aware Resiliency

Architecting a highly resilient solution for streaming premium live sports events, concerts, or news is critical to delight viewers with a high quality of experience (QoE). Deploying your video delivery workflow in two different Amazon Web Services (AWS) Regions is one way to provide extra redundancy. Region failover can mitigate impacts to workflows and even withstand the unlikely event of a partial or complete Region failure.

An additional reason to switch between redundant video streams is when a quality degradation is detected in the live video delivered from the venue to one of the Regions. Traditionally, such quality degradation is detected by a dedicated video expert who maintains “eyes-on-glass” (also known as confidence monitoring) to catch quality issues and switch to a secondary stream. This manual process can take several minutes, during which viewers might miss a climactic moment in a sporting event or an important news item. The longer it takes to recover, the higher the impact to QoE, which can eventually lead dissatisfied viewers to cancel subscriptions.

Today we launched Media Quality-Aware Resiliency (MQAR), an integrated capability between Amazon CloudFront and AWS Elemental Media Services. MQAR provides automated, cross-region origin selection and failover based on a dynamically computed media quality score. With MQAR, your “eyes-on-glass” is always on, detecting video quality issues and switching between redundant video streams automatically in a matter of seconds.

Always on “eyes-on-glass” for high quality of experience

With AWS, live event streaming starts at the venue with video delivered onto the cloud for processing and distribution to viewers at scale. Measuring QoE is critical to understanding how viewers perceive video on their playback devices, including mobile, PC, connected TV, and VR headset devices. The most common metrics customers use to evaluate QoE include video startup time, buffering ratio, video play failure, bitrate, video resolution, and viewer engagement.

These metrics reflect the playback performance on the client player, which can be impacted by latency or availability of the stream, but not the underlying video quality. For example, if the video is distorted, frozen, or playing black frames, there won’t be any indication in the QoE metrics.

To enable a level of quality control that inspects the video signal health from the origin, you can now use Media Quality-Aware Resiliency (MQAR), which provides:

  • Dynamic, cross-region origin selection based on a video quality score
  • Seamless media stream switching with cross-region packaging endpoints
  • Continuous monitoring of video quality
  • Analysis of each video frame to detect common quality problems
  • Detecting problems at an early stage in the workflow to recover early, avoiding a large blast radius of quality degradation
  • Visibility and alerting on the diagnosed problems, to facilitate debugging and manual mitigations when required

Architecture for live event delivery at scale with Media Quality-Aware Resiliency

You can enable MQAR on top of a cross-region video delivery workflow. Your architecture can use a single input stream per region, or two input streams per region for higher resiliency and less switching between regions.

This diagram shows the end-to-end workflow architecture leveraging two AWS regions, and highlights the MQAR-related mechanisms at the MediaLive, MediaPackage, and CloudFront levels.

Figure 1: MQAR in a redundant input, redundant pipeline channel per region.

In this walkthrough we will focus on the MQAR origin selection.

  1. Starting from the venue, you can use two redundant encoders (in this case AWS Elemental Live appliances) for delivering your live event feed to AWS Elemental MediaConnect in two regions, and then to AWS Elemental MediaLive.
  2. Your MediaLive standard channels process two video inputs and continuously generate a Media Quality Confidence Score (MQCS) based on the detection of media quality issues at the adaptive bitrate (ABR) encoder level. This score reflects the perceived anomalies a given viewer-facing stream contains, compared to an ideal state where the contribution source is clear of defects and the encoding process is steady. The MQCS is passed to the packager/origin AWS Elemental MediaPackage in a Common Media Server Data (CMSD) header with each HTTP PUT request in the ingest phase.
  3. Next, your MediaPackage channels constantly inspect the MQCS of each of their two ingest pipelines and select the pipeline that reports the best quality score, allowing for in-region quality-based resiliency. This requires you to use epoch-locked Common Media Application Format (CMAF) ingest, so that the switches between inputs are seamless from a content timeline continuity perspective. MediaPackage originates the video stream and signals the MQCS to the output, also in a CMSD header, so all downstream services that are fetching the origination stream can get the MQCS information through a GET or a HEAD request.
  4. When you enable MQAR in your CloudFront distribution, CloudFront constantly compares the MQCS from the two synchronized MediaPackage channels located in different AWS Regions. It dynamically selects the MediaPackage origin that reports the best quality score for the channel. This allows for cross-region quality-based resiliency. When Amazon CloudFront Origin Shield is used in combination with third-party content delivery networks (CDNs), all downstream CDNs automatically get the media segments with the best possible quality. This provides customers with the additional benefit of cross-CDN quality-based resiliency.
  5. Viewers from all CDNs receive the video stream with the highest quality to their media client, for the best possible quality of experience.
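Because the MQCS travels in a CMSD response header, any downstream service can read it from a GET or HEAD response. The following is a minimal Python sketch of how such a header might be parsed. It assumes the score is carried as a simple `key=value` item in a comma-separated CMSD header. The header name and key are passed in as parameters because this post does not spell them out, and the `example-mqcs` key used in the usage note is hypothetical; check the MediaPackage documentation for the actual names.

```python
def parse_cmsd(header_value):
    """Parse a comma-separated list of key=value pairs into a dict.

    Simplification: this does not handle commas inside quoted strings.
    """
    pairs = {}
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # A bare key without "=" is treated as boolean true.
        pairs[key.strip()] = value.strip().strip('"') if sep else True
    return pairs


def mqcs_from_headers(headers, header_name, key):
    """Return the integer score stored under `key` in the given header, or None."""
    raw = parse_cmsd(headers.get(header_name, "")).get(key)
    if isinstance(raw, str) and raw.isdigit():
        return int(raw)
    return None
```

For example, `mqcs_from_headers(response_headers, "CMSD-Static", "example-mqcs")` would return the score as an integer if a key with that (hypothetical) name is present, and `None` otherwise.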

Configure your live event delivery workflow with Media Quality-Aware Resiliency enabled

The following configuration recommendation applies to your video pipeline workflow deployment in two AWS Regions for redundancy. To determine which AWS Regions support the AWS Media Services, please refer to the AWS Services by Region list.

AWS Elemental MediaLive configuration

MediaLive automatically generates an MQCS when publishing to a MediaPackage V2 channel configured with a CMAF Input type. Follow the instructions in the MediaLive User Guide to configure a CMAF Ingest output group.

A closer look at the MQCS in MediaLive

MQCS fundamentally differs from other video quality analysis (VQA) approaches as it’s not comparing the input signal with the encoded output signal in order to estimate the perceived video quality at the viewer level. Instead, MQCS focuses on detecting the most common quality problems that can happen in video contribution and encoding workflows.

These problems are the source of “grey failures”, as they don’t result in availability errors on the player side (the live stream is not stale, all media segments are available), but impact the visual experience after video decoding, usually in a sporadic way. These issues can be categorized in five groups: source bitstream problems, elementary stream problems, error concealment, error recovery, and segment errors.

With the initial launch of MQCS support, MediaLive can synthesize a simple quality score for each media segment based on the detection of six types of problems:

  1. Fill frames, resulting from the encoder repeating a frame due to input loss
  2. Dropped frames, resulting from a video buffer drop or decode problems
  3. Continuity counter errors, resulting from network packet loss
  4. Black frames, potentially resulting from production or contribution problems
  5. Freeze frames, usually resulting from production or contribution problems
  6. Speed versus quality (SVQ) encoding problems, resulting from the encoder reducing its encoding quality to maintain real-time operations

Each of these problems is an input to the MQCS synthesizing algorithm. When no problems are detected, the MQCS will be 100. When one or more of these problems are detected, the MQCS will be downgraded, potentially to zero, in proportion to the duration and severity of the issue. For example, if an input loss is detected, the score will drop to zero, as we are certain that the output quality will be degraded. However, with frozen frames the score might be downgraded to 25, as the output is still more useful than black video frames.
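To make the idea of a score that degrades "in proportion to duration and severity" concrete, here is a toy Python sketch. It is not the MediaLive algorithm, which this post does not disclose. The two floor values come from the examples above (input loss to 0, frozen frames to 25); the linear interpolation and the rule of taking the worst problem are assumptions made purely for illustration.

```python
# Score a segment would get if the problem affected its whole duration.
# The two values mirror the examples in the text; everything else in this
# sketch is an illustrative assumption, not the real MQCS algorithm.
SEVERITY_FLOOR = {"input_loss": 0, "freeze_frames": 25}


def toy_mqcs(problems):
    """Return a 0-100 score for one segment.

    `problems` maps a problem name to the fraction (0.0 to 1.0) of the
    segment duration that the problem affected.
    """
    score = 100.0
    for name, fraction in problems.items():
        floor = SEVERITY_FLOOR[name]
        fraction = min(max(fraction, 0.0), 1.0)  # clamp to a valid fraction
        # Interpolate linearly between 100 (unaffected) and the floor.
        degraded = 100.0 - (100.0 - floor) * fraction
        score = min(score, degraded)  # the worst problem determines the score
    return round(score)
```

With this toy model, a clean segment scores 100, a segment entirely lost at the input scores 0, and a segment that is frozen for its full duration scores 25.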

The algorithm is tuned to avoid false positives and reflect actual problems with short observation periods. The scoring algorithm is very efficient in terms of resource usage and doesn’t generate latency on the output streams.

AWS Elemental MediaPackage configuration

You will need to configure two identical MediaPackage channels leveraging CMAF Ingest in two different regions. This is a requirement for both the baseline cross-region failover and for MQAR dynamic origin selection. Make sure to configure your endpoint error behavior as described to achieve resiliency across different error scenarios. Also, validate that your CloudFront distribution can fail over to your backup MediaPackage origin.

For MQAR, you will need to confirm that the two configuration options are checked in the “Media Quality Confidence Score (MQCS) settings” section of your MediaPackage V2 channel. These options are available only when you select CMAF as Input type. They should be activated by default on all channels leveraging CMAF Ingest, and can be disabled if needed through the API/SDK or console.

This screen capture shows the MediaPackage channel configuration options to enable input switching based on MQCS and to signal the MQCS score in the output CMSD headers.

Figure 2: AWS Elemental MediaPackage CMAF configuration.

Quality-based input failover in MediaPackage

By default, MediaPackage selects the healthiest ingest pipeline, meaning the most complete and latency-free in terms of segment availability. When the “Enable input switch based on MQCS” option is activated and when both ingest pipelines are healthy, MediaPackage will also take the incoming MQCSs into consideration. It will switch to the ingest pipeline presenting the best scores across all ingested renditions, down to the granularity of a single ingest segment duration. That allows MediaPackage to discard defective segments from the outputs, using a just-in-time packaging quality approach.
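The selection rule described above can be sketched in a few lines of Python. This is an illustration of the described behavior, not MediaPackage's implementation: the `IngestSegment` type, its field names, and the tie-breaking rule (stay on the current pipeline when scores are equal) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class IngestSegment:
    pipeline: str   # hypothetical pipeline label, for example "pipeline-1"
    healthy: bool   # segment arrived complete and on time
    mqcs: int       # 0-100 score reported at ingest for this segment


def select_pipeline(candidates, current):
    """Pick the pipeline to package from for one segment duration."""
    healthy = [c for c in candidates if c.healthy]
    if not healthy:
        return None  # nothing usable for this segment
    best = max(c.mqcs for c in healthy)
    # Assumption: prefer the current pipeline on a tie to avoid needless switches.
    for c in healthy:
        if c.pipeline == current and c.mqcs == best:
            return current
    return next(c.pipeline for c in healthy if c.mqcs == best)
```

Health is evaluated first and the score second, which mirrors the text: the MQCS is only taken into consideration when both ingest pipelines are healthy.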

MQCS publishing in CMSD

When the “Enable MQCS publishing in Common Media Server Data (CMSD)” option is activated, MediaPackage signals the MQCSs on the output endpoints. It uses multiple CMSD keys for each individual segment (in a similar way to MediaLive), for each track type, and for each sequence. This last variant aggregates and averages the scores of all segments of a given media sequence number. The MQCS Sequence score is leveraged by CloudFront to assess the score of a particular origin. Since the MQCS Sequence scores are aggregated across all renditions, they are more stable than the score of individual segments, which reduces the risk of origin flapping. They also allow CloudFront to make more responsive origin selection decisions, as fewer data points are needed at the CDN level to assess the overall quality score of a particular channel.
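The sequence-level aggregation can be illustrated with a short Python sketch. It assumes a plain arithmetic mean over every rendition's segment for a given media sequence number, which matches the "aggregates and averages" wording above but may differ from the exact computation MediaPackage performs. The tuple layout and rendition labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def sequence_scores(segment_scores):
    """Average per-segment scores by media sequence number.

    `segment_scores` is an iterable of
    (media_sequence_number, rendition, mqcs) tuples.
    """
    by_sequence = defaultdict(list)
    for sequence_number, _rendition, score in segment_scores:
        by_sequence[sequence_number].append(score)
    return {seq: mean(scores) for seq, scores in sorted(by_sequence.items())}
```

A single degraded rendition lowers the sequence score only partially, which is why the aggregated value moves more smoothly than individual segment scores.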

Note: Packaging/origination fees are charged only for GET requests to the media that your MediaPackage is serving. You will not incur origination charges for HEAD requests to your MediaPackage channel, which respond with the MQCS value in the CMSD header. If for some reason you don’t want to surface the quality scores in CMSD for this channel, you should turn off this option.

Support for most of the CMSD keys that an origin can add to its request responses has been added, including: streaming type, streaming format, intermediary identifier, object type, encoded bitrate, held time, object duration and publish time. These CMSD keys carry a lot of value, as they can be leveraged by monitoring systems, QoE platforms, and log processing solutions to refine alerts, log analysis, or root cause analysis with media-centric input parameters.

Amazon CloudFront configuration

For highly resilient live event delivery, follow the configuration to create your CloudFront distribution for live streaming with MediaPackage.

Create Origin group with selection based on quality score

Create two AWS Media Services pipelines in two different regions, and add your Elemental MediaPackage V2 origins in a cross-region deployment to your distribution. You can now create an origin group and configure your failover and selection criteria. To do this:

  1. Open the CloudFront console
  2. Choose the distribution that you want to create the origin group for
  3. On the Origins tab, in the Origin groups window, choose Create origin group
  4. Enter a name for the origin group
  5. Choose the MediaPackage v2 origins and use the up/down arrows to set the priority for the origins—primary and secondary
  6. Choose the HTTP status codes to use as your failover criteria
  7. To use MQAR with this origin group, under Origin selection criteria, select Media quality score for Enabling origin selection based on quality score (see Figure 3)

This screen capture shows the CloudFront origin group configuration required to activate origin selection based on MQCS scores.

Figure 3: CloudFront origin group configuration, MediaPackage origins in different regions, failover, and origin selection criteria enabled.
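If you manage your distribution through the API or an SDK rather than the console, the same origin group can be described as a configuration fragment. The Python sketch below builds a dictionary in the shape of the `OriginGroup` structure of a CloudFront distribution configuration. To the best of my understanding, the API exposes the selection setting as `SelectionCriteria` with the values `default` and `media-quality-based`; treat that, and the origin IDs used here, as assumptions to verify against the current CloudFront API reference.

```python
def build_origin_group(group_id, primary_origin_id, secondary_origin_id,
                       failover_status_codes):
    """Return an OriginGroup-shaped dict with quality-based selection enabled."""
    codes = list(failover_status_codes)
    return {
        "Id": group_id,
        "FailoverCriteria": {
            "StatusCodes": {"Quantity": len(codes), "Items": codes},
        },
        "Members": {
            "Quantity": 2,
            # Order sets priority: the first member is the primary origin.
            "Items": [
                {"OriginId": primary_origin_id},
                {"OriginId": secondary_origin_id},
            ],
        },
        # Assumed field name and value; check the API reference.
        "SelectionCriteria": "media-quality-based",
    }
```

The resulting dictionary would go into the `OriginGroups` items of the distribution configuration you submit when updating the distribution. The origin IDs must match the IDs of the two MediaPackage v2 origins already defined on that distribution.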

Enabling MQAR in CloudFront cache behavior for your MediaPackage channel delivery

When you create a cache behavior for your channel delivery in CloudFront, select the Origin Group you created as your origin for the cache behavior. Copy the path pattern provided when you created the Channel Group in your Elemental Channel group endpoint (see Figure 4).

This screen capture shows the selection of an MQAR-enabled origin group in CloudFront's cache behavior configuration.

Figure 4: Select origin group as your origin in your CloudFront cache behavior.

Now, when your CloudFront distribution forwards the GET request to your primary origin, CloudFront also sends a HEAD request to your secondary origin for each request. CloudFront compares the quality score received from each origin and decides which origin is likely to serve the higher quality. CloudFront uses a back-off time after the score from the primary origin is stable, allowing for a seamless return to fetch the stream from the primary origin. This is done automatically, and no additional configuration is needed.
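The behavior described above (compare scores, switch to the better origin, then return to the primary only after it has been stable for a back-off period) can be modeled with a small state machine. The Python sketch below is a conceptual illustration only. CloudFront's internal logic and its back-off duration are not described in this post, so the class, its parameters, and the exact comparison rule are assumptions.

```python
class OriginSelector:
    """Toy model of quality-based origin selection with a back-off period."""

    def __init__(self, backoff_seconds):
        self.backoff_seconds = backoff_seconds
        self.active = "primary"
        self._primary_ok_since = None  # when the primary stopped being worse

    def observe(self, now, primary_score, secondary_score):
        """Record one pair of scores and return the origin to fetch from."""
        if primary_score < secondary_score:
            # The primary is worse: use the secondary and reset the timer.
            self.active = "secondary"
            self._primary_ok_since = None
        elif self.active == "secondary":
            # The primary has recovered: wait out the back-off before returning.
            if self._primary_ok_since is None:
                self._primary_ok_since = now
            if now - self._primary_ok_since >= self.backoff_seconds:
                self.active = "primary"
                self._primary_ok_since = None
        return self.active
```

The back-off is what prevents flapping in this model: a primary origin that recovers for only a moment and then degrades again never regains traffic, because each new degradation resets the timer.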

Note: CloudFront tracks the MQCS sequence scores for each cache behavior that has an origin group enabled for MQAR. You can use different cache behaviors for different channels with the same origin group. Alternatively, you can create multiple origin groups using the same MediaPackage origins and assign different origin groups to your cache behaviors. Both options allow you to use MQAR with multiple channels. Refer to the CloudFront quotas for the limit on the number of origin groups for each distribution.

Observability

Once the MQAR solution is enabled across MediaLive, MediaPackage and CloudFront, it will automatically orchestrate the in-region and cross-region failovers based on quality scores. Each service provides observability features that can be used together to trace the source of the problem or to validate the origin selection decisions.

CloudFront MQAR log fields

You can use CloudFront near real-time logs to track when CloudFront selects the secondary origin over the primary origin due to quality score changes with these three new fields enabled:

  1. r-host: Emitted for Origin requests, this indicates the domain of the origin server used to serve the object. In case of errors, this will show the last origin attempted. For example: cd8jhdejh6a.mediapackagev2.us-east-1.amazonaws.com
  2. sr-reason: Provides a reason why the origin was selected. It’s empty when a request to the primary origin succeeds. If origin failover occurs, the field will contain the HTTP error code that led to the failover, such as Failover:404 or Failover:502. In case of origin failover, if the retried request also fails and you have not configured custom error pages, then r-status indicates the response of the second origin. However, if you have configured custom error pages along with origin failover, then this will contain the response of the second origin if the request failed and a custom error page was returned instead. If no origin failover occurs but MQAR origin selection occurs, then this will be logged as MediaQuality.
  3. x-edge-mqcs: Indicates the MQCS range (0-100) for media segments that CloudFront retrieved in the CMSD response headers from MediaPackage v2. This field is available for requests matching a cache behavior that has an MQAR-enabled origin group. CloudFront logs this field for media segments that are also served from its cache in addition to origin requests.
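As a starting point for log analysis, the Python sketch below filters real-time log records for quality-driven origin selections. It assumes records are tab-delimited and that you know the order of the fields you selected in your real-time log configuration. The field order and host names in the usage note are made up for illustration.

```python
def media_quality_switches(lines, field_order):
    """Yield one dict per log record whose sr-reason is MediaQuality.

    `field_order` lists the log field names in the order configured for
    your real-time log configuration.
    """
    for line in lines:
        values = line.rstrip("\n").split("\t")
        if len(values) != len(field_order):
            continue  # skip malformed or differently configured records
        record = dict(zip(field_order, values))
        if record.get("sr-reason") == "MediaQuality":
            yield record
```

For example, with `field_order = ["timestamp", "r-host", "sr-reason", "x-edge-mqcs"]`, each yielded record tells you when CloudFront chose an origin because of its quality score, which origin it used (`r-host`), and what score it observed (`x-edge-mqcs`). Records with a `Failover:` reason are left out, which keeps quality-driven selections separate from availability-driven failovers.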

MediaPackage MQAR metrics

When an extended MQCS degradation is detected in CloudFront real-time logs, customers can continue tracing back to the corresponding MediaPackage channel. The deployment region of the channel is visible in the r-host log field, while the channel group and channel are part of the request URL. MediaPackage exposes Amazon CloudWatch metrics allowing you to understand what’s happening for a particular channel:

  • For origination endpoints, the aggregated score is exposed in the ChannelMQCSSequence metric. A more detailed score for each track type is exposed in ChannelMQCS. These two metrics should reflect the score degradation that triggered the origin selection switch, at a per minute level resolution.
  • For ingest endpoints, similar ChannelMQCSSequence and ChannelMQCS metrics will reflect the degradation of the MQCS at the media sequence number level and the segment level respectively, on one ingest endpoint or both. The identification of the problematic ingest endpoint allows you to conduct further investigations at the upstream MediaLive channel level.

MediaLive MQAR log fields and metrics

In MediaLive, the corresponding MQCS degradation is reflected in:

  • CloudWatch Logs: A log message will be emitted whenever the MQCS drops below 80.
  • CloudWatch Metrics: All input parameters to the MQCS calculation are exposed as distinct metrics, as well as the synthesized MQCS for each rendition and the minimum MQCS across all renditions.

On top of these logs and metrics, MediaLive emits CloudWatch Events for SVQ, fill frame insertion, and black frames that can be leveraged for alerting purposes. All MQCS-related alerts are also surfaced in the MediaLive console. You can set CloudWatch alarms based on thresholds for your MediaLive quality indicators or your existing log analysis tool, and correlate quality degradations that trigger an origin switch.

Conclusion

As live media consumption continues to move from broadcast to over-the-top, the need for broadcast-grade resiliency and quality assurance increases every day. We explained how you can leverage Media Quality-Aware Resiliency in AWS Media Services and Amazon CloudFront to automate your “eyes-on-glass” operations and minimize the duration of disruption events.

Contact an AWS Representative to learn how we can help accelerate your business.

Tal Shalom

Tal Shalom is a Sr. Solutions Architect helping companies accelerate their adoption of cloud-based solutions.

Nicolas Weil

Nicolas Weil is a Principal Product Manager for AWS Elemental.