AWS Business Intelligence Blog
Use Amazon QuickSight level-aware calculations to analyze COVID-19 datasets
You can use the advanced functionalities in Amazon QuickSight to analyze data at different dimensions and get granular, actionable insights from your data. QuickSight also enables you to achieve this without having to worry about the complexity with data preparation. With QuickSight level-aware calculations (LAC), users including business analysts, data scientists, and decision-makers can dynamically aggregate and analyze values at a granularity other than what is displayed at the visual level. You can specify the level of granularity for the calculations and use different LAC functions to further analyze your dataset across different granularities and reduce time spent on preprocessing the data, such as in data deduplication use cases.
In this post, we analyze COVID-19 datasets and use the QuickSight LAC feature, specifically LAC-A functions, to gain a better understanding of the impact of the virus across different regions in the United States, and explore relationships between demographic groups, gender, age groups, symptom statuses, and more, and their impact against COVID-19 cases.
Level-aware calculations
These are two types of LAC functions: level-aware calculation – aggregate (LAC-A) and level-aware calculation – window (LAC-W).
LAC-A functions are newly launched functions that act as an extension to regular expressions such as count
, sum
, min
, max
, and average
. LAC-A functions don’t require you to learn new functions; instead, it requires the addition of a grouping level as the additional attribute. When performing calculations at the granularity specified, LAC-A functions create an internal result set at the grouping level and handle the visual-level aggregation implicitly. Therefore, when the sum function is applied, you don’t have to worry about duplication or double counting. See the following example:
LAC-W functions are an existing group of window functions, such as sumover()
, maxover()
, and minover(),
that require partition and calculation levels in the function. With LAC-W functions, the results from the PRE_AGG and PRE_FILTER components change at which step the LAC-W functions work so that users can leverage filters or perform it at the raw data level. Duplications may occur in this case and it’s important to resolve the duplication such as using the min
aggregation function. The following is an example syntax:
When choosing when to use LAC-A vs. LAC-W functions, you need to consider the QuickSight order of evaluation, or the order of when different granularities are being calculated. Depending on the use case, you might use a LAC-A function to perform aggregation at the end or use a LAC-W pre-aggregation function to perform a calculation at an earlier stage. For more information, refer to Create advanced insights using level-aware calculations in Amazon QuickSight.
In the following sections, we demonstrate how to analyze COVID-19 data with LAC-A functions in different use cases.
Dataset 1: COVID-19 Case Surveillance Public Use Data
In the first four examples, we use the COVID-19 Case Surveillance Public Use Data from the CDC, which contains data including the county and state of residence, gender, demographics, the month that the case was reported, and other patient-level data, as shown in the following screenshot.
Use case 1: Compare cases across states and gender vs. total cases across all genders
In this example, we want to better understand the distribution of COVID-19 cases for females or males in a state compared to the total case count for all genders across the states. The options for the gender column include Female, Male, Missing, NA, Other, and Unknown. We first create a visual-level aggregation to show the groupings of the fields at the visual level, then we analyze it with a LAC-A function that shows the groupings at the calculation level.
For the visual-level aggregation, we select the dimensions state (res_state
) and gender (sex
). We also select the measure case_month
and we are using it as a count field because it is non-nullable and we are counting the rows of case instances with (count(case_month)
). We also sort the table by state to see the number of cases across each gender category, as shown in the following screenshot. For the example with Alaska (AK), we see that the distribution of female and male case counts is relatively similar, with 139,592 total cases vs. 143,109 cases, respectively. We also see that 18,596 cases don’t have the gender information available, under NA.
For the LAC-A function, we want to view the total COVID-19 case counts grouped by the resident state only, to view the difference in the output with the visual-level aggregation. We create the new LAC-A function to group the total cases by res_state
:
Because the groupings for the LAC-A function are fixed and defined in the function, the calculation process doesn’t consider the grouping of gender at the visual level to provide different results, as shown in the following screenshot. With the LAC-A function, the total count of cases in Alaska is 302,397, and this is the same across every gender option. This allows us to compare the case count for each gender category to the total number of cases.
We also add an additional calculation to find the percentage proportion of the count of cases across each gender category compared to the total case counts in that state. We add the sum function to be able to perform the division as QuickSight doesn’t support custom functions with both aggregated and non-aggregated fields. In Alaska, we see the percentage case count of female to be 46.16% compared to the total number of cases.
When focusing only on the cases for California (CA) in a pie chart, we can see that 52% of the total case counts belongs to the gender female and 2% are shown as unknown.
Use case 2: Analyze total case counts in each state grouped by age group with total cases observed for each age group
In the following example, we first find the total case counts grouped by each state and age group by dragging the attributes into the field wells for the visual-level aggregation. To find the COVID-19 cases in each age group regardless of the state that they are in, we add the following LAC-A function:
We also calculate the percentage of cases in each age category within a state to the total number of cases in each age category nationwide. We again add the sum function to be able to perform the division since it contains both aggregated and non-aggregated fields.
count({case_month})/ sum(count({case_month}, [{age_group}]))
Upon adding this calculated field, we sort the table by states, as shown in the following screenshot, and focusing on Massachusetts (MA), Maryland (MD), and Maine (ME). We can see an overview of cases in each state grouped by age group, along with the total case counts in each age group in the US and their percentage proportion. In Massachusetts, the number of cases for patients that are between 18–49 years old shows the highest proportion of cases, with 2.25%, 1,135,614. This is in comparison to a total of 50,395,252 cases across all states in the same age group. Maryland and Maine show a similar trend, with the largest number of cases in the 18–49 age group but demonstrate a relatively smaller percentage of cases. Maryland represents 1.62% (817,449) and Maine represents 0.29% (144,762) compared to the total case count.
Use case 3: Compare cases across states with different symptom statuses
With the COVID-19 dataset, there is a distribution of cases across states with patients reporting four different symptom status types: Asymptomatic, Symptomatic, Missing, and Unknown. To compare the number of cases across individuals reporting a symptomatic or asymptomatic status for their COVID-19 record vs. the other symptom statuses, we use two nested LAC-A functions with conditions.
For the first function, we compare the symptomatic cases with the case counts across other symptom status. We use the ifelse statement to specify when the symptom_status
is equal to Symptomatic, return the value of the count of case_month
, and group by res_state
. If the symptom_status
is not equal to Symptomatic, then return null. See the following code:
For the visual-level aggregation, we group the visual by res_state
and symptom_status
and add the count of case_month
as a value in addition to the LAC-A function we just created. When focusing only on the cases in New York (NY) and New Jersey (NJ), we can see that in New York, there are more cases recorded as Missing or Unknown for symptom status compared to the 2,047,961 symptomatic cases. In New Jersey, the number of asymptomatic cases is significantly lower at 304 cases compared to the total symptomatic case count of 1,317,894.
To perform the same type of analysis against cases that are shown as asymptomatic, we use a second LAC-A function with conditions:
When we apply it to the same states, we can now see the large proportion of cases that are recorded as Symptomatic, Missing, or Unknown compared to the cases that are recorded as Asymptomatic in New York.
Use case 4: Calculate median and average case reporting intervals across states
In this analysis, we will use the average case reporting interval case_positive_specimen_interval
, which is the time in weeks when the specimen is collected and the time when there is a positive test confirmation. We will compare the efficiency of the COVID-19 testing process across different states with the counties within these respective states. States and counties with lower intervals indicate a more efficient and rapid testing and reporting process. Comparatively, higher intervals could signal potential delays or inefficiencies in the testing pipeline. These insights are crucial for public health officials and policymakers in directing resources and efforts to improve testing and reporting infrastructure where needed.
To get the comparison between the average case reporting interval across the different states grouped by county, we will first select at the visual level, the dimensions county (res_county
), state (res_state
), and the average measure of the reporting interval (case_positive_speciman_interval
). We then add the following LAC-A function below to only aggregate the results based on states.
In our example, we are focusing only on the metrics for Florida and we are also sorting the average reporting interval in a descending fashion to show the counties with the longest case reporting interval. We can see that the average reporting interval in Florida is 0.153 weeks. However, in Taylor, the average interval is 1.003 weeks, which is the highest across all the other counties in Florida. On the other hand, the county with the lowest reporting interval is Charlotte with a value of 0.016 weeks. We also add the conditional formatting that highlights counties with reporting intervals greater than the state average, 0.153 as red and less than or equal to 0.153 as green.
Dataset 2: COVID-19 US states and demographics
To demonstrate other ways of using LAC-A functions, we use the us-states dataset and us-demographics dataset, which include up-to-date COVID-19 total cases and deaths across the US states, demographic groups, and hospitalization statuses. We have joined the two datasets as shown in the following screenshot, which shows the accumulated data across different dates for each of the states.
Use case 5: Deduplicate multiple entries of cases
Because the dataset contains multiple entries of total cases for each state that were recorded at different dates, we want to ensure that we’re getting an accurate total case count per state without duplicating data for the states.
The visual-level aggregation cases(Sum)
shows duplicate data that includes multiple entries from each of the states because the cases
field is a running sum of cases across all time in the dataset. In this case, we can use the LAC function that includes a max function to group the maximum value of total cases by states. This removes any duplicates without having to perform preprocessing of the data in advance. See the following code:
Use case 6: Compare state and demographic case averages
Our goal in this analysis is to gain insight into the pandemic’s disparities and broad impacts by assessing average case numbers with the total case count numbers. We compare the average case numbers of each demographic group at the state level, the total demographic case count, the overall average at the state level, and the total national case count.
The first step is to determine the average number of cases grouped by states. To do so, we calculate the maximum cases from each state and average them:
We then calculate the average number of cases for each demographic group. We accomplish this by identifying the maximum of each demographic, grouping them by state, and then calculating their average. We start with the average cases for the black demographic group in the following example and repeat that all for demographic groups:
After we’ve used the LAC-A function to calculate the average, we add the fields case_demographic
and us_cases
, then aggregate both using the max aggregation method, which allows us to show the highest case values in comparison to the LAC-A averages. In the following screenshot, we can now view the average_case for each demographic against each other and compared to the average cases in the US.
Conclusion
As we navigate through the complex landscape of data analytics to understand the impact of the COVID-19 pandemic, level-aware calculations in QuickSight have emerged as a pivotal tool in our toolbox. In this post, we dove deep into the multifaceted applications of LAC-A functions to illustrate how you can efficiently aggregate and compute data at various levels aside from the visual level. This enables the generation of comprehensive, nuanced insights that are frequently overlooked in conventional analysis approaches. The QuickSight LAC-A functions were particularly remarkable due to their capacity to transform intricate data into practical and understandable insights, all while avoiding the need for complicated data preprocessing.
Throughout the post, we discussed multiple use cases where LAC-A functions provide significant insights, such as producing comparative analyses of COVID-19 cases across resident states, county, gender, age group, and demographics at different levels of aggregation. We also analyzed the disparities in COVID-19 testing and reporting efficiencies between states to inform public health strategies and demonstrated the ability to deduplicate data.
In conclusion, QuickSight level-aware calculations, particularly LAC-A functions, empower you to dive deeper into your datasets and uncover insights that drive informed decisions by analyzing them at different granularities. You can also use Amazon Q in QuickSight to add generative business intelligence (BI) capabilities to automatically build, discover, and share actionable insights and narratives using intuitive natural language.
If you’re interested in applying QuickSight level-aware calculations for your use case, reach out to your AWS account team. You can also visit the QuickSight Community to ask, answer, and learn with others about QuickSight, and explore the newest features and resources.
About the Authors
Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.
Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech(Government Technology) customers. He is passionate about data analytics, machine learning, robotics, and using his skills to create secure and scalable cloud architectures for his customers. Joel holds 9 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.