AWS Partner Network (APN) Blog
How hc1 Turns Lab Data into Personalized Health Insights Using AWS Serverless
By Gokhul Srinivasan, Sr. Partner Solutions Architect, Startup – AWS
By Whitney Wilger, Sr. Data Engineer – hc1
Since 2011, hc1 has emerged as a bioinformatics leader in precision testing and prescribing. The hc1 platform built on Amazon Web Services (AWS) organizes live data, including lab results, genomics, and medications, to deliver solutions ensuring the right patient gets the right test and right prescription at the right time.
Most hc1 customers are healthcare systems and independent laboratories that store data across disparate systems. An AWS Healthcare Competency Partner, hc1 ingests, organizes, and normalizes customer data to deliver analytics and improve operations management. As a result, customers can use these insights to their fullest potential.
hc1 achieves this using the hc1 Lab Insights Platform, which includes:
- hc1 Operations Management: Streamlines multiple areas of laboratory operations, from sales activities to customer and patient relationships to operations initiatives.
- hc1 Analytics: Provides automated reporting and key performance indicator (KPI) tracking in real time.
The data from the above solutions are classified into account, provider, and patient profiles that streamline complex healthcare relationships. In addition, each profile contains lab data attributes comprising orders, results, cases, tasks, and memos.
Prior State
hc1 used a Pentaho-based solution to process customer data and create the analytics and reports. The Pentaho data integration suite and user interface components were deployed across Amazon Elastic Compute Cloud (Amazon EC2) instances, which caused the following operational challenges:
- Infrastructure was fragile and required manual management. The deployment and change management could have been more programmer-friendly.
- Architecture did not support integration with AWS CloudFormation to improve the DevOps efficiency and build an automation pipeline.
- Data was spread across several data systems, creating monolithic data silos:
  - MySQL on Amazon EC2: Transactional data from the hc1 CRM platform.
  - MySQL on Amazon Aurora: Transactional data from the laboratory information system.
  - Postgres on Amazon EC2: Audit data across all hc1 platforms.
  - Postgres on Amazon Aurora: FHIR HL7 messages.
- MySQL is the primary source for this process; however, the lake stores data from MySQL, Postgres, and Amazon DynamoDB.
Solution
The approach was to build a multi-tenant, scalable architecture that addressed these operational challenges while improving ownership and accountability. After evaluating options, hc1 moved to a next-generation architecture powered by AWS serverless services, such as AWS Glue and AWS Lake Formation.
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Lake Formation is a fully managed service that helps build, secure, and manage data lakes, and provides fine-grained access control for data in the data lake.
Alternative approaches were costly and created management overhead, while also requiring dedicated EC2 instances and 24/7 support. AWS Glue consumes resources only when called upon, delivering faster data transfer and less lag time at a lower cost.
For hc1’s internal data teams, the resulting Amazon Simple Storage Service (Amazon S3)-based data lake provides a centralized repository to source ML initiatives. AWS Glue enables hc1 to offer a more robust product to deliver data daily and in a timely fashion.
Architecture
The source MySQL and Amazon Aurora databases are multi-tenant, storing data across all customers. The customer data is stored in distinct tables without overlap, and the architecture breaks the process into independent processing segments and isolates the blast radius.
The serverless architecture, shown in Figure 1, is split into three groups.
Raw Data Generation
This step classifies and extracts data from the source databases and moves it into Amazon S3. It uses AWS Glue crawlers to scan data across the databases, extract schema information, and store the metadata in the AWS Glue Data Catalog.
AWS Glue Data Catalog stores the customer metadata and uses permissions from Lake Formation to safely publish data while protecting data access in a granular manner. This helps track the schema changes and build a comprehensive audit and governance process.
There are five AWS Glue extract, transform, load (ETL) jobs that transform the data and produce the raw output files in Parquet format:
- AWS Glue schema sync: Keeps Snowflake databases (analytics store) in sync with the MySQL source.
- Full load: Loads the entire customer table.
- Incremental load: Loads the incremental changes from the customer table.
- Dynamic full load: Loads the entire user-defined table for the customer.
- Dynamic incremental load: Loads the incremental changes from the user-defined table.
These purpose-built AWS Glue jobs isolate the flow to efficiently handle different business scenarios. In addition, some user-defined tables are large. The dynamic load jobs handle this volume independent of the incremental and full load jobs.
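To make the separation of the four load jobs concrete, here is a minimal sketch of how a table might be routed to one of them. It assumes that the choice depends on whether the table carries a change-tracking column and whether it is user-defined. The column names, the `TableInfo` type, and the job names are hypothetical and are not taken from hc1's implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical names of columns that record when a row last changed.
CHANGE_TRACKING_COLUMNS = {"updated_at", "modified_date", "last_modified"}


@dataclass(frozen=True)
class TableInfo:
    name: str
    columns: Tuple[str, ...]
    user_defined: bool = False  # True for customer-created tables


def choose_load_job(table: TableInfo) -> str:
    """Return the (hypothetical) name of the load job for this table."""
    has_tracking = any(c.lower() in CHANGE_TRACKING_COLUMNS for c in table.columns)
    mode = "incremental" if has_tracking else "full"
    prefix = "dynamic_" if table.user_defined else ""
    return f"{prefix}{mode}_load"


tables = [
    TableInfo("orders", ("id", "status", "updated_at")),
    TableInfo("lookup_codes", ("id", "code")),
    TableInfo("custom_panel", ("id", "value", "Modified_Date"), user_defined=True),
]
for t in tables:
    print(t.name, "->", choose_load_job(t))
```

Keeping the routing rule in one small function means large user-defined tables can be sent to their own jobs without touching the logic for standard customer tables.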
At the end of this step, the customer-specific raw data is isolated based on Lake Formation access control and moved to respective S3 buckets for data curation.
Data Curation
This process helps with the organization and integration of the raw data. The transformation provides a meaningful way to store reporting data by pivoting columns to rows. This process is decoupled using AWS Lambda, Amazon Simple Notification Service (SNS), and Amazon Simple Queue Service (SQS). The decoupling helps hc1 with the needed agility to support frequent customer changes and scalability to onboard new customers.
AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. SNS is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication. SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
Amazon S3 event notifications are triggered when raw data files are added to a specific S3 bucket. The notification configuration identifies the events and notifies SNS, and the SNS topic sends the message to the subscribed SQS queues.
You can use a Lambda function to process messages in an SQS queue. Lambda polls the queue and invokes your function synchronously with an event that contains queue messages. You can specify another queue to act as a dead-letter queue for messages that your Lambda function can’t process.
This process splits into a curation sequence and a transformation sequence, and the SNS topic notifies the respective SQS queue. Apart from the queue, each sequence contains a Lambda function and a dead-letter queue for messages that the Lambda function can't process.
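The S3 to SNS to SQS to Lambda flow described above can be sketched as a handler that unwraps each layer of the message. This is an illustration rather than hc1's code. It assumes SNS raw message delivery is disabled (so the S3 event sits inside the SNS `Message` field) and that the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled. The `curate` function and the sample bucket and key names are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_record: dict) -> list:
    """Unwrap SQS -> SNS -> S3 and return (bucket, key) pairs."""
    sns_envelope = json.loads(sqs_record["body"])   # SNS notification JSON
    s3_event = json.loads(sns_envelope["Message"])  # original S3 event
    return [
        # Object keys arrive URL-encoded in S3 event notifications.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in s3_event.get("Records", [])
    ]


def curate(bucket: str, key: str) -> None:
    """Hypothetical placeholder for the real curation logic."""
    if not key.endswith(".parquet"):
        raise ValueError(f"unexpected file type: {key}")


def handler(event: dict, context: object = None) -> dict:
    failures = []
    for record in event["Records"]:
        try:
            for bucket, key in extract_s3_objects(record):
                curate(bucket, key)
        except Exception:
            # Report only this message as failed so the rest of the batch is
            # deleted; SQS redelivers it and eventually moves it to the
            # dead-letter queue according to the queue's redrive policy.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def make_record(message_id: str, bucket: str, key: str) -> dict:
    """Build a minimal SQS record for local testing."""
    s3_event = {"Records": [{"s3": {"bucket": {"name": bucket}, "object": {"key": key}}}]}
    return {"messageId": message_id, "body": json.dumps({"Message": json.dumps(s3_event)})}


event = {"Records": [
    make_record("m-1", "raw-bucket", "customer-a/orders/part-0.parquet"),
    make_record("m-2", "raw-bucket", "customer-a/orders/notes.txt"),
]}
print(handler(event))  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```

Returning only the failed message IDs keeps one bad file from forcing a retry of the whole batch.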
The curated and transformed files are then stored in separate S3 buckets. Curated files contain data from normalized tables, while transformed files contain changes to those tables such as denormalization and pivots.
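As an illustration of the column-to-row pivot mentioned in this section, the sketch below turns one wide record into one narrow record per measured attribute. The field names and values are made up for the example, and the real transformation would run over Parquet files rather than Python dictionaries.

```python
def unpivot(row: dict, id_fields: tuple) -> list:
    """Turn one wide row into one narrow row per non-identifier column."""
    ids = {k: row[k] for k in id_fields}
    return [
        {**ids, "attribute": name, "value": value}
        for name, value in row.items()
        if name not in id_fields and value is not None  # skip empty cells
    ]


# Made-up sample record: two identifier fields plus three result columns.
wide = {"order_id": 101, "patient_id": "p-7", "glucose": 5.4, "sodium": 140, "potassium": None}
for narrow in unpivot(wide, id_fields=("order_id", "patient_id")):
    print(narrow)
```

The narrow shape lets reporting queries filter on `attribute` instead of requiring a schema change each time a new result column appears.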
Curation to Snowflake
The final step in this process employs an AWS Glue job, CuratedToSnowflake, which creates the report. The job ingests the files from the curated and transformed S3 buckets and produces the report data for lab insights.
The data is pushed to Snowflake through the Snowflake admin API and a client database inside Snowflake.
Figure 1 – hc1 data lake ingestion architecture.
The AWS Glue jobs support custom data movement and improved operational stability. The process splits data into batches and uses AWS Glue bookmarks, which help hc1 maintain state information, persist previous state supporting idempotent transactions, and prevent reprocessing of old data.
Amazon DynamoDB is a fast, flexible NoSQL database service for single-digit millisecond performance at any scale. The architecture uses DynamoDB to store AWS Glue bookmarks and processing status at the table and database source level.
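The bookmark idea can be sketched in a few lines: remember the last position processed for each source, process only newer rows in batches, and advance the bookmark after each batch so that a rerun does no duplicate work. This is not the AWS Glue bookmark API. The `BookmarkStore` class is an in-memory stand-in for the DynamoDB table, and the monotonically increasing `id` column is an assumption.

```python
from typing import Dict, Iterable, List


class BookmarkStore:
    """In-memory stand-in for the DynamoDB table that holds bookmarks."""

    def __init__(self) -> None:
        self._items: Dict[str, int] = {}

    def get(self, source: str) -> int:
        return self._items.get(source, 0)

    def put(self, source: str, value: int) -> None:
        self._items[source] = value


def load_incremental(source: str, rows: Iterable[dict], store: BookmarkStore,
                     batch_size: int = 2) -> List[dict]:
    """Process rows newer than the bookmark, advancing it after each batch."""
    last_seen = store.get(source)
    pending = sorted((r for r in rows if r["id"] > last_seen), key=lambda r: r["id"])
    processed: List[dict] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        processed.extend(batch)             # stand-in for writing the batch out
        store.put(source, batch[-1]["id"])  # commit progress only after the batch succeeds
    return processed


store = BookmarkStore()
rows = [{"id": i} for i in range(1, 6)]
first = load_incremental("customer_a.orders", rows, store)
second = load_incremental("customer_a.orders", rows, store)  # rerun: nothing new
print(len(first), len(second), store.get("customer_a.orders"))  # 5 0 5
```

Because the bookmark moves only after a batch completes, a failure mid-run resumes from the last completed batch instead of starting over.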
The architecture uses AWS CloudFormation and AWS Serverless Application Model (AWS SAM) to build the serverless application. CloudFormation lets hc1 model, provision, and manage AWS and third-party resources by treating infrastructure as code. SAM provides shorthand syntax to express functions, APIs, databases, and event source mappings.
Using a data-driven, loosely coupled architecture, hc1 isolated the operation of the upstream and downstream platforms. The architecture is built on top of an existing application, avoids data duplication, and ensures high standards for data security and governance. This reduces overall friction for data flow within the hc1 platform.
Outcomes
Overall, the architecture adds better logging and alerting, decreases the blast radius, and improves resilience by breaking the process into three stages instead of one monolithic step. The outcome is a single-tenant software-as-a-service (SaaS) offering with one tenant per customer. The AWS Glue jobs are deployed in each customer tenant, and Lake Formation is multi-tenant, supporting all customers.
Through this architecture, hc1 modernized its data platforms with AWS-native technologies that are highly scalable, feature-rich, and cost-effective. This approach enables hc1 internal teams to operate autonomously while providing central data discovery, governance, and auditing of the upstream and downstream applications.
hc1 can also integrate faster, implement efficiently, and quickly scale to meet internal and customer demands. This approach enables governance and easy data movement adhering to compliance and regulatory policies. Using the serverless architecture, hc1 avoided data loss, improved data sharing, improved security, and increased return on investment (ROI). This allows hc1 to turn lab data into personalized healthcare insights with speed and agility at scale.
The new architecture gives hc1 three distinct, yet related, outcomes:
- Multi-modal analysis
- Lab insights
- Data security, governance, and compliance
Serverless Advantage
The architecture helps hc1 scale to thousands of active customers and focus on customer outcomes and quality improvement.
The multi-Availability Zone (multi-AZ) implementation makes the solution highly available, fault tolerant, and scalable, while keeping operational cost in line with actual usage.
Key advantages include:
- Scalability: The architecture handles over 71 TB of data across multiple customers.
- Resilience: The Pentaho process moved data from source to destination in one giant step. The new architecture breaks out the steps and provides easy recovery with a multi-AZ implementation.
- Operational improvement: The Pentaho process began with full loads only, switching to differential loads only when full loads could no longer complete for large customers. The new approach selects the incremental and full load types based on the columns and further separates the dynamic tables.
- Eliminate dependency: The workload uses built-in AWS service integration and avoids dependency on third-party platforms, training, and upgrades. It removes management overhead with EC2 and Pentaho software.
- Cost optimization: With a pay-as-you-go model, hc1 optimized cost and never had to over-provision resources. This saving is in addition to the cost reduction from eliminating third-party licenses.
- Eliminate provisioning delays: Now hc1 can scale and add more customers without capacity planning and provisioning delays.
- Audit: Access controls are predefined using AWS Lake Formation and deployed together with AWS Glue changes. This simplifies HIPAA and HITRUST auditing, and creates visibility into audit data.
Customer Benefits
The ability to run frequent, incremental loads helps hc1 meet the runtime service-level agreement (SLA), thus improving customer satisfaction. The overall solution helps hc1 activate customers within a shorter duration, at a lower cost, and with an improved customer experience.
Lab diagnostic data and operational metrics often reside in several different isolated systems. Customers now have the ability to generate automated quality reports and key performance indicators (KPI) in real time, eliminating delays.
For hc1, this architecture provides a repeatable blueprint to integrate new domains and applications. Customers can also design and use user-defined fields and tables to add customer-specific data. Separate AWS Glue jobs support this customer-defined data processing. Lab insights are delivered faster in near real-time, helping labs innovate and deliver analytics-driven outcomes, improving patient health.
Customers enjoy the flexibility of user-defined tables, a necessary step forward from the previous process. This empowers customers by reducing their dependency on product feature development. At present, the architecture handles over 117 GB of user-defined data, and this volume will continue to grow with increased customer adoption.
Conclusion
The AWS serverless architecture adopted by hc1 enhances its customer experience, delivering personalized health insights. The approach helps hc1 to aggregate data from monolithic silos and improve efficiency through AWS CloudFormation and a DevOps automation pipeline.
AWS Glue and AWS Lake Formation break the process into independent and resilient processing units and isolate the blast radius, improving platform reliability. Building on this foundation, hc1 can drive more innovation and analytics-driven solutions.
To learn more about how hc1 can help healthcare professionals transform lab data into personalized healthcare insights, visit the hc1 website.