AWS Big Data Blog

How to build a front-line concussion monitoring system using AWS IoT and serverless data lakes – Part 2

August 2024: This post was reviewed and updated for accuracy.

In part 1 of this series, we demonstrated how to build a data pipeline in support of a data lake. We used key AWS services such as Amazon Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Firehose, and AWS Lambda. In part 2, we discuss how to process and visualize the data by creating a serverless data lake that uses key analytics to create actionable data.

Create a serverless data lake and explore data using AWS Glue, Amazon Athena, and Amazon QuickSight

As we discussed in part 1, you can store heart rate data in an Amazon S3 bucket using Kinesis Data Streams. However, storing data in a repository is not enough. You also need to be able to catalog and store the associated metadata related to your repository so that you can extract the meaningful pieces for analytics.

For a serverless data lake, you can use AWS Glue, which is a fully managed data catalog and ETL (extract, transform, and load) service. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. After the data in your AWS Glue Data Catalog is partitioned and compressed for optimal performance, you can use Amazon Athena to query the S3 data directly. You can then visualize the data using Amazon QuickSight.

The following diagram depicts the data lake that is created in this demonstration:

Amazon S3 now has the raw data stored from the Kinesis process. The first task is to prepare the Data Catalog and identify what data attributes are available to query and analyze. To do this task, you need to create a database in AWS Glue that will hold the table created by the AWS Glue crawler.

An AWS Glue crawler scans through the raw data available in an S3 bucket and creates a data table with a Data Catalog. You can add a scheduler to the crawler to run periodically and scan new data as required.

Follow these steps to create a database and a crawler in AWS Glue:

  1. In the AWS Glue console, choose Databases, and then choose Add database
  2. Give the database a name and choose Create database

Next, create an AWS Glue crawler, which will create a table in the database that you created above. A table consists of the names of columns, data type definitions, and other metadata about a dataset.

  1. In the left navigation pane of the AWS Glue console, choose Crawlers, and then choose Create crawler

After configuring the crawler, choose Finish, and then choose Crawlers in the navigation pane. Select the crawler that you created, and choose Run crawler.
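If you prefer to script this setup instead of using the console, the following is a minimal boto3 sketch of the same database and crawler creation. The database name, IAM role, and S3 path are placeholders for illustration, not values from this demonstration:

    import boto3

    glue = boto3.client("glue")

    # Create the database that will hold the table produced by the crawler
    glue.create_database(DatabaseInput={"Name": "heartrate_db"})

    # Create a crawler that scans the raw data and optionally runs on a schedule
    glue.create_crawler(
        Name="heartrate-raw-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
        DatabaseName="heartrate_db",
        Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/heartrate/"}]},
        Schedule="cron(0/10 * * * ? *)",  # scan for new data every 10 minutes
    )

    # Run the crawler once on demand
    glue.start_crawler(Name="heartrate-raw-crawler")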

The crawler run can take 20–60 seconds to complete, depending on the amount of data being cataloged, and it creates a table in your database as defined during the crawler configuration.

To view the created table, select Tables in the AWS Glue console's navigation pane. You can choose the table name and explore the Data Catalog and table:

In the demonstration table details, our data has three attributes – time stamp as event_time, the person’s ID as deviceid, and the heart rate as heartrate. These attributes are identified and listed by the AWS Glue crawler. You can see other information such as the data format (text) and the record count (approx. 15,000 with each record size of 76 bytes).

You can use Athena to query the raw data. To access Athena directly from the AWS Glue console, choose the table, and then choose View data on the Actions menu, as shown following:

As noted, the data is currently in JSON format and we haven't partitioned it. This means that Athena scans all of the data for every query, which increases the query cost. The best practice is to always partition data and to convert the data into a columnar format like Apache Parquet or Apache ORC. This reduces the amount of data scanned while running a query. Scanning less data means better query performance at a lower cost.

To accomplish this, AWS Glue generates an ETL script for you. You can schedule it to run periodically for your data processing, which removes the necessity for complex code writing. AWS Glue is a managed service that runs on top of a warm Apache Spark cluster that is managed by AWS. You can run your own script in AWS Glue or leverage the power of generative AI with Amazon Q to generate an ETL script. Follow these steps to generate an ETL script using Amazon Q:

  1. Select ETL jobs in the AWS Glue console's left navigation pane
  2. Choose Script editor under the Create job panel
  3. Choose Create script in the pop-up window
  4. In the Script editor view, choose Amazon Q in the upper-right corner
  5. Next, we will use simple zero-shot prompting to generate an ETL script
  6. Here’s the complete generated script:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    # Script generated for node S3DataSource
    S3DataSource_dsource1 = glueContext.create_dynamic_frame.from_options(
        format_options={},
        connection_type="s3",
        format="json",
        connection_options={"paths": ["<my-path>"]},
        transformation_ctx="S3DataSource_dsource1",
    )

    # Script generated for node S3DataSink
    S3DataSink_dsink1 = glueContext.write_dynamic_frame.from_options(
        frame=S3DataSource_dsource1,
        connection_type="s3",
        format="parquet",
        connection_options={"path": "<my-path>", "partitionKeys": []},
        transformation_ctx="S3DataSink_dsink1",
    )
    
    job.commit()
  7. Copy this script to the script panel and update <my-path> in both S3DataSource and S3DataSink to specify the S3 paths for the source and target buckets
  8. We will need an IAM role to use in our ETL job. To do this, create an IAM role with the policies mentioned below. These policies allow the AWS Glue job to access the source and target buckets (a boto3 sketch of these steps follows this list)
  9. Go to the Job details tab, give the job a name, and select the IAM role created above.
  10. Also enable Job bookmark. Enabling the job bookmark helps AWS Glue maintain state information and prevents the reprocessing of old data. You only want to process new data when rerunning on a scheduled interval.
  11. Save and choose Run job.
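For reference, the following is a minimal boto3 sketch of steps 8 through 11: it creates a role that AWS Glue can assume, attaches an inline policy for the source and target buckets, and then creates and runs the job with job bookmarks enabled. All names, ARNs, and S3 paths are placeholders for illustration, and the sketch omits the additional AWS Glue service permissions (for example, CloudWatch Logs) that a production role would need:

    import json
    import boto3

    iam = boto3.client("iam")
    glue = boto3.client("glue")

    # Trust policy so the AWS Glue service can assume the role
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }
    iam.create_role(
        RoleName="GlueHeartrateEtlRole",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Inline policy: read from the source bucket, write to the target bucket
    s3_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["s3:GetObject", "s3:ListBucket"],
             "Resource": ["arn:aws:s3:::my-raw-bucket",
                          "arn:aws:s3:::my-raw-bucket/*"]},
            {"Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": ["arn:aws:s3:::my-parquet-bucket/*"]},
        ],
    }
    iam.put_role_policy(
        RoleName="GlueHeartrateEtlRole",
        PolicyName="HeartrateS3Access",
        PolicyDocument=json.dumps(s3_policy),
    )

    # Create the job with job bookmarks enabled, pointing at the script uploaded to S3
    glue.create_job(
        Name="heartrate-json-to-parquet",
        Role="GlueHeartrateEtlRole",
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-scripts-bucket/heartrate_etl.py",
                 "PythonVersion": "3"},
        DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
        GlueVersion="4.0",
        WorkerType="G.1X",
        NumberOfWorkers=10,
    )

    glue.start_job_run(JobName="heartrate-json-to-parquet")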

The job takes time to complete, depending on the amount of data and the number of data processing units (DPUs) configured. By default, a job is configured with 10 DPUs, which can be increased. A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.

After the job is complete, inspect your destination S3 bucket, and you will find that your data is now in columnar Parquet format.

Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For information about efficiently processing partitioned datasets using AWS Glue, see the blog post Work with partitioned data in AWS Glue.
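For example, partitioning the Parquet output of the earlier job only requires listing partition columns in the sink's partitionKeys option. The following sketch partitions by deviceid; that column choice is an assumption for illustration and is not part of the generated script, which leaves partitionKeys empty:

    # Variation of the S3DataSink node: write the Parquet output partitioned by deviceid
    S3DataSink_dsink1 = glueContext.write_dynamic_frame.from_options(
        frame=S3DataSource_dsource1,
        connection_type="s3",
        format="parquet",
        connection_options={
            "path": "<my-path>",
            "partitionKeys": ["deviceid"],  # produces .../deviceid=<value>/ prefixes
        },
        transformation_ctx="S3DataSink_dsink1",
    )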

You can create triggers for your job that run the job periodically to process new data as it is transmitted to your S3 bucket. For detailed steps on how to configure a job trigger, see Triggering Jobs in AWS Glue.
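As a rough sketch, the same kind of trigger can also be created with boto3; the job name and schedule below are placeholders for illustration:

    import boto3

    glue = boto3.client("glue")

    # Scheduled trigger that runs the ETL job at the top of every hour
    glue.create_trigger(
        Name="heartrate-etl-hourly",
        Type="SCHEDULED",
        Schedule="cron(0 * * * ? *)",
        Actions=[{"JobName": "heartrate-json-to-parquet"}],
        StartOnCreation=True,
    )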

The next step is to create a crawler for the Parquet data so that a table can be created. The following image shows the configuration for our Parquet crawler:

Choose Create crawler, and then run this crawler.

Explore your database, and you will notice that one more table was created in the Parquet format.
You can use this new table for direct queries to reduce costs and to increase the query performance of this demonstration.

Because AWS Glue is integrated with Athena, you will find the AWS Glue Data Catalog already available in the Athena console. Fetch 10 rows from the new Parquet table in Athena, like you did for the JSON data table in the previous steps.
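If you prefer to run the query programmatically, the following is a sketch using boto3 and the Athena API. The database, table, and results-location names are placeholders for illustration; the same SELECT statement can be run as-is in the Athena console:

    import time
    import boto3

    athena = boto3.client("athena")

    query = "SELECT * FROM heartrate_db.heartrate_parquet LIMIT 10"
    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "heartrate_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then print the returned rows
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])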

As the following image shows, we fetched the first 10 rows of heartbeat data from a Parquet format table. This same Athena query scanned only 4.39 KB of data compared to 10.15 KB of data that was scanned in a raw format. Also, there was a significant improvement in query performance in terms of run time.

Parquet data:

JSON data:

Visualize data in Amazon QuickSight

Amazon QuickSight is a data visualization service that you can use to analyze data that has been combined. For more detailed instructions, see the Amazon QuickSight User Guide.

The first step in Amazon QuickSight is to create a new Amazon Athena data source. Choose the heartbeat database created in AWS Glue, and then choose the table that was created by the AWS Glue crawler.

Choose Import to SPICE for quicker analytics. This option creates a data cache and improves graph loading. All non-database datasets must use SPICE. To learn more about SPICE, see Managing SPICE Capacity.

Choose Visualize, and wait for SPICE to import the data to the cache. You can also schedule a periodic refresh so that new data is loaded to SPICE as the data is pipelined to the S3 bucket.
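If you would rather refresh SPICE from your pipeline (for example, after the AWS Glue job finishes), the following is a rough sketch that starts an ingestion through the QuickSight API. The account ID and dataset ID are placeholders for illustration, and scheduled refreshes can also be configured entirely in the QuickSight console:

    import uuid
    import boto3

    quicksight = boto3.client("quicksight")

    # Start a SPICE refresh (ingestion) for the dataset; each refresh needs a unique ID
    quicksight.create_ingestion(
        AwsAccountId="123456789012",
        DataSetId="heartrate-dataset-id",
        IngestionId=str(uuid.uuid4()),
    )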

When the SPICE import is complete, you can create a visual dashboard easily. The following figure shows graphs displaying the occurrence of heart rate records per device.

You can also use generative BI with Amazon Q to generate visualizations for you. To do this, the first step is to create a Q topic. You need a user with either Author Pro or Administrator Pro permissions to create a Q topic. For more information on user roles, see Amazon Q in QuickSight brings new user roles and pricing to Amazon QuickSight Enterprise Edition.

Once you have logged in with the proper permissions, choose Topics from the left navigation menu in QuickSight and select New Topic.

Select the previously created dataset as the source for this topic. It might take a few minutes for the topic to be ready. Once ready, the status shows as 100%.

Now you can choose Create a new analysis using your dataset, and this time use Q to create the visualization.

In the example below, we created a bar chart of max heartrate by deviceid, using the prompt “bar chart showing max heartrate by deviceid”.

Conclusion

Processing streaming data at scale is relevant in every industry. Whether you process data from wearables to tackle human health issues or address predictive maintenance in manufacturing centers, AWS can help you simplify your data ingestion and analysis while keeping your overall IT expenditure manageable.

In this two-part series, you learned how to ingest streaming data from a heart rate sensor and visualize it in such a way as to create actionable insights. You also learned how to fast-track your development experience by leveraging the power of Amazon Q. The current state of the art available in the big data and machine learning space makes it possible to ingest terabytes and petabytes of data and extract useful and actionable information from that process.

If you found this post useful, be sure to check out Work with partitioned data in AWS Glue, and 10 visualizations to try in Amazon QuickSight with sample data.


About the Authors

Saurabh Shrivastava is a partner solutions architect and big data specialist working with global systems integrators. He works with AWS partners and customers to provide them architectural guidance for building scalable architecture in hybrid and AWS environments.

Abhinav Krishna Vadlapatla is a Solutions Architect with Amazon Web Services. He supports startups and small businesses with their cloud adoption to build scalable and secure solutions using AWS. During his free time, he likes to cook and travel.

John Cupit is a partner solutions architect for AWS’ Global Telecom Alliance Team. His passion is leveraging the cloud to transform the carrier industry. He has a son and daughter who have both graduated from college. His daughter is gainfully employed, while his son is in his first year of law school at Tulane University. As such, he has no spare money and no spare time to work a second job.

David Cowden is a partner solutions architect and IoT specialist working with AWS emerging partners. He works with customers to provide them architectural guidance for building scalable architecture in the IoT space.

Josh Ragsdale is an enterprise solutions architect at AWS. His focus is on adapting to a cloud operating model at very large scale. He enjoys cycling and spending time with his family outdoors.

Pierre-Yves Aquilanti, Ph.D., is a senior specialized HPC solutions architect at AWS. He spent several years in the oil & gas industry to optimize R&D applications for large scale HPC systems and enable the potential of machine learning for the upstream. He and his family crave to live in Singapore again for the human, cultural experience and eat fresh durians.

Manuel Puron is an enterprise solutions architect at AWS. He has been working in cloud security and IT service management for over 10 years. He is focused on the telecommunications industry. He enjoys video games and traveling to new destinations to discover new cultures.

Rahul Singh is a Sr. Solutions Architect working with global systems integrators. His expertise is in AI/ML and data engineering. He's a happy camper in Chicago with his family and enjoys playing baseball with his kids.


Audit History

Last reviewed and updated in August 2024 by Rahul Singh | Sr. Solutions Architect