AWS Storage Blog
Adapting to change with data patterns on AWS: The “curate” cloud data pattern
As part of my re:Invent 2024 Innovation talk, I shared three data patterns that many of our largest AWS customers have adopted. This article focuses on the “Curate” data pattern, which we have seen more AWS customers adopt in the last 12-18 months as they look to leverage data sets for both analytics and AI applications. You can also watch this four-minute video clip on the Curate data pattern if interested.
With Curate, instead of individual application owners building and maintaining their own data sets from aggregated data, the company standardizes on a few data products that are used across the organization. When I talked about the Aggregate data pattern, I described how AWS customers apply standards in a federated data ownership model. Curate takes that standardization a step further: the organization standardizes on data products, often in conjunction with standardizing on file formats and other conventions. Application builders discover datasets through a business or technical catalog (rather than going directly to data storage) or through an internal or external data marketplace. By centralizing data usage in data products, it is easier to govern usage and audit for compliance, because it narrows application builders' data exposure to a subset of the organization's total data.
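To make the discovery flow concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory catalog; the `DataCatalog` class, product names, tags, and S3 prefixes are all invented for illustration and are not an AWS API:

```python
# Hypothetical sketch: application builders discover curated data products
# through a catalog by business/technical tags, instead of browsing raw
# storage paths. All names and fields here are illustrative.

class DataCatalog:
    def __init__(self):
        self._products = {}

    def register(self, name, owner, tags, location):
        """Data product teams publish curated products with metadata."""
        self._products[name] = {
            "owner": owner,
            "tags": set(tags),
            "location": location,  # e.g., an S3 prefix for the product
        }

    def discover(self, tag):
        """Builders search by tag; storage details stay behind the catalog."""
        return [name for name, meta in self._products.items()
                if tag in meta["tags"]]

catalog = DataCatalog()
catalog.register("vehicle-valuations", owner="pricing-team",
                 tags=["valuation", "pii-free"],
                 location="s3://curated/valuations/")
catalog.register("listing-events", owner="web-team",
                 tags=["clickstream"],
                 location="s3://curated/listings/")

print(catalog.discover("valuation"))  # ['vehicle-valuations']
```

Because builders only ever see the products the catalog exposes, governance and audit can focus on that narrow surface rather than on every object in storage.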
This data pattern has really taken off in the last few years as companies look to leverage shared data sets in their data lake for analytics and ML/AI applications. For example, Cox Automotive Inc. builds data products from their S3 aggregated data. Those data products are used for a variety of business processes, from analytics to powering vehicle valuations in Kelley Blue Book or Autotrader, where more than 67% of automotive shoppers visit. Internal Cox Automotive application builders can get petabytes of data from the Cox Automotive Data Marketplace and use the data to build models that drive vehicle valuations and consumer insights applications. Cox Automotive says what makes the Curate data pattern successful is the quality of their data, the richness of the metadata in their data catalog so that they can find and build the right data products, and their Data Marketplace, which serves as the hub for distributing data products to application owners. You can learn more about Cox Automotive’s data journey here.
The AWS customers who have implemented the Curate data pattern say that in order to make great data products, you have to invest in your organization. When you use Curate, don’t think of data products as simply collecting data from different sources. If you are doing that, you are doing the Aggregate data pattern without the extra step of data curation. To build great data products, make sure you have product thinkers who start by thinking through how application builders will use the data, and then work backwards from there to build a remarkable data product. That means you need the right talent in product management, data science, and data engineering to create the highest quality data products for your business.
With Curate, you centralize ownership of data quality. That means the rest of the organization gets broader leverage from your centralized data team. The team building your data products is the steward of your business-critical data and sets the data quality standard for the applications that use it. For example, the PGA TOUR uses Amazon Bedrock to retrieve and integrate both structured datasets (e.g., scoring and stats) and unstructured data (e.g., commentary and social media) to help fans closely follow players through real-time updates, narratives, predictions, alerts, betting, and multilingual commentary. These application experiences are powered by clean, curated data products that provide the latest data for the application experience.
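The structured-plus-unstructured idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the PGA TOUR's or Amazon Bedrock's actual code; the `build_fan_update` function and the sample data are invented:

```python
# Hypothetical sketch: merging a curated structured stat with a curated
# unstructured commentary snippet into one fan-facing update.
# All data and names are invented for illustration.

def build_fan_update(player, scores, commentary):
    """Combine structured data (score to par) with an unstructured
    snippet (latest commentary) for one player."""
    stat = scores[player]              # structured: score relative to par
    note = commentary.get(player, "")  # unstructured: latest commentary
    sign = "+" if stat > 0 else ""     # golf convention: show + for over par
    return f"{player} is {sign}{stat}. {note}".strip()

scores = {"S. Scheffler": -6}
commentary = {"S. Scheffler": "Closed with three straight birdies."}

print(build_fan_update("S. Scheffler", scores, commentary))
# S. Scheffler is -6. Closed with three straight birdies.
```

The point of curation here is that both inputs arrive already cleaned and keyed consistently (the same player identifier in both products), so the application layer can stay simple.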
Siemens centralizes their data in S3, which contains petabytes of data from over 300 sources and is used to build data products for more than 70,000 internal data consumers. As in the Aggregate data pattern, different data providers are responsible for ingesting data into S3 and maintaining the technical metadata of their data sets. However, Siemens also uses Curate for specific data sets. Siemens built an internal data marketplace, called Datalake2Go, where consumers can find and explore data sets for their specific needs, like a researcher building an ML model to optimize rail operations. Before Siemens launched Datalake2Go, it took internal users weeks to find and contact the right owner for a set of data, followed by even more time to go through a manual access-request process. By giving application builders self-service access to curated data, they are seeing big productivity gains. For example, with Datalake2Go, Siemens application developers built an internal Intelligent Document Mapping application that takes a variety of documents and produces step-by-step instructions for engineers, including automatically analyzing the design, creating a bill of materials, and even enhancing the drawings. Previously, engineers at the factory did this work manually. Siemens is now saving more than 15,000 hours of work and almost 1 million euros per year using applications like Intelligent Document Mapping, built from a few well-curated data products.
To learn more about adding the Curate data pattern to your business, you can watch Moied Wahid, the Executive VP and CTO for Experian’s Financial Services & Data Business, walk through how Experian started from an on-premises world, adopted the Aggregate data pattern, and then added Curate for data products. It’s a fascinating journey taken by a talented team, and Moied was kind enough to share Experian’s lessons learned along the way.
As I talked about in my introduction to the three data patterns, we have customers that implement one data pattern across the whole organization and other customers that apply a data pattern to specific teams. For example, customers often use Aggregate in fraud detection teams because ML data scientists want access to high volumes of diverse, unprocessed data sets. We see customers gravitate to the Curate data pattern for shared data sets where data quality is critical and requires processing. Common examples include AI researchers working with curated training datasets, developers of knowledge bases used in inference-driven applications, and targeted advertising. Curate is also helpful for highly governed industries or datasets (for example, datasets that contain personally identifiable information). And other customers take Curate a step further with the Extend data pattern.
This post is part of a series on adapting to change with data patterns on AWS: