AWS Big Data Blog

How BMW streamlined data access using AWS Lake Formation fine-grained access control

This post is cowritten with Ruben Simon and Khalid Al Khalili from BMW.

BMW’s ambition is to continuously accelerate innovation and improve decision-making across their global operations. To achieve this, they aimed to break down data silos and centralize data from various business units and countries into the BMW Cloud Data Hub (CDH). The CDH is used to create, discover, and consume data products through a central metadata catalog, while enforcing permission policies and tightly integrating data engineering, analytics, and machine learning services to streamline the user journey from data to insight. By building the CDH, BMW realized improved efficiency, performance and sustainability throughout the automotive lifecycle, from design to after-sales services.

With over 10 PB of data across 1,500 data assets, 1,000 data use cases, and more than 9000 users, the BMW CDH has become a resounding success since BMW decided to build it in a strategic collaboration with Amazon Web Services (AWS) in 2020. However, the initial version of CDH supported only coarse-grained access control to entire data assets, and hence it was not possible to scope access to data asset subsets. This led to inefficiencies in data governance and access control.

AWS Lake Formation is a service that streamlines and centralizes the data lake creation and management process. One of its key features is fine-grained access control, which allows customers to granularly control access to their data lake resources at the table, column, and row levels. This level of control is essential for organizations that need to comply with data governance and security regulations, or those that deal with sensitive data.

With fine-grained access control, customers can define and enforce data access policies based on various criteria, such as user roles, data classifications, or data sensitivity levels. This makes sure that only authorized users or applications can access specific data sets or portions of data, but also reduces the risk of unauthorized access or data breaches. Additionally, Lake Formation integrates with AWS Identity and Access Management (IAM) and other AWS services so customers can use existing security and access management practices within their data lake environment.

This post explores how BMW implemented AWS Lake Formation‘s fine-grained access control (FGAC) in the CDH and how this saves them up to 25% on compute and storage costs.

The Solution: How BMW CDH solved data duplication

The CDH is a company-wide data lake built on Amazon Simple Storage Service (Amazon S3). The CDH serves as a centralized repository for petabytes of data from engineering, manufacturing, sales, and vehicle performance and provides BMW employees with a unified view of the organization and acts as a starting point for new development initiatives. It streamlines access to various AWS services, including Amazon QuickSight, for building business intelligence (BI) dashboards and Amazon Athena for exploring data. Many of these services are embedded into the CDH data portal, which offers a web-based user interface for accessing and interacting with the platform. It allows users to discover datasets, manage data assets, and consume data for their use cases. The architecture is shown in the following figure.

The BMW CDH follows a decentralized, multi-account architecture to foster agility, scalability, and accountability. It comprises distinct AWS account types, each serving a specific purpose. The following account types are relevant for implementation:

  • Resource accounts: Accounts are used for centralized storage repositories, hosting the datasets and their associated metadata across different stages (such as development, integration, and production) and AWS Regions.
  • Consumer accounts: Used by data consumers to implement use cases insights and build applications tailored to their business needs.
  • CDH control plane account: This account contains the APIs for creating filter packages and controlling access. A filter package provides a restricted view of a data asset by defining column and row filters on the tables.

The following are the three key roles within the CDH’s decentralized architecture:

  • Data providers, who provision data assets in resource accounts
  • Data stewards, who govern data assets
  • Use cases (data consumers), which use data assets to derive insights and build applications inside of consumer accounts to support decision-making processes.

For example, a global sales dataset is created by a team of data engineers with the data provider role. A data analyst in a local market who wants to derive insights from the global sales data can create a use case with a dedicated AWS consumer account and request access to the dataset from a data steward.

This multi-account strategy promotes a clear separation of concerns, empowering data producers and consumers to operate independently while using the centralized governance and services provided by the solution. The following figure illustrates how Lake Formation is used across the resource and consumer accounts in the CDH to provide FGAC to use cases.

The CDH uses the AWS Glue in resource accounts as a technical metadata catalog and data assets are stored in Amazon S3. Both the data catalog and the locations in Amazon S3 are registered with Lake Formation so that it can govern data access. Data catalogs and tables are shared with consumer accounts and use cases through AWS Resource Access Manager (AWS RAM). With Lake Formation, BMW can control access to data assets at different granularities, such as permissions at the table, column, or row level. Users can then use a Lake Formation integrated engine such as Amazon Athena to access only the data they need, removing the need to duplicate data. For example, to restrict access to a global sales data asset, BMW can now specify row filters in Lake Formation using the PartiQL language, filtering rows based on the country column of the data asset.

Data stewardship: Managing fine-grained access control

At the core of the CDH FGAC implementation lies the concept of filter packages. A filter package provides a selective view of a data asset by defining column and row filters on the tables. Multiple filter packages can be defined for a data asset to create suitable views for different use cases. In our example of the global sales dataset, a data steward creates a filter package for each local market that restricts access to the relevant rows and columns. Data stewards create and manage these packages through the CDH interface. These filter packages are implemented using Lake Formation row-level and column-level access control mechanisms. The following figure illustrates these concepts.

When creating a filter package, data stewards can specify the desired access level for individual tables within their data asset: Full access grants permissions to all columns and rows, None denies access to an entire table, while Filtered allows for granular row-level and column-level access controls.

For filtered access, data stewards use PartiQL queries to define row-level filters on tables, selecting only the rows that meet specific criteria. Additionally, they can specify column-level filters by selecting the accessible columns.

After filter packages have been created and published, they can be requested. Data stewards can review incoming requests and grant or deny access through the CDH interface, making sure that only authorized environments can access sensitive data.

Using fine-grained access control in use cases

Use case owners can browse and search for relevant data assets in the CDH, and then request full or scoped access. The CDH provides a clear overview of the available filter packages, allowing them to select the appropriate level of access based on their use case.

After access is granted to a filter package by the data steward, the filters are enforced for the use case using Lake Formation. Use case owners can further control access at the row and column level for individual users or roles within their use case account using Lake Formation. For example, they can create another column filter to hide a particular column for a particular group of users and provide unfiltered access to another group of users.

Gradual deployment with Lake Formation hybrid access mode

One of the challenges in implementing changes in access control within an existing data lake such as the CDH is the need to coordinate migration between data providers and consumers. To address this, Lake Formation offers a hybrid access mode to facilitate a gradual transition to FGAC without disrupting existing data access patterns.

In hybrid access mode, data providers can activate Lake Formation for new dataset consumers while existing consumers continue to access the data using the legacy permission model. This approach makes sure that consumers can migrate to FGAC at their own pace, minimizing the impact on their existing workloads and processes. A use case account is only switched to Lake Formation permissions for a dataset when it requests access to a filter package. This hybrid approach allows providers and consumers to migrate at their own pace, maintaining a smooth transition to the new access control model.

How BMW saves money by using Lake Formation

As the CDH grew, it became apparent that data was often duplicated for access control purposes. This issue was particularly evident with data assets containing sales data of all markets where BMW operates. Local markets were only eligible to see their own data, and to achieve this, subsets of global data assets had to be duplicated to create isolated local variants. While this approach succeeded in fulfilling access control requirements, it led to increased storage costs, higher compute expenses for data processing and drift detection, and project delays because of time-consuming provisioning processes and governance overhead. At one point, 25% of all data assets in the CDH were duplicates, a natural consequence of these measures.

With Lake Formation, creating these duplicates is no longer necessary. Data stewards can restrict access to global datasets on column and row level to comply with governance requirements. Not only does this reduce the cost for data processing, storage, development and maintenance, it also minimizes the opportunity cost of delayed data access.

Conclusion

By using AWS Lake Formation fine-grained access control capabilities, BMW has transparently implemented finer data access management within the Cloud Data Hub. The integration of Lake Formation has enabled data stewards to scope and grant granular access to specific subsets of data, reducing costly data duplication. This approach enables BMW to save up to 25% on compute and storage costs while reducing governance overhead costs. The hybrid access mode implementation further facilitates a smooth transition to the new access control model, allowing data providers and consumers to migrate at their own pace without disrupting existing workloads and processes. To dive deeper into how to replicate BMWs data success story, check out the AWS blog post on building a data mesh with Amazon Lake Formation and AWS Glue.


About the authors

Ruben Simon is a Head of Product for BMW’s Cloud Data Hub, the company’s largest data platform. He is passionate about driving digital transformation in aata, analytics, and AI, and thrives on collaborating with international teams. Outside the office, Ruben cherishes family time and has a keen interest in continual learning.

Khalid Al Khalili is a Data Architect at BMW Group, leading the architecture of the Cloud Data Hub, BMW’s central platform for data innovation. He is a strong advocate for creating seamless data experiences, transforming complex requirements into efficient, user-friendly solutions. When he’s not building new features, Khalid enjoys collaborating with his peers and cross-functional teams to advance and shape BMW’s data strategy, ensuring it stays ahead in a rapidly evolving landscape.

Florian Seidel is a Global Solutions Architect specializing in the automotive sector at AWS. He guides strategic customers in harnessing the full potential of cloud technologies to drive innovation in the automotive industry. With a passion for analytics, machine learning, AI, and resilient distributed systems, Florian helps transform cutting-edge concepts into practical solutions. When not architecting cloud strategies, he enjoys cooking for family and friends and experimenting with electronic music production.

Aishwarya Lakshmi Krishnan is a Senior Customer Solutions Manager with AWS Automotive. She is passionate about solving business problems using generative AI and cloud based technologies.

Durga Mishra is a Principal solutions architect at AWS. Outside of work, Durga enjoys spending time building new things and spend time with family and loves to hike on Appalachian trails and spend time in nature.