AWS Big Data Blog
How smava makes loans transparent and affordable using Amazon Redshift Serverless
This is a guest post co-written by Alex Naumov, Principal Data Architect at smava.
smava GmbH is one of the leading financial services companies in Germany, making personal loans transparent, fair, and affordable for consumers. Based on digital processes, smava compares loan offers from more than 20 banks. In this way, borrowers can choose the deals that are most favorable to them in a fast, digitalized, and efficient way.
smava relies on data-driven decisions in its drive to become the market leader. The Data Platform team is responsible for supporting data-driven decisions at smava by providing data products across all departments and branches of the company. The departments include teams from engineering to sales and marketing. The branches are organized by product: B2C loans, B2B loans, and formerly also B2C mortgages. The data products used inside the company include insights from user journeys, operational reports, and marketing campaign results, among others. The data platform serves an average of 60,000 queries per day. The data volume is in the double-digit terabyte range and grows steadily as the business and its data sources evolve.
smava’s Data Platform team faced the challenge of delivering data to stakeholders with different SLAs while keeping the flexibility to scale up and down cost-efficiently. Generating the daily reporting took up to 3 hours, which slowed business decision-making whenever recalculations were needed during the day. To speed up self-service analytics and foster data-based innovation, smava needed a solution that allows any team to create data products on its own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.
In this post, we show how smava optimized their data platform by using Amazon Redshift Serverless and Amazon Redshift data sharing to overcome right-sizing challenges for unpredictable workloads and further improve price-performance. Through the optimizations, smava achieved up to 50% cost savings and up to three times faster report generation compared to the previous analytics infrastructure.
Overview of solution
As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.
smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3). To ingest the data, smava uses a set of popular third-party customer data platforms complemented by custom scripts.
After the data lands in Amazon S3, smava uses the AWS Glue Data Catalog and crawlers to automatically catalog the available data, capture the metadata, and provide an interface that allows querying all data assets.
Data analysts who require access to the raw assets on the data lake use Amazon Athena, a serverless, interactive analytics service for exploration with ad hoc queries. For the downstream consumption by all departments across the organization, smava’s Data Platform team prepares curated data products following the extract, load, and transform (ELT) pattern. smava uses Amazon Redshift as their cloud data warehouse to transform, store, and analyze data, and uses Amazon Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from the data lake using SQL.
smava follows the data vault modeling methodology with the Raw Vault, Business Vault, and Data Mart stages to prepare the data products for end consumers. The Raw Vault describes objects loaded directly from the data sources and represents a copy of the landing stage in the data lake. The Business Vault is populated with data sourced from the Raw Vault and transformed according to the business rules. Finally, in the Data Mart stage, the data is aggregated into data products oriented to a specific business line. The data products from the Business Vault and Data Mart stages are then available to consumers. smava decided to use Tableau for business intelligence, data visualization, and further analytics. The data transformations are managed with dbt to simplify the workflow governance and team collaboration.
The following diagram shows the high-level data platform architecture before the optimizations.
Evolution of the data platform requirements
smava started with a single Redshift cluster to host all three data stages. They chose provisioned cluster nodes of the RA3 type with Reserved Instances (RIs) for cost optimization. As data volumes grew 53% year over year, so did the complexity and requirements from various analytic workloads.
smava quickly addressed the growing data volumes by right-sizing the cluster and using Amazon Redshift Concurrency Scaling for peak workloads. Furthermore, smava wanted to give all teams the option to create their own data products in a self-service manner to increase the pace of innovation. To avoid any interference with the centrally managed data products, the decentralized product development environments needed to be strictly isolated. The same requirement applied to the isolation of the different product stages curated by the Data Platform team.
Optimizing the architecture with data sharing and Redshift Serverless
To meet the evolved requirements, smava decided to separate the workload by splitting the single provisioned Redshift cluster into multiple data warehouses, with each warehouse serving a different stage. In addition, smava added new staging environments in the Business Vault to develop new data products without the risk of interfering with existing product pipelines. To avoid any interference with the centrally managed data products of the Data Platform team, smava introduced an additional Redshift cluster, isolating the decentralized workloads.
smava was looking for an out-of-the-box solution to achieve workload isolation without managing a complex data replication pipeline.
Right after the launch of Redshift data sharing capabilities in 2021, the Data Platform team recognized that this was the solution they had been looking for. smava adopted the data sharing feature to have the data from producer clusters available for read access on different consumer clusters, with each of those consumer clusters serving a different stage.
Redshift data sharing enables instant, granular, and fast data access across Redshift clusters without the need to copy data. It provides live access to data so that users always see the most up-to-date and consistent information as it’s updated in the data warehouse. With data sharing, you can securely share live data with Redshift clusters in the same or different AWS accounts and across Regions.
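To make the producer and consumer roles concrete, the following minimal Python sketch builds the SQL statements that a data sharing setup typically involves. It is an illustration and not smava's actual configuration: the share, schema, and database names and the namespace placeholders are hypothetical, and the statements are only assembled as strings, not run against a data warehouse.

```python
# Minimal sketch: assemble Redshift data sharing SQL for one producer/consumer pair.
# All identifiers (share, schema, database, namespaces) are hypothetical examples.


def producer_statements(share: str, schema: str, consumer_namespace: str) -> list[str]:
    """SQL to run on the producer: create the share, add objects, grant usage."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statement(database: str, share: str, producer_namespace: str) -> str:
    """SQL to run on the consumer: expose the share as a local, read-only database."""
    return (
        f"CREATE DATABASE {database} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';"
    )


if __name__ == "__main__":
    for sql in producer_statements("raw_vault_share", "raw_vault", "<consumer-namespace-id>"):
        print(sql)
    print(consumer_statement("raw_vault_db", "raw_vault_share", "<producer-namespace-id>"))
```

Because the consumer only creates a database that points at the share, no data is copied. Queries on the consumer read the producer's live data.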
With Redshift data sharing, smava was able to optimize the data architecture by separating the data workloads to individual consumer clusters without having to replicate the data. The following diagram illustrates the high-level data platform architecture after splitting the single Redshift cluster into multiple clusters.
With a self-service data mart, smava increased data democratization by giving users access to all aspects of the data. They also provided teams with a set of custom tools for data discovery, ad hoc analysis, prototyping, and operating the full lifecycle of mature data products.
After collecting operational data from the individual clusters, the Data Platform team identified further potential optimizations: the Raw Vault cluster was under steady load 24/7, but the Business Vault clusters were only updated nightly. To optimize for costs, smava used the pause and resume capabilities of Redshift provisioned clusters. These capabilities are useful for clusters that need to be available at specific times. While the cluster is paused, on-demand billing is suspended. Only the cluster’s storage incurs charges.
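A schedule-driven pause and resume can be sketched as a small decision function. This is a hedged illustration under stated assumptions, not smava's tooling: the nightly window is hypothetical, and the status strings `available` and `paused` stand in for the cluster status that a scheduler would read. The actual pause and resume calls (for example, through the Amazon Redshift API) are deliberately left out.

```python
# Minimal sketch: decide whether a provisioned cluster should be paused or resumed.
# The window times and the status values used here are illustrative assumptions.
from datetime import time
from typing import Optional


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def desired_action(now: time, start: time, end: time, status: str) -> Optional[str]:
    """Return 'resume', 'pause', or None if the cluster is already in the right state."""
    active = in_window(now, start, end)
    if active and status == "paused":
        return "resume"
    if not active and status == "available":
        return "pause"
    return None


if __name__ == "__main__":
    # Hypothetical nightly update window from 23:00 to 03:00.
    start, end = time(23, 0), time(3, 0)
    print(desired_action(time(1, 0), start, end, "paused"))
    print(desired_action(time(12, 0), start, end, "available"))
```

A scheduler would run this check periodically and trigger the matching cluster operation.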
The pause and resume feature helped smava optimize for cost, but it required additional operational overhead to trigger the cluster operations. Additionally, the development clusters remained subject to idle times during working hours. These challenges were finally solved by adopting Redshift Serverless in 2022. The Data Platform team decided to move the Business Vault stage clusters to Redshift Serverless, so that they pay for the data warehouse only while it is in use and no longer need to manage its availability themselves.
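The billing difference between the three operating models can be illustrated with simple arithmetic. The following sketch compares compute cost only and ignores storage charges. Every number in it (hourly rate, capacity in RPUs, price per RPU-hour, busy hours per day) is a hypothetical input chosen for the example, not an AWS price or a smava figure.

```python
# Minimal sketch: compare compute cost of always-on, paused-when-idle, and pay-per-use.
# All rates and capacities below are hypothetical inputs, not real prices.
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    busy_hours_per_day: float  # hours per day with queries actually running
    days: int = 30


def provisioned_cost(w: Workload, hourly_rate: float, paused_when_idle: bool = False) -> float:
    """Provisioned cluster: billed around the clock unless it is paused while idle."""
    billed_hours = w.busy_hours_per_day if paused_when_idle else 24.0
    return billed_hours * w.days * hourly_rate


def serverless_cost(w: Workload, rpus: float, price_per_rpu_hour: float) -> float:
    """Serverless: billed in RPU-hours, only while the workload is running."""
    return w.busy_hours_per_day * w.days * rpus * price_per_rpu_hour


if __name__ == "__main__":
    nightly = Workload(busy_hours_per_day=4)
    print("always on:       ", provisioned_cost(nightly, hourly_rate=10.0))
    print("paused when idle:", provisioned_cost(nightly, hourly_rate=10.0, paused_when_idle=True))
    print("serverless:      ", serverless_cost(nightly, rpus=32, price_per_rpu_hour=0.5))
```

The sketch shows why idle time drives the comparison: the fewer busy hours a workload has, the larger the gap between always-on billing and the other two models. Unlike pause and resume, the serverless model needs no scheduler to realize that gap.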
Redshift Serverless is ideal for cases when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. Additionally, as usage demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and the data warehouse scales seamlessly and automatically, without the need for manual intervention. Data sharing is supported in both directions between Redshift Serverless and provisioned Redshift clusters with RA3 nodes, so no changes to the smava architecture were needed. The following diagram shows the high-level architecture setup after the move to Redshift Serverless.
smava combined the benefits of Redshift Serverless and dbt through a seamless CI/CD pipeline, adopting a trunk-based development methodology. Changes on the Git repository are automatically deployed to a test stage and validated using automated integration tests. This approach increased the efficiency of developers and decreased the average time to production from days to minutes.
smava adopted an architecture that utilizes both provisioned and serverless Redshift data warehouses, together with the data sharing capability to isolate the workloads. By choosing the right architectural patterns for their needs, smava was able to accomplish the following:
- Simplify the data pipelines and reduce operational overhead
- Reduce the feature release time from days to minutes
- Increase price-performance by reducing idle times and right-sizing the workload
- Achieve up to three times faster report generation (faster calculations and higher parallelization) at 50% of the original setup costs
- Increase agility of all departments and support data-driven decision-making by democratizing access to data
- Increase the speed of innovation by exposing self-service data capabilities for teams across all departments and strengthening the A/B test capabilities to cover the complete customer journey
Now, all departments at smava are using the available data products to make data-driven, accurate, and agile decisions.
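The combined effect of the reported speedup and cost reduction can be worked out with a short calculation. The sketch below assumes that price-performance is measured as work done per unit of cost. Because both input figures are stated as "up to", the results are upper bounds, not measured values.

```python
# Minimal sketch: combine the reported speedup and cost ratio into one figure.
# Assumes price-performance = work done per unit of cost; results are upper bounds.


def price_performance_gain(speedup: float, cost_ratio: float) -> float:
    """Relative price-performance versus the original setup."""
    if speedup <= 0 or cost_ratio <= 0:
        raise ValueError("speedup and cost_ratio must be positive")
    return speedup / cost_ratio


def best_case_runtime_hours(original_hours: float, speedup: float) -> float:
    """Runtime of a job after applying the speedup."""
    return original_hours / speedup


if __name__ == "__main__":
    # Up to 3x faster at 50% of the original cost; daily reporting took up to 3 hours.
    gain = price_performance_gain(speedup=3.0, cost_ratio=0.5)
    runtime = best_case_runtime_hours(original_hours=3.0, speedup=3.0)
    print(f"price-performance gain: up to {gain:.0f}x")
    print(f"daily reporting: about {runtime:.0f} hour in the best case")
```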
Future vision
For the future, smava plans to continue to optimize the Data Platform based on operational metrics. They’re considering switching more provisioned clusters, such as the Self-Service Data Mart cluster, to serverless. Additionally, smava is optimizing the ELT orchestration toolchain to increase the number of data pipelines that can run in parallel. This will increase the utilization of provisioned Redshift resources and allow for cost reductions.
With the introduction of decentralized, self-service data product creation, smava took a step toward a data mesh architecture. In the future, the Data Platform team plans to further evaluate the needs of their service users and establish further data mesh principles like federated data governance.
Conclusion
In this post, we showed how smava optimized their data platform by isolating environments and workloads using Redshift Serverless and data sharing features. Those Redshift environments are well integrated with their infrastructure, flexible in scaling on demand, and highly available, and they require minimal administration effort. Overall, smava has increased performance by up to three times while reducing the total platform costs by up to 50%. Additionally, they reduced operational overhead to a minimum while maintaining the existing SLAs for report generation times. Moreover, smava has strengthened the culture of innovation by providing self-service data product capabilities to speed up their time to market.
If you’re interested in learning more about Amazon Redshift capabilities, we recommend watching the most recent What’s new with Amazon Redshift session in the AWS Events channel to get an overview of the features recently added to the service. You can also explore the self-service, hands-on Amazon Redshift labs to experiment with key Amazon Redshift functionalities in a guided manner.
You can also dive deeper into Redshift Serverless use cases and data sharing use cases. Additionally, check out the data sharing best practices and discover how other customers optimized for cost and performance with Redshift data sharing to get inspired for your own workloads.
If you prefer books, check out Amazon Redshift: The Definitive Guide by O’Reilly, where the authors detail the capabilities of Amazon Redshift and provide you with insights on corresponding patterns and techniques.
About the Authors
Alex Naumov is a Principal Data Architect at smava GmbH and leads the transformation projects in the Data department. Alex previously worked for 10 years as a consultant and data/solution architect in a wide variety of domains, such as telecommunications, banking, energy, and finance, using various tech stacks and in many different countries. He has a great passion for data and for transforming organizations to become data-driven and the best in what they do.
Lingli Zheng works as a Business Development Manager in the AWS worldwide specialist organization, supporting customers in the DACH region to get the best value out of Amazon analytics services. With over 12 years of experience in energy, automation, and the software industry with a focus on data analytics, AI, and ML, she is dedicated to helping customers achieve tangible business results through digital transformation.
Alexander Spivak is a Senior Startup Solutions Architect at AWS, focusing on B2B ISV customers across EMEA North. Prior to AWS, Alexander worked as a consultant in financial services engagements, including various roles in software development and architecture. He is passionate about data analytics, serverless architectures, and creating efficient organizations.
This post was reviewed for technical accuracy by David Greenshtein, Senior Analytics Solutions Architect.