AWS Database Blog
How OLX optimized their Amazon DynamoDB costs by scaling their query patterns
This is a guest post by Miguel Alpendre (Engineering Manager at OLX Group), Rodrigo Lopes (Senior Software Engineer at OLX Group), and Carlos Tarilonte (Senior Developer at OLX Group) in partnership with Luís Rodrigues Soares (Solution Architect at AWS).
At OLX, we operate the world’s fastest-growing network of trading platforms. Serving 300 million people every month in over 30 countries around the world, we help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more. With more than 20 well-loved local brands including Avito, OLX, Otomoto, and Property24, our solutions are built to be safe, smart, and convenient for our customers, and are powered by a team of over 10,000 people, working across five continents in offices all around the world.
At OLX Europe, we run a monitoring platform that covers all operations performed on advertisements by external partners. This workload was backed by a 9.5 TB Amazon DynamoDB table, which served as an event store. As the business continued to grow and usage diverged from the original design, we started looking for a way to rearchitect our workload to contain costs, provide more flexibility in querying data, and transition to the new architecture with no loss of data.
In this post, we show you how we built an extensible solution that is an order of magnitude less expensive than the previous one. We did this by working together with our business stakeholders and including them in the redesign phase of this project, partnering with our Amazon Web Services (AWS) Solutions Architect, and taking advantage of the latest updates to AWS services.
Challenges with the original design
At OLX Europe, we run an API platform for advertisement (advert) ingestion, which is currently being used by more than 50 external partners in three countries. This system validates advert taxonomy, handles image uploading, and notifies our partners about advert operations in real time—it produces hundreds of thousands of events daily.
This API platform, shown in Figure 1 that follows, is implemented using 40 Amazon Elastic Kubernetes Service (Amazon EKS) services and AWS Lambda functions publishing events at any given time to fulfill our partner API requests. These requests go through Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS) until they reach an event processor, which writes them to DynamoDB for storage. To support use cases of traceability and “replayability” of requests, we created a custom solution, Mercury Control Center (MCC), which allows our support teams to make specialized queries.
In its 2018 design, our main table was designed around the event ID, a unique field generated for each event, plus other identifiers that were part of the event definition:
| Partition key | Other fields |
| --- | --- |
| event-id: String | transaction-id: String, advert-id: String, created-at: String, service-name: String, (…) |
Original table definition, which assumed GetItem by event-id was the main use case.
Initially, our assumption was that all interactions with the table would be GetItem calls based on the event-id partition key. Any other filtering required, such as by advert-id or transaction-id, could be done on a case-by-case basis on the application side, as we didn’t expect requests by these fields to be common. There were also fewer business users at the time, and this simple structure allowed us to start off quickly.
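To make that concrete, here is a minimal boto3 sketch of that original access pattern. The table name (`events`) and the key value are illustrative assumptions, not taken from our actual codebase:

```python
import boto3

# Hypothetical table name for the original event store.
dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("events")

# Original assumption: every read is a GetItem by the event-id partition key.
response = events.get_item(
    Key={"event-id": "7f9c2a14-3b8e-4d61-9c2f-0a1b2c3d4e5f"}  # example value
)
item = response.get("Item")
```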
However, as the use of our system grew over the first few months, new use cases were discovered and the teams using MCC needed new query patterns based on either the transaction-id or the advert-id. We considered both local secondary indexes (LSIs) and global secondary indexes (GSIs) but ultimately opted for GSIs. GSIs gave us the flexibility of using a different partition key than the event-id, while LSIs not only share the partition key with the main table, but also need to be created at table creation time.
Using GSIs allowed us to have queries based on transaction-id and on advert-id, like so:
| GSI partition key | Other fields projected |
| --- | --- |
| transaction-id: String | event-id: String, advert-id: String, created-at: String, service-name: String, (…) |

GSI for transactions, where the use case was GetItem by transaction-id.
And another GSI for queries based on the advert-id:
| GSI partition key | Other fields projected |
| --- | --- |
| advert-id: String | event-id: String, transaction-id: String, created-at: String, service-name: String, (…) |

GSI for adverts, where the use case was GetItem by advert-id.
Because of changes in business cases, the queries shifted from the main table, where we were doing GetItem calls by event-id, to the GSIs, where we now did GetItem by transaction-id or advert-id.
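In practice, a lookup through a GSI is a Query rather than a GetItem. Here is a hedged sketch of what such a read might look like, assuming an index name of `transaction-id-index` (the real index name may differ):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("events")  # hypothetical table name

# Fetch every event that belongs to one transaction through the GSI.
response = events.query(
    IndexName="transaction-id-index",  # assumed index name
    KeyConditionExpression=Key("transaction-id").eq("txn-0001"),
)
items = response["Items"]
```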
The GSIs were fulfilling the main business use cases, which was good, but no requests were going to the main table, as the business no longer had to query by event-id. The flaw in our design was that we started off with a uniquely generated ID (event-id), which would have been natural in a relational model but wasn’t really part of the main business use cases, which centered around transactions and adverts. This was the biggest lesson of these two years: it’s fundamental to design DynamoDB tables by working backwards from our business use cases, which requires a constant, ongoing conversation with business stakeholders to make sure their needs are faithfully represented in the table structure.
Besides this design inefficiency, we identified other improvements. Because our items have an average size of 6 KB and we had two GSIs with all fields projected, every PutItem operation consumed an average of 18 write capacity units (WCUs). In addition, our GSIs didn’t use any fields as sort keys, forcing most of the sorting and aggregation work to happen on the application side, in MCC, which was neither reliable nor performant. Finally, it wasn’t possible to determine the precise order of a set of events corresponding to a given transaction ID.
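As a rough back-of-the-envelope check of that write cost, using the standard DynamoDB accounting of one WCU per 1 KB written, and remembering that a GSI with all fields projected replicates the full item:

```python
import math

ITEM_SIZE_KB = 6           # average item size we observed
FULLY_PROJECTED_GSIS = 2   # both GSIs projected every field

base_wcus = math.ceil(ITEM_SIZE_KB)                        # 6 WCUs for the base table write
gsi_wcus = FULLY_PROJECTED_GSIS * math.ceil(ITEM_SIZE_KB)  # 12 WCUs for the two GSIs

print(base_wcus + gsi_wcus)  # 18 WCUs per PutItem on average
```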
DynamoDB had also changed since 2018. Time to Live (TTL) was a recent feature back then, and we never considered using it because we had no initial requirements or concerns about data growth. As the data we managed kept growing, we were faced with the question of what data is really needed and for how long. The ability to retain only certain data was interesting, but because the created-at field wasn’t a UNIX timestamp, adding a TTL field to the large majority of our items would have been far too costly.
Due to the inefficiencies in our original table design, more services being integrated into our platform, and more events being generated, costs (over $10,000 per month) and data (approximately 2 billion records, over 9 TB) kept growing. In October 2020, we decided to take a fresh look at the problem, at DynamoDB, and at our use cases, and redesign the event store behind MCC.
Solution
We approached the problem in three ways:
- Redesigning the base table so that the main use case was supported by it and new use cases could be supported through GSIs.
- Having an explicit policy for how long data needs to be kept, derived from clear business requirements.
- Ensuring that the existing data set remained available for the support team to query, in case there was a need to access the data we had aggregated between 2018 and 2021.
Table redesign
We started our redesign process by talking with our business stakeholders, who had two core use cases:
- Query by transaction ID, sorted by event creation date.
- Query by advert ID, also sorted by event creation date.
In addition, it was important for the business to know the service name that generated each event, so support teams could query for and sort events generated by specific services.
Since the event ID was irrelevant for our business cases, we removed it from the new design, also following Amazon DynamoDB best practices. We started the new design with our main use case, which was querying by transaction ID.
Since the events were always needed in sorted order, we made the sort key a compound key composed of created-at and service-name (created-at#service-name), giving business users an immediate view of the progress of adverts through our systems.
| Partition key | Sort key | Other fields |
| --- | --- | --- |
| transaction-id: String | created-at#service-name: String | advert-id: String, TTL: String, (…) |
New core table, supporting the main business use case of query by transaction-id, sorted by creation date (created-at) and service which created the event (service-name).
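Here is a minimal sketch of how an item could be written with the compound sort key and then read back in creation order. The table name (`seshat-events`) and all attribute values are illustrative assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("seshat-events")  # hypothetical name for the redesigned table

created_at = "2021-03-01T10:15:00Z"
service_name = "image-uploader"  # example service

# The sort key concatenates the creation timestamp and the service name.
table.put_item(Item={
    "transaction-id": "txn-0001",
    "created-at#service-name": f"{created_at}#{service_name}",
    "advert-id": "advert-42",
})

# Main use case: all events for a transaction, sorted by creation date.
response = table.query(
    KeyConditionExpression=Key("transaction-id").eq("txn-0001"),
    ScanIndexForward=True,  # ascending order on the created-at#service-name sort key
)
```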
We then created a single GSI with the advert ID (the second use case, as explained by our stakeholders) as a partition key and the creation date again as a sort key. The GSI only had a few fields projected, further reducing the size and cost of this index:
| Partition key | Sort key | Other fields |
| --- | --- | --- |
| advert-id: String | created-at#service-name: String | requesting_user: String, service: String, topic: String |

Single GSI, supporting query by advert-id, sorted by creation date (created-at) and service which created the event (service-name).
The main table item is still approximately 6 KB, but we now project far fewer fields onto the GSI, so the average write cost is roughly 6 WCUs per item, one third of the previous implementation.
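The second use case translates into a very similar Query, this time against the GSI and optionally narrowed by a date prefix on the sort key. The index name (`advert-id-index`) is an assumption for illustration:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("seshat-events")  # hypothetical table name

# All events for one advert created in March 2021, sorted by creation date.
response = table.query(
    IndexName="advert-id-index",  # assumed index name
    KeyConditionExpression=Key("advert-id").eq("advert-42")
    & Key("created-at#service-name").begins_with("2021-03"),
)
items = response["Items"]
```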
Enabling TTL
We aligned with our business stakeholders (local sales teams, customer care, and external teams), and all agreed that we only need four months of event history to serve internal and external support requests. So, the last change we made to our table was to add a TTL field set to four months after the creation date of each event, so that the TTL process could expire items according to our stakeholders’ requirements.
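Here is a hedged sketch of how the TTL value could be set on write and how the feature is enabled on the table. DynamoDB’s TTL process expects a Number attribute holding an epoch timestamp; the table name and attribute values below are illustrative:

```python
import time
import boto3

FOUR_MONTHS_IN_SECONDS = 4 * 30 * 24 * 60 * 60  # rough approximation for illustration

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("seshat-events")  # hypothetical table name

created_at_epoch = int(time.time())
table.put_item(Item={
    "transaction-id": "txn-0001",
    "created-at#service-name": "2021-03-01T10:15:00Z#image-uploader",
    "advert-id": "advert-42",
    "TTL": created_at_epoch + FOUR_MONTHS_IN_SECONDS,  # item expires four months after creation
})

# Enabling TTL on the table is a one-off operation.
boto3.client("dynamodb").update_time_to_live(
    TableName="seshat-events",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "TTL"},
)
```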
Keeping existing data available for later analysis
We would, however, need the current data (over 10 TB) for future analysis and to correlate it with other work streams. We were still studying options to export this data out of DynamoDB when AWS launched a new feature for exporting DynamoDB table data to Amazon Simple Storage Service (Amazon S3), which helped us solidify our final proposal.
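The export itself is a single API call against a point-in-time snapshot of the table (point-in-time recovery must be enabled). A sketch with placeholder table ARN and bucket name:

```python
import boto3

client = boto3.client("dynamodb")

# Export the full table to S3 in DynamoDB JSON format.
# The table ARN and bucket name are placeholders.
client.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/old-event-store",
    S3Bucket="example-event-store-archive",
    ExportFormat="DYNAMODB_JSON",
)
```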
Transition
In November 2020, we had the new table (named Seshat) ready and in production. We subscribed all the Amazon SNS topics to a new Amazon SQS queue, from which we would then insert all the data into the redesigned DynamoDB table, in addition to the old table.
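During this period, the event processor wrote each incoming event to both tables. Here is a simplified sketch of that dual write, with hypothetical table names and event shape, and error handling omitted:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
old_table = dynamodb.Table("old-event-store")  # hypothetical names
new_table = dynamodb.Table("seshat-events")

def process_event(event):
    """Write an incoming event to both the old and the redesigned table."""
    # Old layout: keyed by event-id.
    old_table.put_item(Item={
        "event-id": event["event_id"],
        "transaction-id": event["transaction_id"],
        "advert-id": event["advert_id"],
        "created-at": event["created_at"],
        "service-name": event["service_name"],
    })
    # New layout: keyed by transaction-id with the compound sort key.
    new_table.put_item(Item={
        "transaction-id": event["transaction_id"],
        "created-at#service-name": f"{event['created_at']}#{event['service_name']}",
        "advert-id": event["advert_id"],
    })
```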
Between November 2020 and March 2021, Seshat kept growing but reads were still only coming from the old table. We only needed to have four months of event history, so after intensive testing, we pointed the reads to Seshat on March 1, 2021. We did a few final operations on the old DynamoDB table, stopped writing to it, and finally exported it to Amazon S3. The export to S3 took approximately 20 hours and afterwards we turned off everything related to the old event store, including dropping the old DynamoDB table.
The new DynamoDB architecture reduced our costs by 83%, but most importantly, we now have a table structure that’s perfectly aligned with the business, which allows us to expand to new use cases as needed. And as a final advantage, we still have all the old data since 2018 stored in S3, something that was requested by the business in case we needed to check old data for specific customer support cases.
Conclusion
Aligning and involving stakeholders in our architecture decisions was crucial to the success of the final architecture. It was from business requirements that we derived the final table design as well as the GSI design. It was also through stakeholder alignment that we were able to set up TTL on our table. All of these improvements ultimately helped ensure that we won’t again end up with a table of over 9 TB to manage.
Following DynamoDB best practices was also fundamental. While DynamoDB wasn’t a new service in 2018, it was new to us, and this experience showed us that knowing and using AWS services properly is fundamental to delivering future-proof architectures. This includes applying DynamoDB design best practices and keeping up with newly launched features by following the AWS News Blog and the DynamoDB documentation.
We completed the core of this work in less than two months, from the first call with our AWS Solutions Architect, Luís Rodrigues Soares, until we had a data model that we were confident could support our business use cases. We spent four months sending data to the redesigned table and making sure we had a solid process for keeping the existing data available after the migration before retiring the original table. We did this with no disruption to business, with a small team of three engineers.
The work we did opened up discussions with other parts of OLX Europe about using Seshat as the backbone for advertisement exports between internal OLX systems. These already include external partners, our real estate and car parts websites, and the global platform websites in Romania, Poland, and Portugal. Together, these integrations added an average of 800,000 advertisements and 15 million transactions per month without requiring any changes to the data model. All these results are a testament to a well-architected system, a reliable service, and a strong alignment between business and engineering.
To learn more about data modelling on DynamoDB, see the NoSQL design section of the documentation, which also includes best practices for partition keys, sort keys, and secondary indexes. For more information about DynamoDB, see the resources page for whitepapers, other blog posts, and webinars.
About the Authors
Miguel Alpendre is an Engineering Manager at OLX Group based in Lisbon, Portugal. Having worked in sectors such as e-commerce, governance, and industry over the past 14 years, he currently manages the engineering team responsible for Partner Inventory ingestion and lifecycle in the Real Estate classifieds domain. Miguel is passionate about topics such as Domain-Driven Design, Event-Driven Architecture, and Microservices Observability.
Rodrigo Lopes has been working as a software engineer for the last 17 years, in industries ranging from telecom through education to e-commerce, and across different locations (from Latin America to Europe). He started his journey with Go five years ago, and the gopher has been by his side ever since. Rodrigo is driving the adoption of Go in his current projects and coaches other engineers.
Carlos Tarilonte is a Senior Developer at OLX based in Lisbon, Portugal. He has more than 17 years of experience in the IT industry, specifically in architecture, cloud solutions, migrating complex systems, database design, and advocating engineering good practices.
Luís Rodrigues Soares works with Game Development Studios across Europe to help them build, run and grow their games on AWS. Besides AWS for Games, he has a passion for well crafted products, and uses his Software Engineering and CTO experience to help people make the best out of Amazon Web Services (AWS). Outside of work, he loves solving math and coding puzzles, and he writes fan fiction for his favorite video game, “Elite: Dangerous”.