Videology continually strives to process more detailed data more frequently, so its customers can run advertising campaigns more effectively. To do this, the company relies on a processing platform based on the Apache Hadoop open-source big-data framework. The organization’s Hadoop cluster takes raw data from various systems; cleans, enhances, and aggregates it; and sends it to an optimization engine and to a reporting system for customers. The cluster processes 15 TB of data each day. Videology works with technology company Cloudera, which provides Hadoop cluster management and technical support, and administers the cluster with the Cloudera Manager tool.
For agility and scalability, Videology decided to run its Hadoop cluster on the Amazon Web Services (AWS) Cloud. “AWS is the most prominent player in the cloud industry, and by using the AWS Cloud, we were able to scale our cluster quickly and provide low latency across the globe for our customers,” says David Ortiz, senior software engineer for Videology.
Videology initially ran the cluster on 23 Amazon Elastic Compute Cloud (Amazon EC2) instances, using local instance storage to meet its disk-performance and memory needs. However, the company ran into challenges with instance storage. “The Amazon EC2 instance types provided a fixed amount of local storage, so the amount of data we could store was coupled with the instance we were running,” Ortiz says. “That meant we had challenges because we couldn’t optimize the amount of memory we had. We were also starting to have disk-throughput issues, which slowed down our data-ingestion and aggregation processes.” The company also sought to reduce costs. “It was costing us a lot to add storage capacity, because we had to add new Amazon EC2 instances every time,” says Paul Frederiksen, the organization’s principal DevOps engineer.
To solve these challenges, Videology knew it needed to move to a combination of current-generation instance types and Amazon Elastic Block Store (Amazon EBS), so it could provision the right balance of compute, memory, and storage for its Hadoop implementation.
As it evaluated new instance types, Videology considered both Amazon EC2 D2 instances with local instance storage and Amazon EC2 M4 instances with Amazon EBS volumes. Although the EC2 D2 instances provided local instance storage, they did not offer enough memory for the company’s needs. The organization realized that by pairing EC2 M4 instances with Amazon EBS, it could right-size its instances, matching compute, memory, and storage to the workload.
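For illustration only, candidate instance types can be compared programmatically. The minimal boto3 sketch below prints the vCPU and memory profile of two instance sizes; the specific sizes are examples, not necessarily the shapes Videology evaluated.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Compare the compute/memory profiles of two candidate instance types.
# The instance sizes here are illustrative examples.
resp = ec2.describe_instance_types(InstanceTypes=["d2.4xlarge", "m4.4xlarge"])
for it in resp["InstanceTypes"]:
    vcpus = it["VCpuInfo"]["DefaultVCpus"]
    mem_gib = it["MemoryInfo"]["SizeInMiB"] // 1024
    print(f"{it['InstanceType']}: {vcpus} vCPUs, {mem_gib} GiB memory")
```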
To address its performance and cost challenges, Videology decided to deploy Amazon EBS Throughput Optimized HDD (st1) volumes. These low-cost, persistent block-storage volumes are backed by hard disk drives (HDDs) and attach to Amazon EC2 instances, delivering the high-throughput storage Videology’s big-data workload requires.
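As a minimal sketch of provisioning such a volume with boto3, the example below creates a tagged st1 volume; the region, Availability Zone, size, and tag values are hypothetical placeholders rather than Videology’s actual configuration.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# Create a Throughput Optimized HDD (st1) volume. The AZ must match the
# target instance's AZ; the size and tag are illustrative.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=2048,          # size in GiB
    VolumeType="st1",   # Throughput Optimized HDD
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "Name", "Value": "hadoop-datanode-data-01"}],
    }],
)
print(volume["VolumeId"])
```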
Using Amazon EBS st1 volumes, Videology can increase its storage capacity without adding Amazon EC2 instances. “By moving to Amazon EBS st1 volumes, we get a more favorable ratio of compute cores to memory, in addition to gaining better disk throughput,” says Frederiksen. “With these capabilities, we can right-size our Hadoop cluster to optimize compute and memory independently of storage capacity.” Videology also uses the Elastic Volumes feature of Amazon EBS to dynamically increase capacity, tune performance, and change the type of live volumes with no downtime, which lets the company easily adapt to the changing storage needs of its Hadoop cluster. In addition, the company takes advantage of Amazon Simple Storage Service (Amazon S3) to stage data before it is sent to Hadoop.
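A minimal sketch of the Elastic Volumes workflow in boto3 follows; the volume ID and target size are placeholders, and after the resize completes the filesystem on the instance still has to be extended (for example, with resize2fs or xfs_growfs).

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical volume ID

# Grow a live volume in place; no detach or downtime is required.
ec2.modify_volume(VolumeId=VOLUME_ID, Size=4096)  # new size in GiB

# The modification proceeds asynchronously; poll its state if needed.
mods = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
print(mods["VolumesModifications"][0]["ModificationState"])
```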
Using Amazon EBS, which decouples storage from Amazon EC2 instances, Videology has more flexibility in managing its Hadoop cluster. “Amazon EBS separates compute resources from storage, so we can make better choices about CPU and memory utilization,” says Frederiksen. “And if an instance goes down, we can quickly detach the volume, launch a new instance, and reattach the data, without having to rebuild the node. In our previous environment, we would lose all the data if we lost an instance. In addition, it would take us at least a day to remove the data from the instance and get it ready for use again. Now, that process takes just a few hours.”
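That recovery pattern might look like the boto3 sketch below; the volume and instance IDs and the device name are hypothetical, and a real workflow would also drain the Hadoop node and remount the filesystem on the replacement instance.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"          # hypothetical IDs
FAILED_INSTANCE = "i-0aaaaaaaaaaaaaaa0"
REPLACEMENT_INSTANCE = "i-0bbbbbbbbbbbbbb0"

# Detach the data volume from the failed node (Force only if it is hung).
ec2.detach_volume(VolumeId=VOLUME_ID, InstanceId=FAILED_INSTANCE, Force=True)

# Wait until the volume is free, then attach it to the replacement node.
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])
ec2.attach_volume(
    VolumeId=VOLUME_ID,
    InstanceId=REPLACEMENT_INSTANCE,
    Device="/dev/sdf",  # device name as exposed to the instance
)
```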
Videology has also improved the processing performance of its Hadoop cluster with Amazon EBS st1 volumes, which proved the right fit for the company’s throughput-intensive workloads. “Moving to Amazon EBS st1 volumes gave us faster disk throughput,” says Ortiz. “We no longer have the bottleneck we used to have, so our data ingestion and aggregation processes are faster. As a result, we can speed our data analysis and ultimately get advertising data to our customers faster.”
In addition, the stronger performance of the Hadoop cluster has contributed to time savings. “It used to take us 40 minutes to copy data into Amazon S3, but now it’s down to less than seven minutes,” says Ortiz. Videology has also eliminated the need to rebuild nodes. With Amazon EBS, data persists on the volume even if an Amazon EC2 instance goes down, so the company only needs to restart an instance, avoiding the time-consuming process of bringing data from Amazon S3 back into local storage. “We were seeing nodes fail at an average of one per month,” Ortiz says. “Then we would have to alert our customers that their reporting data would be delayed, because it would take us several days to fix the problem and create new nodes. Since moving to Amazon EBS volumes, we haven’t had to rebuild data because of node failures.”
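For illustration, one way to push files to Amazon S3 from Python is the boto3 sketch below, which uses multipart uploads with higher concurrency to improve throughput. The bucket, key, file path, and tuning values are hypothetical, and a Hadoop cluster would more likely use bulk tooling such as DistCp or S3DistCp for copies at this scale.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart-upload settings tuned for throughput; values are illustrative.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # use multipart above 64 MiB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MiB parts
    max_concurrency=16,                    # parallel part uploads
)

# Hypothetical local file, bucket, and key.
s3.upload_file(
    "/data/aggregates/part-00000.gz",
    "example-videology-reporting",
    "aggregates/2017-06-01/part-00000.gz",
    Config=config,
)
```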
The company has realized considerable cost savings by moving to Amazon EBS volumes. “We were able to save $15,000 per month, increase our available storage by five percent, and turn off eight server nodes by moving to Amazon EBS st1 volumes to support our Hadoop cluster,” says Ortiz. “We increased the amount of storage for each Amazon EC2 instance and reduced the total number of instances. Now, we simply buy and expand storage when we need to, without having to purchase additional instances to get it. Previously, we would have had to add 10 more nodes to get the amount of storage we have now, which would have been a significant cost increase.”
Videology can also more easily scale its Hadoop cluster. “Whenever the business wanted to grow capacity, we would have to add a bunch of instances to continue growing the cluster,” says Ortiz. “That would take several days and cost hundreds of additional dollars a month. We’re much more agile now, and we can move faster as a business to keep pace with our growth.”
Learn more about AWS big data, analytics, and business-intelligence solutions.