Hyperconnect Doubles ML Model Training Efficiency with Amazon EC2 and Amazon EKS
2020
The internet has vastly changed our ability to connect. But the internet itself is just the vehicle; to fully experience these new levels of connectivity at maximum quality, reliability, and satisfaction, users need tech-driven innovators like Hyperconnect.
Founded in South Korea in 2014, Hyperconnect has become an industry leader by perfecting easy video chat, becoming the first company to develop Web Real-Time Communication (WebRTC) for mobile platforms. Its flagship app, Azar, has been downloaded 400 million times, and Hyperconnect’s various livestreaming services support its mission to create a connected society.
Building on a proven record of success and user satisfaction, Hyperconnect identified a need to further streamline its tech infrastructure: its AI-based image classification machine learning models used to recognize an Azar user’s environment took weeks to train. By transferring some of its research and computing workloads to Amazon Web Services (AWS), Hyperconnect dramatically reduced training time for its models, ultimately allowing users more control over their digital environments.
“Training time went from 4 weeks to a few hours on the AWS environment.”
Beomjun Shin
ML platform leader, Hyperconnect
Bottlenecked Research and Production
Like many innovative companies, Hyperconnect found that its creativity outpaced its computing capacity. On the production side, Hyperconnect’s machine learning models—which included image classification and voice conversion—took weeks to train with on-premises workstations. “We used a single machine for a single training,” says Beomjun Shin, Hyperconnect’s machine learning platform team leader. “So every workstation had four GPU devices. At that time, all we could do was use four GPUs at once.”
This meant waiting a week or more for a model to train, seeing the results, waiting another week for more training, and repeating this process until it was right. The downtime between trainings was not only tedious—it cost Hyperconnect time, money, and efficiency. Determined to consolidate operations and boost innovation, Shin and the Hyperconnect team sought out a cloud computing service to carry the load.
Selecting the Right Cloud Platform
Hyperconnect examined its options to determine which cloud platform was the right fit. It quickly became apparent that the choice was between Amazon Elastic Compute Cloud (Amazon EC2) and Google Cloud’s Tensor Processing Unit (TPU). After careful deliberation, Hyperconnect found several reasons to go with AWS. First, working with AWS meant the Hyperconnect team didn’t have to learn the custom code used by Google Cloud. Second, PyTorch, the open-source machine learning library Hyperconnect relies on, worked seamlessly with AWS but was far less compatible with Google Cloud.
There was also the issue of compatibility with Hyperconnect’s existing architecture: “AWS allowed us to set up a very similar environment to our on-premises machines,” says Sungjoo Ha, Hyperconnect director of AI. Confident in the fit, Hyperconnect moved forward with Amazon EC2.
Speeding Up Production with AWS
The results were immediate. With Amazon EC2, Hyperconnect was able to easily obtain and configure computing capacity, and the company saw a much quicker model training process. “The training time went from 4 weeks to a few hours on the AWS environment,” says Shin.
Amazon Elastic Kubernetes Service (Amazon EKS) also proved to be an important part of the architecture as Hyperconnect sought to deploy models frequently without sacrificing crucial resources. “We used Kubeflow with Amazon EKS and added the cluster autoscaler for managing Amazon EC2 instances cost effectively,” says Shin. “With Amazon EKS, we don’t have to manage Kubernetes manually. We can just focus on using Kubeflow.” With the combination of Amazon EC2 and Amazon EKS, Hyperconnect saw tangible results in increased efficiencies. “In terms of the training time, we usually see a linear reduction,” says Ha. “Adding 2x nodes leads to a 1.8x or 1.9x increase in efficiency.” In other words, doubling the number of nodes nearly doubled training throughput.
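The scaling figures Ha quotes can be restated as a scaling efficiency, that is, the share of ideal linear speedup actually achieved. The short Python sketch below only illustrates that arithmetic; the function name is hypothetical and is not part of Hyperconnect's tooling.

```python
def scaling_efficiency(speedup: float, node_factor: float) -> float:
    """Return the fraction of ideal linear speedup that was achieved.

    speedup:     measured speedup after scaling out (for example 1.8)
    node_factor: how many times more nodes were used (for example 2.0)
    """
    if node_factor <= 0:
        raise ValueError("node_factor must be positive")
    return speedup / node_factor


# Figures quoted in the case study: doubling nodes gives 1.8x to 1.9x.
low = scaling_efficiency(1.8, 2.0)
high = scaling_efficiency(1.9, 2.0)
print(f"scaling efficiency: {low:.0%} to {high:.0%}")
# prints "scaling efficiency: 90% to 95%"
```

A result of 100% would mean perfectly linear scaling; the quoted range sits close to that.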
PyTorch versus TensorFlow
While computing capacity is essential, the kind of training Hyperconnect does also requires robust machine learning libraries, and Hyperconnect relies on both PyTorch and TensorFlow to achieve the results it’s looking for.
TensorFlow 2.x is the better fit for deploying models to production on mobile devices, and Hyperconnect relies mainly on AWS-optimized TensorFlow, together with Horovod, for multinode training of its image classification models.
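For readers unfamiliar with data-parallel training, the idea behind Horovod is that each worker computes gradients on its own slice of the data, and an allreduce operation averages those gradients so that every worker applies the same update. The Python sketch below is a conceptual, single-process simulation of that averaging step. It is not Horovod's API and not Hyperconnect's code, and the gradient values are made up for illustration.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of per-worker gradients.

    This mimics the result every worker would hold after an
    averaging allreduce; the real communication is not simulated.
    """
    if not worker_grads:
        raise ValueError("need at least one worker")
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    if any(len(g) != n_params for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same shape")
    return [sum(g[i] for g in worker_grads) / n_workers for i in range(n_params)]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update applied identically on every worker."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical gradients from 4 workers for a 2-parameter model.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg_grads = allreduce_average(worker_grads)          # [4.0, 5.0]
new_weights = sgd_step([0.0, 0.0], avg_grads, 0.5)   # [-2.0, -2.5]
print(avg_grads, new_weights)
```

Because every worker ends up with the same averaged gradient, the model replicas stay in sync, which is what lets training time fall roughly in proportion to the number of nodes.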
Still, Hyperconnect turns to PyTorch in certain circumstances, such as for its two production workloads for face reenactment and speech synthesis. “Much of the open-source research code published online is in PyTorch,” says Ha, citing this as one of the reasons Hyperconnect uses PyTorch. For now, these resources work in concert to satisfy Hyperconnect’s diverse needs as it creates increasingly sophisticated tech.
Developing New Projects Faster with AWS
Perhaps the most striking application of Hyperconnect’s AWS-backed infrastructure is MarioNETte, one of the company’s newest projects. This innovative model can re-create dynamic faces from as little as a single image, which has a number of exciting potential uses such as video call filters and design overlays. To make this feasible, Hyperconnect needed to run a huge number of experiments in image synthesis and train models quickly to produce results in a reasonable amount of time.
“We extensively used Amazon EC2 P3 instances to try out different models, settings, hyperparameters, and sets of data to see which work and which don’t,” says Ha. “The sheer amount of compute that we required to make this work was more than we could account for if we only resorted to on-premises machines.” Hyperconnect relied on Amazon EC2 P3.16xlarge instances specifically to test out different hypotheses and zero in on the right configurations for its dynamic face models.
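To give a feel for how quickly such experiment sweeps grow, the Python sketch below enumerates a small grid of settings and spreads the runs round-robin across the 8 GPUs of a single P3.16xlarge instance. The search space, parameter names, and dataset labels are all hypothetical; the case study does not describe how Hyperconnect actually scheduled its experiments.

```python
from itertools import product
from typing import Dict, List

GPUS_PER_INSTANCE = 8  # a P3.16xlarge instance carries 8 V100 GPUs


def build_experiment_grid(search_space: Dict[str, list]) -> List[dict]:
    """Expand a search space into one config dict per combination."""
    keys = sorted(search_space)
    value_lists = [search_space[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


def assign_to_gpus(experiments: List[dict],
                   n_gpus: int = GPUS_PER_INSTANCE) -> List[List[dict]]:
    """Round-robin experiments over GPU slots; slot i runs its list in order."""
    slots: List[List[dict]] = [[] for _ in range(n_gpus)]
    for i, experiment in enumerate(experiments):
        slots[i % n_gpus].append(experiment)
    return slots


# Hypothetical search space: 3 x 2 x 2 = 12 experiments.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32],
    "dataset": ["set_a", "set_b"],
}
grid = build_experiment_grid(search_space)
slots = assign_to_gpus(grid)
for gpu_index, queue in enumerate(slots):
    print(f"GPU {gpu_index}: {len(queue)} experiment(s)")
```

Even this toy grid needs more than one pass over 8 GPUs. With the four-GPU workstations described earlier, the same sweep would take three passes, which helps explain why on-demand cloud capacity mattered for this project.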
Following the Road Map of Hyperconnect’s Success
Before AWS, it took Hyperconnect more than 4 weeks to train its machine learning models. By reaching into the cloud, the company reduced that time to a matter of hours. Hyperconnect has already seen the potential of its new architecture to supercharge the rate of innovation. It now also uses Amazon Elastic File System (Amazon EFS) as a storage backend for distributed training on Amazon EC2 instances and Amazon Simple Storage Service (Amazon S3) for saving and backing up datasets. The result is many benefits for the end user: ever more sophisticated video communication with enhanced control over digital environments, and features like MarioNETte that make global connectivity easier, more fulfilling, and more fun.
To learn more, visit https://thinkwithwp.com/ec2/instance-types/p3/.
About Hyperconnect
Founded in 2014, Hyperconnect specializes in applying new technologies based on machine learning to image and video processing and was the first company to develop WebRTC for mobile platforms. Azar, its flagship video communication app, has been downloaded more than 400 million times.
Benefits of AWS
- Reduced training time from weeks to hours
- Maintained compatibility with PyTorch
- Scaled training capacity at a lower cost
- Received customer-centric technical support
AWS Services Used
Amazon EC2 P3 and P3dn
Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications.
Amazon EKS
AWS makes it easy to run Kubernetes in the cloud with scalable and highly available virtual machine infrastructure, community-backed service integrations, and Amazon Elastic Kubernetes Service (Amazon EKS), a certified conformant, managed Kubernetes service.
Amazon EFS
Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
Get Started
Companies of all sizes across all industries are transforming their businesses every day using AWS. Contact our experts and start your own AWS Cloud journey today.