AWS for M&E Blog
Unleash the power of AI rendering on AWS to save time and cost
In the realm of 3D animation and visual effects, rendering is a crucial yet computationally demanding process. The intricate details and high resolutions required for stunning visuals often lead to exorbitant rendering times and costs. As the demand for high-quality computer-generated imagery (CGI) rises across industries, traditional brute-force rendering approaches strain hardware resources and drive up energy consumption.
This blog post is a thought experiment exploring how machine learning (ML) algorithms can potentially offer cost and time savings by intelligently analyzing and optimizing the rendering process. By learning from real-world datasets, ML algorithms can approximate complex lighting, materials, and details, and speed up rendering. Through techniques like frame interpolation, denoising, and upscaling, implementing ML in content production can significantly reduce computation and intermediate storage requirements while maintaining visual fidelity.
ML-based denoising is already implemented in many commercial and open-source CGI products, demonstrating its practical benefits. This advancement lets animators and designers iterate and experiment faster, fostering greater creativity and innovation. With the demand for CGI increasing each year, adopting ML algorithms in rendering is an industry imperative, addressing long-standing challenges and unlocking new creative possibilities.
Augment video rendering with machine learning
To demonstrate how ML can reduce rendering time and cost and speed up rendering iterations, we use an ML frame interpolation model that takes two images and synthesizes intermediate images between the pair. Areas of fast movement between images often pose a major challenge in building such ML models. We chose a deterministic pre-trained ML interpolation model trained on open-source video clips from vimeo.com that cover a wide variety of scenes and actions.
The traditional workflow for a CGI animated feature film involves rendering at least 170K frames (for 2 hours of animation at 24 frames/second). In a production workflow, each frame typically comprises multiple image layers, each requiring its own render pass. Each layer is further rendered multiple times as artists iterate. For this thought experiment, we keep things simple by assuming a single layer with a single pass through the entire movie.
With ML interpolation, only every other frame is initially rendered, and the rest are generated by ML, reducing the initial rendering computation by 50%. However, ML interpolation may yield low-quality frames in areas of large motion, which artists must identify during render checking and send back for direct re-rendering. The number of re-rendered frames is usually only a small fraction of the ML-interpolated frames, though it depends on model performance.
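To make the savings concrete, here is a minimal back-of-the-envelope sketch in Python. The per-frame render time and ML failure rate are illustrative assumptions drawn from the figures discussed in this post, not measured production numbers.

# Rough estimate of render hours saved by interpolating every other frame.
# All constants are illustrative assumptions, not production measurements.
FPS = 24
RUNTIME_HOURS = 2
HOURS_PER_FRAME = 4          # single-pass render time cited for Picchu
ML_FAILURE_RATE = 0.04       # fraction of interpolated frames sent back for re-rendering

total_frames = RUNTIME_HOURS * 3600 * FPS            # ~172,800 frames
rendered = total_frames // 2                         # every other frame rendered up front
interpolated = total_frames - rendered
re_rendered = int(interpolated * ML_FAILURE_RATE)    # failed interpolations re-rendered

traditional_hours = total_frames * HOURS_PER_FRAME
ml_assisted_hours = (rendered + re_rendered) * HOURS_PER_FRAME

print(f"Total frames:           {total_frames:,}")
print(f"Frames fully rendered:  {rendered + re_rendered:,}")
print(f"Render hours saved:     {traditional_hours - ml_assisted_hours:,}")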
Machine learning interpolation
For demonstration purposes, we selected the award-winning animated short film Picchu, produced by Amazon Web Services (AWS), because we had access to its final, high-resolution, uncompressed EXR image files. Working with EXR avoids the loss of detail introduced by compressed or rescaled video files. While the production process on Picchu followed typical workflows, with multiple rendered layers composited together, each frame of this video would have required upwards of 4 hours to render in a single pass.
A short 7-second clip of the original is depicted in Figure 2, where we see a high-resolution, vibrant color landscape and a girl tripping over a rock. In Figure 3, we see the same short clip, but with 50% of the frames (every other frame) replaced with an ML interpolation of the expected frame. This ML interpolation uses the deterministic model described previously and finishes processing in 25 seconds on the 8 GPUs of a p3dn.24xlarge instance. Because the model is deterministic, the interpolation always yields the same results for a given pair of frames. Until a shot is reviewed, there’s little need to store the generated ML frames, which saves on storage space while iterating.
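The mechanics of that loop are sketched below. Here, load_exr and interp_model are placeholders for an EXR reader and the pre-trained interpolation network; neither is the API of a specific product, and the sketch only illustrates the render-half, interpolate-half pattern.

# Given an ordered list of frame paths where only even-indexed frames were
# rendered, synthesize the odd-indexed frames from their neighbors.
def interpolate_missing_frames(frame_paths, load_exr, interp_model):
    generated = {}
    for i in range(0, len(frame_paths) - 2, 2):
        frame_a = load_exr(frame_paths[i])
        frame_b = load_exr(frame_paths[i + 2])
        # The model is deterministic: re-running this call reproduces the same
        # middle frame, so results can be discarded and regenerated on demand.
        generated[i + 1] = interp_model(frame_a, frame_b)
    return generated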
A casual viewer may not notice that any part of this video was created using an ML algorithm. A frame-by-frame review does show that the frames exhibiting fast action present clear ML errors, which we discuss in more detail in the next section. A total of 4 or 5 frames are too poor for use and should be re-rendered traditionally. However, this represents a nearly 96% success rate for the interpolation, or around 48% of the frames retained without the need for full rendering.
Figure 2 Short clip from the original Picchu animation.
Figure 3 Interpolated video in which 50% of the frames (every other frame) use an ML interpolation instead of a rendered frame.
Finding errors
To maintain consistent image quality across the animation, it’s crucial to identify when ML frame interpolation falls short and full rendering is required. In our experiment, we still have access to the original frames from Picchu and can compare them to the ML-interpolated frames to visualize the differences, helping us spot errors. However, we can only do this because we have access to the very frames we’re trying not to render in the first place. When experimenting with ML, this canonical data to compare against is critical, and it lets us evaluate the effectiveness of our interpolated frames. In Figure 4, we’ve visualized the errors between the traditionally rendered frames and the interpolated frames by creating a video highlighting differences: black areas show no error, while gray to white areas indicate increasing error. Significant errors occur during fast motions, such as when the girl trips.
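A minimal sketch of that comparison, assuming both frames are available as floating-point NumPy arrays of the same shape (for example, decoded from the EXR files):

import numpy as np

def error_heatmap(rendered, interpolated):
    """Per-pixel squared error, normalized so 0 = black (no error) and 1 = white."""
    sq_err = (rendered.astype(np.float64) - interpolated.astype(np.float64)) ** 2
    per_pixel = sq_err.mean(axis=-1)     # average over color channels
    peak = per_pixel.max()
    return per_pixel / peak if peak > 0 else per_pixel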
With confirmation that areas of greater motion show increased error, we look for ways to spot ML interpolation errors without the aid of fully rendered frames. We found that “optical flow”, which estimates motion between frames, positively correlates with interpolation errors (Figure 5). By rendering an optical flow layer, we can direct artists to problem areas or automatically schedule traditional rendering when the estimated error exceeds a certain threshold.
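The sketch below shows one way to compute that heuristic with OpenCV’s Farneback dense optical flow. The 8-bit input assumption and the threshold value are ours; in practice the threshold would be tuned against shots reviewed by artists.

import cv2
import numpy as np

def flow_angle_std(prev_rgb, next_rgb):
    """Circular standard deviation of dense optical flow angles between two
    8-bit RGB frames. Higher values tend to coincide with larger ML errors."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    angles = np.arctan2(flow[..., 1], flow[..., 0])
    resultant = np.hypot(np.cos(angles).mean(), np.sin(angles).mean())
    return float(np.sqrt(-2.0 * np.log(max(resultant, 1e-12))))

def needs_full_render(prev_rgb, next_rgb, threshold=1.0):
    # Illustrative threshold; schedule a traditional render when exceeded.
    return flow_angle_std(prev_rgb, next_rgb) > threshold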
Our initial experiment focused on generating a single frame between two traditionally rendered frames, but preliminary findings suggest that interpolating in smaller incremental steps, and thus generating additional frames between each rendered pair, can result in a more accurate middle frame. Because each ML-interpolated frame is generated so quickly compared to traditional rendering, taking smaller incremental steps to achieve a more accurate output scarcely impacts the overall time comparison. Future work includes multi-step interpolation and developing an ML classifier to automatically identify frames requiring rendering based on these metrics or human feedback.
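One plausible reading of that multi-step refinement is sketched below; interp_model(a, b) is the same placeholder midpoint interpolator as above, and the exact scheme is an assumption rather than the approach evaluated in this post.

def refined_middle_frame(frame_a, frame_b, interp_model):
    mid = interp_model(frame_a, frame_b)            # t = 0.5, single-step estimate
    quarter = interp_model(frame_a, mid)            # t = 0.25
    three_quarter = interp_model(mid, frame_b)      # t = 0.75
    # Motion between the quarter-step frames is smaller, so the re-derived
    # middle frame is often more accurate than the single-step estimate.
    return interp_model(quarter, three_quarter)     # refined t = 0.5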
Figure 4 The squared error of the difference between the rendered frame and the ML interpolation of the frame.
Figure 5 Standard deviation of optical flow angles. Areas with high variability in flow directions tend to have larger errors.
AWS architecture
Performing ML interpolation for a full-length, high-resolution film is affordable and scalable on AWS by leveraging the elastic computing power of AWS Deadline Cloud. Deadline Cloud manages the rendering compute infrastructure and offers the flexibility of integrating with your existing AWS infrastructure. As depicted in Figure 6, an artist can use their standard rendering application to connect to render nodes in the cloud with Deadline Cloud. Rendered frames are saved to an Amazon S3 bucket. When a Deadline Cloud job finishes, it can trigger an AWS Lambda function that runs Python code to initiate Amazon SageMaker inference with a container that includes the ML interpolation model.
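A minimal sketch of that Lambda glue code is shown below, using a SageMaker asynchronous inference endpoint because the high-resolution EXR inputs are large. The event shape, S3 locations, and endpoint name are assumptions for illustration.

import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # Assumed payload: an S3 URI pointing at a manifest of rendered frame pairs.
    input_location = event["frames_manifest_s3_uri"]
    response = sm_runtime.invoke_endpoint_async(
        EndpointName="ml-frame-interpolation",   # hypothetical endpoint name
        InputLocation=input_location,
        ContentType="application/json",
    )
    # Interpolated frames are written to the endpoint's configured S3 output path.
    return {"statusCode": 200,
            "body": json.dumps({"outputLocation": response["OutputLocation"]})}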
The ML interpolation of high-resolution images requires a large amount of GPU vRAM, averaging around 25 GB for one interpolated frame at 5K resolution. Hence, at minimum we need to use a p3dn.24xlarge instance. Since each frame interpolation is independent of the others, we can scale out across many GPUs to process the ML interpolation of an entire movie in a few minutes versus weeks of traditional rendering. For this single 7-second shot, all 130 frames were interpolated in less than 25 seconds. The user can review each interpolated frame saved to Amazon S3 or encode the frames into a video ready for viewer consumption. With the deterministic model used here, frames can be deleted and regenerated only when required.
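Because each frame pair is independent, fanning the work out is straightforward. The sketch below assigns pairs round-robin to the 8 GPUs of a single p3dn.24xlarge; interpolate_on_gpu is a placeholder for loading the model on one GPU and running a single interpolation.

from concurrent.futures import ProcessPoolExecutor

NUM_GPUS = 8  # a p3dn.24xlarge exposes 8 GPUs

def interpolate_shot(frame_pairs, interpolate_on_gpu):
    with ProcessPoolExecutor(max_workers=NUM_GPUS) as pool:
        futures = [pool.submit(interpolate_on_gpu, pair, i % NUM_GPUS)
                   for i, pair in enumerate(frame_pairs)]
        return [f.result() for f in futures]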
Summary
CGI rendering is a time-consuming and resource-intensive process, leading to high costs for animation and visual effects studios, with demand also increasing in industries such as industrial design, architecture, engineering, and construction. Machine learning algorithms have the potential to revolutionize rendering by intelligently analyzing and optimizing the process, reducing computation time and iterative storage requirements. We used AWS Deadline Cloud and GPU instances to perform efficient ML-based frame interpolation, successfully reducing the number of traditionally rendered frames required for a high-resolution animation while maintaining visual quality. Adopting ML for rendering on scalable cloud infrastructure like AWS promises to drive innovation, improve efficiency, and unlock new creative possibilities in the animation and visual effects industries.