AWS HPC Blog
How Rivian modernized engineering simulation using AWS
This post was contributed by Ameya Kamerkar (Rivian), Vikram Pendyam (Rivian), Abhishek Chauhan (Rivian), Ajay Paknikar (AWS), Sandeep Sovani (AWS)
Figure 1. Rivian’s custom Amazon Electric Delivery Vehicle (EDV) (Credits: Rivian media kit)
In this post, we share how Rivian, a leading electric vehicle manufacturer, revolutionized their engineering simulation capabilities by migrating to AWS and developing SimOS, an innovative simulation orchestration system. This transformation not only solved immediate infrastructure challenges but also established a foundation for accelerated vehicle development through virtual engineering, setting a new standard for the automotive industry.
The Journey to Cloud HPC
Rivian is headquartered in Irvine, California, and is at the forefront of electric vehicle innovation. They produce the R1T pickup truck, R1S SUV, and the Electric Delivery Van (EDV) at their manufacturing facility in Normal, Illinois. They have a global workforce and ambitious plans involving the new R2 vehicle line. They’ve also established a joint venture with Volkswagen for electric vehicle software. With all of this, Rivian’s engineering needs are both substantial and growing fast.
In early 2020, Rivian encountered a critical challenge that would become the catalyst for their cloud migration. Their on-premises research and development IT infrastructure, including their HPC cluster, failed with no way to recover it. This failure threatened a six-month delay in their R1/EDV vehicle program timeline, a potentially devastating setback for a rapidly growing company.
Faced with this crisis, Rivian’s leadership recognized the need for a more robust and scalable solution. They turned to AWS to explore cloud-based alternatives that could not only resolve their immediate issues but also provide a foundation for future growth and innovation.
The urgency of the situation led to an accelerated proof-of-concept process, which was successfully completed in just three weeks. Based on the promising results, Rivian made the bold decision to fully migrate their production HPC environment to AWS by early Q2 2020. Details on this migration and the results it achieved are documented in a case study.
The business imperative for advanced simulation
The drive toward cloud HPC wasn’t just about solving infrastructure problems. It was rooted in a deeper understanding of the evolving landscape of automotive development and the critical role of engineering simulation in this process. Traditional automotive development relies heavily on physical prototypes, each costing between $2 million and $4 million. For a growing manufacturer like Rivian, focused on achieving positive margins while pushing the boundaries of electric vehicle technology, this approach was financially unsustainable and operationally impractical.
Rivian’s leadership recognized that expanding their virtual engineering capabilities through HPC was not just a technical necessity but a strategic imperative. By shifting more of their development process into the virtual realm, they could dramatically reduce the number of physical prototypes needed, accelerate their development cycles, and ultimately bring innovative vehicles to market faster and more cost-effectively.
This realization set the stage for a comprehensive transformation of Rivian’s approach to product engineering, with cloud-based HPC at its core. The goal was to create an environment where engineers could run complex simulations, analyze results, and iterate on designs with unprecedented speed and flexibility.
The evolution of Rivian’s cloud HPC system
Rivian’s initial deployment on AWS used an open-source solution called SOCA (Scale-Out Computing on AWS) along with various storage solutions including Amazon EFS. This provided an immediate solution to their pressing needs, allowing them to resume critical engineering work without the long delay that rebuilding an on-premises system would have entailed.
However, as the team became more familiar with cloud operations and AWS continued to develop new technologies, opportunities for optimization became apparent. Performance benchmarks revealed areas where different instance types or configurations could yield better results. Cost analysis showed potential for more efficient resource allocation, and feedback from engineers highlighted the need for more intuitive user interfaces and streamlined workflows.
Rivian’s IT team embarked on a series of collaborative workshops and brainstorming sessions with the AWS HPC teams. These proved invaluable and helped them understand AWS ParallelCluster and best practices for large-scale simulation workloads in the cloud.
From this, Rivian developed a strategy to improve their capabilities, focused on several key areas:
- Performance optimization: This involved careful selection and benchmarking of Amazon Elastic Compute Cloud (Amazon EC2) instance types and custom application settings for different simulation workloads. The team evaluated various CPU- and GPU-based instances, considering factors like memory bandwidth, network performance, and cost-efficiency for each type of simulation, and chose the best instance types for each of their applications.
- Cost management: Rivian implemented sophisticated resource allocation mechanisms, using features like EC2 Spot Instances and Amazon EC2 Auto Scaling groups. These optimized costs without sacrificing performance.
- User experience: Recognizing that the power of HPC is only valuable if engineers can easily access and use it, Rivian prioritized the development of intuitive interfaces and workflows.
- Data management: The team implemented a comprehensive data lifecycle management strategy, ensuring that simulation data was stored, accessed, and archived efficiently across its entire lifecycle.
This strategic approach laid the groundwork for their development of SimOS: Rivian’s simulation orchestration system.
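As a rough illustration of the performance and cost trade-off in the list above, the following sketch ranks candidate instance types by cost per simulation job. It is not Rivian's benchmarking method: the `Benchmark` class, the `cheapest_within_deadline` helper, the runtimes, and the hourly prices are all hypothetical placeholders, and the prices are not actual AWS pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One benchmark run of a solver on an instance type (hypothetical data)."""
    instance_type: str
    nodes: int
    runtime_hours: float
    hourly_price_usd: float  # per node; placeholder value, not real pricing

    @property
    def cost_usd(self) -> float:
        # Cost of one job = nodes x wall-clock hours x price per node-hour.
        return self.nodes * self.runtime_hours * self.hourly_price_usd


def cheapest_within_deadline(benchmarks, deadline_hours):
    """Return the lowest-cost benchmark that meets the deadline, or None."""
    eligible = [b for b in benchmarks if b.runtime_hours <= deadline_hours]
    return min(eligible, key=lambda b: b.cost_usd, default=None)


# Illustrative numbers only; assumed for this sketch.
BENCHMARKS = [
    Benchmark("c5n.18xlarge", nodes=8, runtime_hours=6.0, hourly_price_usd=4.0),
    Benchmark("c7a.48xlarge", nodes=4, runtime_hours=4.0, hourly_price_usd=10.0),
    Benchmark("hpc7a.96xlarge", nodes=4, runtime_hours=3.0, hourly_price_usd=8.0),
]

if __name__ == "__main__":
    for b in BENCHMARKS:
        print(f"{b.instance_type}: {b.runtime_hours} h, ${b.cost_usd:.2f} per job")
    best = cheapest_within_deadline(BENCHMARKS, deadline_hours=5.0)
    print("Selected:", best.instance_type if best else "none")
```

Comparing cost per job rather than price per hour matters because a more expensive instance that finishes sooner can still be the cheaper choice for a given simulation.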
The birth of SimOS
As Rivian’s cloud approach to HPC matured, a key insight emerged from user feedback: while AWS ParallelCluster provided superior technical capabilities, many engineers missed the intuitive interface of SOCA. This realization sparked the development of SimOS, a custom simulation orchestration system that would become the cornerstone of Rivian’s virtual engineering environment.
SimOS was designed to combine the power and flexibility of ParallelCluster with a user-friendly front end that would make advanced HPC capabilities accessible to all engineers, regardless of their expertise in cloud computing. The system offers a range of features that streamline the simulation process:
- Seamless job submission across multiple solvers allows engineers to focus on their work rather than infrastructure details.
- Integrating simulation results with visualization enables quick analysis and iteration.
- Automated resource allocation across AWS Regions and Availability Zones (AZs) ensures best performance, high availability, resilience, and cost-efficiency.
- Real-time job monitoring and analytics provide insights into resource utilization and job progress.
- Comprehensive data lifecycle management, from initial storage to long-term archiving.
Figure 2. Examples of SimOS’s easy-to-use user interface.
The development of SimOS was a collaborative effort, involving close coordination between Rivian’s IT team, Rivian simulation engineers, and AWS specialists. The result is a system that meets Rivian’s current needs but is flexible enough to adapt to future requirements as the company continues to grow and evolve.
Technical architecture
Rivian’s current HPC architecture uses several AWS services to create a robust, scalable, and high-performance environment. At its core, AWS ParallelCluster with Slurm handles job orchestration, allowing for efficient distribution of workloads across a dynamically scaling cluster of compute instances.
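To make the job orchestration step more concrete, here is a minimal sketch of how a front end might render a Slurm batch script before handing it to `sbatch`. This is an assumption about what such a layer could look like, not SimOS's actual implementation. The `#SBATCH` directives are standard Slurm options, while the `render_sbatch_script` helper, the partition name, and the solver command are hypothetical.

```python
def render_sbatch_script(job_name, partition, nodes, tasks_per_node, solver_command):
    """Render a Slurm batch script as a string.

    The partition name and solver command are supplied by the caller and are
    purely illustrative in this sketch.
    """
    if nodes < 1 or tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "#SBATCH --output=%x-%j.out",  # %x expands to job name, %j to job ID
        "",
        # srun launches the solver tasks on the allocated nodes.
        f"srun {solver_command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 nodes with 96 tasks each.
    print(render_sbatch_script("crash-sim", "compute", 4, 96,
                               "./solver --input model.inp"))
```

Generating the script from a few parameters is one way a user interface can hide scheduler details from engineers while still submitting ordinary Slurm jobs underneath.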
To maximize resource availability and resilience, they deployed the infrastructure across multiple AWS Regions. This ensures that Rivian can always access the compute resources they need, even in the face of regional capacity constraints or unexpected events that impact availability.
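The idea of falling back to another Region when capacity is constrained can be sketched as follows. The post does not describe how Rivian makes this decision, so the `pick_region` helper, the capacity dictionary, and the Region names used here are assumptions chosen only for illustration.

```python
def pick_region(required_nodes, available_nodes_by_region, preference_order):
    """Return the first preferred Region with enough free capacity, else None.

    `available_nodes_by_region` stands in for whatever capacity signal a real
    system would use; here it is a plain dict supplied by the caller.
    """
    for region in preference_order:
        if available_nodes_by_region.get(region, 0) >= required_nodes:
            return region
    return None


if __name__ == "__main__":
    # Hypothetical capacity snapshot and preference order.
    capacity = {"us-west-2": 2, "us-east-1": 16}
    order = ["us-west-2", "us-east-1"]
    print(pick_region(8, capacity, order))
    print(pick_region(2, capacity, order))
```

In this sketch a small job stays in the preferred Region, while a larger job moves to the next Region that can satisfy it. A production system would also need to weigh data locality and transfer cost.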
The storage infrastructure combines two solutions. They use Amazon FSx for Lustre for high-performance parallel file system access, which is crucial for I/O-intensive simulation workloads, and Amazon FSx for NetApp ONTAP with FlexCache for more general-purpose storage needs. This hybrid approach allows Rivian to balance performance and cost across different types of data and workloads.
Visualization is particularly important in post-processing of simulation results, so Rivian uses Amazon EC2 G4 and G5 instances with Amazon DCV, which is a remote display protocol custom-designed for graphics intensive remote visualization applications. These instances are strategically located close to the centralized storage, minimizing data transfer times and enabling smooth, interactive visualization experiences.
For their compute infrastructure, Rivian is moving from older-generation C and R instances to newer-generation C and R instances (C7a, R7a) and HPC instances (Hpc7a), tracking new developments from both AWS and CPU makers. The new architecture lets them switch instance types easily, so they can evaluate new instances as soon as they launch.
Monitoring and observability rely on Amazon CloudWatch integrated with Grafana dashboards. These give Rivian’s IT team real-time insight into the performance and health of their HPC infrastructure, allowing for proactive management and optimization.
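One kind of metric such dashboards could track is queue wait time per job. The sketch below derives it from job submit and start timestamps. It does not reflect Rivian's actual dashboards: the record layout, the field names, and the `summarize_waits` helper are hypothetical, and publishing the values to a monitoring service is left out.

```python
from datetime import datetime
from statistics import mean


def queue_wait_minutes(submit_time, start_time):
    """Minutes between job submission and job start (ISO 8601 strings)."""
    submitted = datetime.fromisoformat(submit_time)
    started = datetime.fromisoformat(start_time)
    return (started - submitted).total_seconds() / 60.0


def summarize_waits(jobs):
    """Return mean and maximum queue wait in minutes for a list of job records."""
    waits = [queue_wait_minutes(j["submit"], j["start"]) for j in jobs]
    if not waits:
        return {"mean_minutes": 0.0, "max_minutes": 0.0}
    return {"mean_minutes": mean(waits), "max_minutes": max(waits)}


# Hypothetical job records; field names are illustrative.
JOBS = [
    {"job_id": "1001", "submit": "2024-01-15T09:00:00", "start": "2024-01-15T09:02:00"},
    {"job_id": "1002", "submit": "2024-01-15T09:10:00", "start": "2024-01-15T09:16:00"},
]

if __name__ == "__main__":
    print(summarize_waits(JOBS))
```

Tracking the maximum alongside the mean helps surface the occasional job that waits far longer than typical, which an average alone can hide.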
Figure 3: Architectural elements of Rivian’s new cloud HPC system.
Business results and impact
The transformation of Rivian’s HPC infrastructure has delivered remarkable benefits across multiple areas of the business. Through the implementation of data lifecycle management and strategic use of different storage tiers, they’ve achieved substantial reductions in storage costs.
At the same time, performance improved across successive generations of HPC instances, from c5n through hpc7a, enabling more complex simulations and faster turnaround times.
Perhaps the most significant outcome is that the cost savings and efficiency gains allowed Rivian to reinvest in their virtual engineering capabilities. They expanded their portfolio of software licenses for computer aided engineering (CAE) and this enabled a broader use of simulation across different disciplines and stages of the vehicle development process.
The elimination of queue wait times had a profound impact on product development cycles. Engineers can now run more design iterations, identify (and resolve) issues earlier in the development process, and collaborate more effectively through centralized data management. This acceleration of the development cycle is crucial for Rivian as they work towards the launch of their R2 vehicle line and expand their partnership with Volkswagen.
Rivian CIO Ger Dwyer said, “… AWS has been a true efficiency multiplier for Rivian. By moving HPC workloads to the cloud, we’ve reduced costs, improved performance, and enabled our engineers to focus on what matters most, designing the future of electric mobility.”
Figure 4: A screenshot of the SimOS interface showing ability to visually compare results of multiple simulation runs.
Looking ahead: the future of virtual engineering at Rivian
As Rivian prepares for the launch of their R2 vehicle line in 2026, their enhanced HPC capabilities are playing a crucial role in accelerating development while maintaining the highest quality standards. The company’s forward-looking strategy includes several key initiatives.
SimOS evolution: The platform continues to evolve, with ongoing enhancements to the user interface and experience. Plans include the integration of additional solvers and simulation types, expansion of data analytics capabilities, and implementation of advanced workflow automation features.
Artificial intelligence integration: Rivian is exploring various applications of AI in their virtual engineering processes. These include developing smart scheduling algorithms to optimize resource use, integrating machine learning to speed up simulation (ML4CAE), and exploring generative AI applications in vehicle design and engineering.
Expanded collaboration: Building on the success of their joint venture with Volkswagen, Rivian is looking at ways to leverage their cloud HPC infrastructure to facilitate collaboration with partners and suppliers, potentially revolutionizing the automotive supply chain.
Sustainability focus: In line with their commitment to sustainability, Rivian is working with AWS to optimize the energy efficiency of their cloud HPC workloads, exploring the use of carbon-aware computing techniques.
Conclusion
Rivian’s journey started with an on-premises HPC crisis but evolved into a sophisticated new simulation environment demonstrating the transformative potential of cloud computing in automotive engineering. Through close collaboration with AWS, and innovative solutions like SimOS, Rivian has created a virtual engineering platform that exceeds their needs and positions them for future growth and innovation.
The success of this transformation goes beyond technical achievements: it represents a fundamental shift in how automotive development can be done, making it more efficient, cost-effective, and agile. As the automotive industry continues to electrify, Rivian’s approach to cloud HPC provides a blueprint for other manufacturers looking to accelerate their development cycles, too.
By working with AWS and developing tailored solutions like SimOS, Rivian overcame some big challenges and positioned themselves at the forefront of automotive innovation. As they keep pushing boundaries of electric vehicle technology, their cloud HPC infrastructure will remain a critical enabler of their ambitious goals.