AWS HPC Blog
Electronic design at the speed of Lightmatter: transforming EDA workloads with RES
This post was contributed by Bilal Hussain, Head of Infrastructure, Lightmatter.co, Cole Calistra, Principal Startup Solutions Architect at AWS, and Ritz Martinovic, Principal Technical Product Manager for AWS Advanced Computing and Simulation.
If you’ve ever wondered how your electronic gadgets and computer chips get designed, the answer lies in the world of electronic design automation (EDA). EDA is a comprehensive suite of software tools that engineers use to design, test, and optimize electronic systems, from tiny integrated circuits all the way up to massive circuit boards. The EDA process is a multi-stage journey that every design has to go through. It starts with engineers using hardware description languages, like SystemVerilog at Lightmatter, to draft new designs against a set of size, power, and performance targets. From there, we iterate on the design by running massively parallel simulations that test it across a wide range of configurations and parameters.
These simulations involve immense data and compute requirements. A single System on a Chip (SoC) design can have databases over 1 TB in size, demand massive IOPS and throughput to process thousands of small files, and need 10,000 cores or more for regression testing. Finally, once the designs are fully validated, they’re prepped for manufacturing and fabricated into the actual parts that make our digital world possible. It’s a complex dance, but EDA tools help choreograph it all seamlessly.
In this post, we’ll talk about how Lightmatter used AWS, specifically Research and Engineering Studio on AWS (RES) and AWS ParallelCluster, to meet the demanding requirements of its EDA hardware design platform.
Who is Lightmatter?
Lightmatter is a startup that specializes in developing advanced computing solutions using photonic technology, harnessing the power of lasers. The company has projects focusing on data transport and optical processors that leverage the unique properties of light to achieve high-performance computing with significantly lower power consumption. It’s critical that engineers at Lightmatter can move quickly and deploy the compute resources they need to run complex workloads without delay. Having a fast, reliable, and scalable environment contributes to their productivity and happiness at work.
Lightmatter aims for its infrastructure to be part of the reason great engineers want to come and work with the company.
Lightmatter’s infrastructure needs expanded rapidly last year after its latest round of funding, so the company looked closely at the options available. Lightmatter’s constantly evolving needs made an on-premises deployment impractical. Hybrid solutions (part on-premises, part cloud) were also inefficient, because engineers want both the source and the results on the same system for easy analysis. This pushed Lightmatter to a pure cloud-based solution. After evaluating different cloud providers, the company decided on AWS because the team there was well versed in the challenges of semiconductor design workflows and offered mature HPC solutions.
Lightmatter’s requirements
Lightmatter needed a system that could provide desktops for engineers with secure authentication. In addition, the company required the capability to spin up large storage and compute queues for big simulations, which can each require up to 3 TB of RAM and 96 vCPUs.
Lightmatter chose to use x2iedn.24xlarge and r7i.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance types for these workloads.
There was also a requirement for regression testing, which can comprise thousands of individual workloads, each with its own vCPU requirements. Here, the vast array of instance types that Amazon EC2 provides enabled Lightmatter to right-size each workload by pairing it with instances that are compute optimized, memory optimized, storage optimized, general purpose, or even GPU-accelerated.
Rather than a traditional, centrally controlled IT infrastructure, Lightmatter needed a system where engineers could each configure the resources they needed quickly and efficiently, without having to wait for someone else to deploy a system for them.
The solution
Lightmatter worked closely with their AWS account team and developed a solution that uses the following components (a minimal configuration sketch follows the list):
- Research and Engineering Studio on AWS (RES) to create virtual workstations
- AWS ParallelCluster to manage HPC clusters
- The Slurm scheduler to manage job queues
- Amazon FSx for OpenZFS for fast shared storage
- Jenkins for automatic task scheduling
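To give a flavor of how these pieces fit together, the sketch below drives the ParallelCluster 3 CLI from Python to create a small Slurm cluster with a memory-optimized queue, a general-purpose queue, and an FSx for OpenZFS mount. It’s a minimal illustration, not Lightmatter’s configuration: the subnet and volume IDs, key pair, queue names, and instance counts are placeholders, and a real deployment would add the custom AMI shared with RES, networking and IAM settings, and more queues.

```python
#!/usr/bin/env python3
"""Create a small EDA cluster with AWS ParallelCluster.

A minimal sketch, not Lightmatter's actual configuration: the subnet, volume,
and key pair IDs, queue names, and instance counts are placeholders. Requires
the ParallelCluster 3.x CLI (`pip install aws-parallelcluster`).
"""
import pathlib
import subprocess

CLUSTER_CONFIG = """\
Region: us-east-1
Image:
  Os: alinux2                      # a real setup would use the custom AMI shared with RES
HeadNode:
  InstanceType: m6i.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0      # placeholder
  Ssh:
    KeyName: eda-admin                      # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: mem-opt                  # large-memory simulations
    ComputeResources:
    - Name: x2iedn24
      InstanceType: x2iedn.24xlarge
      MinCount: 0
      MaxCount: 16
    Networking:
      SubnetIds: [subnet-0123456789abcdef0]
  - Name: general                  # thousands of small regression jobs
    ComputeResources:
    - Name: m7i2xl
      InstanceType: m7i.2xlarge
      MinCount: 0
      MaxCount: 500
    Networking:
      SubnetIds: [subnet-0123456789abcdef0]
SharedStorage:
- MountDir: /proj                  # shared project storage on FSx for OpenZFS
  Name: proj-fsx
  StorageType: FsxOpenZfs
  FsxOpenZfsSettings:
    VolumeId: fsvol-0123456789abcdef0       # placeholder
"""

if __name__ == "__main__":
    # Write the config and hand it to the ParallelCluster CLI.
    pathlib.Path("cluster.yaml").write_text(CLUSTER_CONFIG)
    subprocess.run(
        ["pcluster", "create-cluster",
         "--cluster-name", "eda-cluster",
         "--cluster-configuration", "cluster.yaml"],
        check=True,
    )
```

In a setup like this, the RES desktops would mount the same /proj volume, so results written by cluster jobs are immediately visible on the engineers’ workstations.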
RES, Slurm, and ParallelCluster – better together
From Lightmatter’s perspective, RES and Slurm with ParallelCluster are a long way ahead of other cloud services for HPC.
- RES provides an interface for self-service virtual Linux workstations running custom operating system images on a wide range of hardware options.
- ParallelCluster with Slurm gives Lightmatter’s engineers the ability to launch a large number of parallel jobs, like simulations, on remote nodes with the same custom operating system images used in the RES environment.
Together, they allow a seamless workflow from small-scale local testing to large-scale testing on the cluster, without slowing down an engineer’s personal workstation.
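To make that workflow concrete, here’s a minimal sketch of how a regression run could be submitted from a desktop to Slurm as right-sized job arrays. It wraps the standard `sbatch` command in Python; the partition names, test suites, wrapper script, and resource figures are illustrative assumptions, not Lightmatter’s actual setup, and each partition is assumed to map to a ParallelCluster queue backed by a matching EC2 instance family.

```python
#!/usr/bin/env python3
"""Submit a nightly regression to Slurm as right-sized job arrays.

A minimal sketch: the partition names, suite definitions, wrapper script, and
resource figures are illustrative assumptions, not Lightmatter's actual setup.
"""
import subprocess

# Hypothetical test suites mapped to right-sized Slurm partitions.
SUITES = {
    "unit_sims":  {"partition": "general",     "cpus": 4,  "mem": "16G",   "count": 2000},
    "gate_level": {"partition": "compute-opt", "cpus": 16, "mem": "64G",   "count": 500},
    "full_chip":  {"partition": "mem-opt",     "cpus": 96, "mem": "2900G", "count": 8},
}


def submit(suite: str, cfg: dict) -> str:
    """Submit one job array and return its Slurm job ID."""
    cmd = [
        "sbatch",
        "--parsable",                     # print only the job ID
        f"--job-name=regress-{suite}",
        f"--partition={cfg['partition']}",
        f"--cpus-per-task={cfg['cpus']}",
        f"--mem={cfg['mem']}",            # leave headroom below the node's RAM
        f"--array=1-{cfg['count']}%200",  # cap concurrent array tasks at 200
        "run_test.sh", suite,             # hypothetical per-test batch script
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    for suite, cfg in SUITES.items():
        print(f"{suite}: submitted as Slurm job {submit(suite, cfg)}")
```

Because each ParallelCluster queue can scale from zero, the thousands of small tasks and the handful of 3 TB nodes only run (and cost money) while the arrays are active.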
RES provides Lightmatter’s engineers and researchers easy access to cloud resources without expecting them to be cloud experts or even to have their own AWS accounts. The company’s infrastructure team set up the RES web portal and tied it into their existing identity provider for secure web client access.
Now engineers can log in to their virtual desktops seamlessly using the same credentials they use everywhere else, and start designing, simulating, and evaluating. The remote desktops are powered by Amazon DCV, a high-performance, low-latency remote-display protocol that enables customers to visualize data without moving it over the wire. DCV also provides highly secure infrastructure and improved streaming performance for visualization workloads. A local-like experience is critical for demanding 3D graphical applications.
Lightmatter takes advantage of RES projects to logically assign access to data and compute resources by sub-team. RES allows the company to provision a wide variety of instance types depending on the workload requirements: synthesis and place-and-route workloads require very large systems, such as x2idn.32xlarge nodes, while other simulations may require thousands of smaller systems, like the m7i.2xlarge.
Lightmatter’s engineering teams have grown rapidly, and new engineers can now get up to speed on the company’s infrastructure with very little instruction. Once they have access to the system, they have the flexibility to launch their own AWS virtual desktops with all the tools and software preinstalled.
Lightmatter’s Jenkins scheduler uses Slurm with ParallelCluster to launch thousands of verification simulations and, where appropriate, coverage merges each night for multiple projects. The company also uses the same Jenkins-to-Slurm-to-ParallelCluster infrastructure to launch its physical design flows.
Jenkins can detect design or flow changes and automatically launch the physical design flow. Results from this flow are sent to a dashboard and to a Slack channel, which gives the design team very fast feedback on any issues they might have introduced. Even a code change outside of working hours is taken through the flow and analyzed automatically, giving the fastest possible feedback.
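As an illustration of that feedback loop, the sketch below shows the kind of step a Jenkins pipeline could run once a nightly Slurm job array finishes: it summarizes pass/fail counts from Slurm accounting (`sacct`) and posts them to a Slack incoming webhook. The webhook URL, job ID handling, and message format are assumptions for illustration, not Lightmatter’s actual pipeline.

```python
#!/usr/bin/env python3
"""Summarize a nightly Slurm regression and post the result to Slack.

A minimal sketch of a post-regression step a Jenkins pipeline could run; the
webhook URL, job ID handling, and message format are illustrative.
"""
import json
import subprocess
import sys
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def job_states(job_id: str) -> dict:
    """Return {state: count} for all tasks of a Slurm job or job array."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-n", "-P", "--format=JobID,State"],
        check=True, capture_output=True, text=True,
    ).stdout
    counts: dict = {}
    for line in out.splitlines():
        jobid, state = line.split("|")[:2]
        if "." in jobid:            # skip .batch / .extern steps
            continue
        state = state.split()[0]    # "CANCELLED by 123" -> "CANCELLED"
        counts[state] = counts.get(state, 0) + 1
    return counts


def post_to_slack(text: str) -> None:
    """Send a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


if __name__ == "__main__":
    job_id = sys.argv[1]            # Slurm job ID captured at submission time
    counts = job_states(job_id)
    failed = sum(n for s, n in counts.items() if s not in ("COMPLETED", "RUNNING", "PENDING"))
    summary = f"Nightly regression {job_id}: {counts.get('COMPLETED', 0)} passed, {failed} failed/other"
    post_to_slack(summary)
    print(summary)
```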
The ability to provide fast feedback and to provision large systems on demand are the two most valuable features of Lightmatter’s AWS-based infrastructure. They’re key to achieving the speed the development teams require.
Amazon FSx for OpenZFS
Amazon FSx for OpenZFS is well suited for EDA workloads and complements HPC environments thanks to its high throughput and low latency, as well as its ability to handle large numbers of small files efficiently. The service delivers up to a million IOPS with latencies as low as 100-200 microseconds, ideal for latency-sensitive workloads, and provides up to 10 GB/s of throughput. FSx for OpenZFS also offers snapshot and data cloning capabilities that are useful for EDA workflows. Snapshots are vital for engineering workspaces and have already saved data on numerous occasions when Lightmatter’s engineers accidentally deleted a file or reverted a workspace before submitting it.
FSx for OpenZFS enabled Lightmatter to migrate its on-premises EDA workloads to AWS without modifying applications or data management processes, while providing the performance and features these demanding workloads require. Its fully managed nature eliminates the administrative overhead of patching, backups, and hardware provisioning. To EDA applications, FSx for OpenZFS looks like a standard NFS file system, which is vital because many of the tools we use were architected before managed storage options like Amazon FSx existed and expect a traditional shared file system.
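To illustrate the snapshot workflow mentioned above, here’s a minimal sketch that takes an on-demand snapshot of an FSx for OpenZFS volume with boto3; the volume ID, Region, and naming scheme are placeholders rather than Lightmatter’s setup, and in practice snapshots would typically run on a schedule.

```python
#!/usr/bin/env python3
"""Take an on-demand snapshot of an FSx for OpenZFS volume.

A minimal boto3 sketch; the volume ID, Region, and naming scheme are
placeholders, not Lightmatter's configuration.
"""
import datetime
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

volume_id = "fsvol-0123456789abcdef0"   # placeholder: the workspace volume
stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d-%H%M%S")

# Create a point-in-time snapshot of the workspace volume.
response = fsx.create_snapshot(
    Name=f"workspace-{stamp}",
    VolumeId=volume_id,
)
print(f"Created snapshot {response['Snapshot']['SnapshotId']} of {volume_id}")
```

Once a snapshot exists, individual files can be recovered from the read-only .zfs/snapshot directory on the NFS mount, which is how an accidentally deleted file can be restored without involving the infrastructure team.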
Lightmatter’s infrastructure has not had any downtime in its first six months, has met all of the company’s requirements, and has even been put to completely unexpected uses. For example, one engineer had an urgent need to create a video rendering of a system so they could show their peers how a new idea would look. Having access to large amounts of compute and memory within minutes saved them hours of rendering time, and they were able to do it all on their own, without involving the infrastructure team. In fact, the infrastructure team only found out after the fact, when they saw the video the engineering team had created.
Conclusion and next steps
The combination of ParallelCluster and RES has made it simple and secure for Lightmatter to deploy and manage its EDA tools and simulation workloads. The RES desktops are part of the Slurm cluster, so engineers can seamlessly run jobs on the Slurm queues. Furthermore, with the production-ready releases of RES, Slurm, and ParallelCluster, Lightmatter believes any organization could set up a similar environment in just a few days.
With critical workloads running all of the time, sometimes over multiple days, it can be hard to find a window to perform maintenance and upgrades on the system. So, in addition to the current deployment in the AWS Northern Virginia Region (us-east-1), the company is bringing up a secondary deployment in Oregon (us-west-2) to let it test upgrades, provide a disaster recovery standby, and bring the EDA tools closer to engineers on the West Coast.