AWS HPC Blog
Securing HPC on AWS: implementing STIGs in AWS ParallelCluster
Today, we’ll discuss cloud-native methods that HPC customers can use to accelerate their process for creating Amazon Elastic Compute Cloud (Amazon EC2) images for AWS ParallelCluster that are compliant with Security Technical Implementation Guides (STIGs), a set of standards maintained by the US government.
In this post, we’ll walk you through the process of applying STIGs to your ParallelCluster environment, help you identify the decisions you need to make on the way, and show you some of the tools you can use to make it all easier.
What’s a STIG?
STIGs are security standards maintained by the Defense Information Systems Agency (DISA), a US government organization, and they can be applied to different environments, like Amazon EC2. Think of STIGs as a checklist of items to apply to your EC2 instances, where each item has a severity level that says, “the risk of not doing x is low, medium, or high”. You’ll also see these severity levels referred to as Category Codes (CAT), where CAT 1 corresponds to a high security risk, CAT 2 to medium, and CAT 3 to low.
For example, one high-risk STIG checklist item for Red Hat Enterprise Linux (RHEL) 8 is to not allow accounts configured with blank or null passwords. To resolve this, an administrator can log in to the operating system and manually configure each account so that it does not have a blank or null password. With hundreds of checklist items, it is easy to see how this quickly becomes a burdensome task. The process described in this post automates up to 87% of the otherwise manual STIG remediation process.
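The category-to-severity mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any AWS or DISA tool: the function names and the `(rule_id, cat, passed)` tuple shape are assumptions made up for this example.

```python
from collections import Counter

# Mapping stated in this post: CAT 1 = high, CAT 2 = medium, CAT 3 = low.
CAT_TO_SEVERITY = {1: "high", 2: "medium", 3: "low"}


def severity_for(cat: int) -> str:
    """Translate a STIG Category Code (1, 2, or 3) into its severity label."""
    try:
        return CAT_TO_SEVERITY[cat]
    except KeyError:
        raise ValueError(f"unknown category code: {cat!r}") from None


def count_open_by_severity(findings):
    """Count checklist items that are still failing, grouped by severity.

    `findings` is an iterable of (rule_id, cat, passed) tuples. This shape
    is hypothetical and chosen only for the example.
    """
    return Counter(severity_for(cat) for _, cat, passed in findings if not passed)
```

A summary like this makes it easy to see at a glance how many high-severity items remain open after a remediation pass.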
Why do customers want to implement STIGs?
In short, some want to, and some need to. Customers such as the U.S. Department of Defense (DoD) must adhere to stringent compliance standards for operating system hardening. Other customers may prefer to use STIGs as a benchmark to improve their security posture.
Customers like the DoD often operate in AWS without any access to the Internet. Organizational policy usually dictates this, typically to reduce the risk of sensitive data going places it shouldn’t. We address how customers with these network restrictions can accelerate STIG hardening using AWS cloud-native tools.
Once you’ve “STIG’d” your ParallelCluster instances, how can you verify which checklist items you have crossed off? This is where OpenSCAP, an open-source security and compliance tool, comes into play. OpenSCAP automates continuous monitoring, vulnerability management, and reporting of security policy compliance data. While OpenSCAP is primarily designed to align with DoD security standards, it’s used to establish security baselines across many industries.
This post focuses on three supported ParallelCluster operating systems (OS): RHEL8, Amazon Linux 2 (AL2), and Ubuntu 20.04. At the time of writing, the DISA STIG document library didn’t contain a benchmark for Ubuntu 22.04, which is why it isn’t covered here.
We worked through the process defined in this post using the AWS GovCloud West region, but you should be able to repeat it in other AWS regions.
For HPC customers completely new to AWS, we recommend reviewing this blog post, which covers best practices for setting up a foundation in AWS on which to build your HPC workloads.
AMIs for AWS ParallelCluster
An Amazon Machine Image (AMI) is a template that contains a software configuration (for example, an OS, an application server, and applications). From an AMI, you launch an EC2 instance, which is a copy of the AMI running as a virtual server in the cloud. AMIs used for ParallelCluster are unique because they have software installed on them necessary for operating the cluster management tool.
Customers can create custom AMIs for ParallelCluster using two methods. Either one can be used to achieve STIG compliance, depending on factors like Internet connectivity and OS choice.
The first option is the build image configuration process, which you can trigger from a ParallelCluster CLI command: pcluster build-image. This process uses Amazon EC2 Image Builder to launch a build instance, apply the ParallelCluster cookbook, install the ParallelCluster software stack, and perform other necessary configuration tasks.
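As a rough sketch of what triggering the build image process involves, the Python below renders a minimal build image configuration and assembles the matching CLI invocation. The helper names, the AMI ID, the image ID, and the file name are placeholders invented for this example; only the `Build` keys and the `pcluster build-image` flags come from the ParallelCluster CLI. Check the ParallelCluster documentation for the full set of configuration options.

```python
import shlex


def build_image_config(parent_image: str, instance_type: str) -> str:
    """Render a minimal build image configuration as YAML text.

    Only the two Build keys are shown; a real configuration may need more.
    """
    return (
        "Build:\n"
        f"  InstanceType: {instance_type}\n"
        f"  ParentImage: {parent_image}\n"
    )


def build_image_command(image_id: str, config_path: str, region: str) -> list[str]:
    """Assemble the argv for `pcluster build-image` (nothing is executed here)."""
    return [
        "pcluster", "build-image",
        "--image-id", image_id,
        "--image-configuration", config_path,
        "--region", region,
    ]


if __name__ == "__main__":
    # Placeholder values: substitute your own golden image AMI and names.
    print(build_image_config("ami-0123456789abcdef0", "c5.xlarge"))
    print(shlex.join(build_image_command("stig-rhel8", "image-config.yaml", "us-gov-west-1")))
```

Keeping the command as an argument list, rather than a single string, avoids quoting problems if you later pass it to `subprocess.run`.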
The second option involves taking a baseline ParallelCluster AMI (one produced by the ParallelCluster team themselves) and customizing it by performing manual modifications through AWS Systems Manager (SSM).
Process comparison
Should you take a baseline ParallelCluster image and then apply STIGs, or take an image that already has STIGs applied (a “golden image”), and then install ParallelCluster on top? The end result is fundamentally similar, but there are some trade-offs depending on which route you choose.
The benefit of applying STIGs after a ParallelCluster image is created is that you can minimize permissions attached to the EC2 instance’s role. There are additional AWS Identity and Access Management (IAM) permissions required to trigger the build image process and you can find them in our documentation. The tradeoff you’re making is that you would be standing up a new image build pipeline to accommodate security policy (STIG) enforcement starting from a baseline ParallelCluster image.
An advantage of taking a golden image and installing ParallelCluster is that you can maintain an already established image build pipeline, which may accelerate internal compliance processes. However, this requires a wider permissions boundary than the previous approach. There’s also a chance that installing new software could affect how STIG compliant your images are. If you’re interested in trying this process on your own AMIs, you can follow along with any of the sections below, depending on your Internet connectivity and operating system requirements, because the process is the same. In either case, we recommend performing compliance scans on your images.
Accelerating RHEL8, AL2, and Ubuntu 20.04 STIG compliance
Apart from the OS your use cases require, the process to achieve STIG compliance is determined by whether your Amazon EC2 instances have Internet connectivity or not. If your compliance requirements allow you the flexibility to choose, then it’s easier with Internet connectivity.
For users with Internet connectivity who want to use RHEL8 or AL2 operating systems, refer to the instructions in our GitHub repo that’s part of the HPC Recipes Library which will guide you through the EC2 Image Builder process.
For users without Internet connectivity who want to use RHEL8 or AL2 operating systems, refer to these instructions in the same repository. This type of connectivity scenario is perhaps more common amongst customers with STIG requirements. These customers can take advantage of AWS PrivateLink which is a feature of Virtual Private Cloud (VPC) and allows for private connectivity to AWS services. To take advantage of this technology for purposes of accelerating STIG compliance, ensure that you configure the required VPC endpoints to allow connectivity from your private subnet to SSM. You’ll also need the required VPC endpoints for ParallelCluster which will be used to launch your cluster with the resulting AMI.
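To make the SSM endpoint requirement concrete, here is a small Python sketch that lists the interface endpoint service names Systems Manager typically needs in a private subnet and reports which ones are missing. The function names are hypothetical, and the list covers SSM only. The additional endpoints ParallelCluster needs are not included here, so consult the ParallelCluster documentation for those.

```python
# Interface endpoints commonly required for Systems Manager connectivity
# from a subnet with no Internet access.
SSM_ENDPOINT_SUFFIXES = ("ssm", "ssmmessages", "ec2messages")


def ssm_endpoint_service_names(region: str) -> list[str]:
    """Return the VPC endpoint service names for SSM in the given region."""
    return [f"com.amazonaws.{region}.{suffix}" for suffix in SSM_ENDPOINT_SUFFIXES]


def missing_endpoints(region: str, existing: set[str]) -> list[str]:
    """Return required SSM service names that are not in `existing`.

    `existing` is the set of service names already configured in your VPC,
    gathered however you prefer (console, CLI, or infrastructure as code).
    """
    return [name for name in ssm_endpoint_service_names(region) if name not in existing]


if __name__ == "__main__":
    configured = {"com.amazonaws.us-gov-west-1.ssm"}
    for name in missing_endpoints("us-gov-west-1", configured):
        print("missing:", name)
```

Running a check like this before starting the hardening process can save a failed run caused by an instance that never registers with Systems Manager.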
The process for Ubuntu 20.04 includes an extra step compared to RHEL8 and AL2 because there are a few findings that Systems Manager cannot rectify during its run command. To handle these, we launch a baseline ParallelCluster Ubuntu 20.04 EC2 instance with a user data script that resolves findings V-219166, V-238237, and V-238218. As with RHEL8 and AL2, customers without Internet connectivity should configure the required VPC endpoints to allow connectivity from their private subnet to SSM, along with the required VPC endpoints for ParallelCluster. Instructions for Ubuntu 20.04 can be found in our HPC samples repository.
As previously mentioned, there are corresponding severity levels (high, medium, low) associated with STIG checklist items. Customers can choose which severity level they want to apply to their Amazon EC2 instances, as described in our SSM documentation. We used the STIG High baseline, which includes any vulnerability that can result in loss of confidentiality, availability, or integrity. Customers can make additional modifications to the AMIs after the STIG process of their choosing has been performed. In any event, we recommend testing AMI compatibility with your application prior to deploying to a production environment.
Results
Customers may be interested in the effect that running the EC2 Image Builder STIG High component or the Systems Manager STIG High document has on their respective operating systems.
We used OpenSCAP to perform compliance scanning and assess the security posture of our instances. OpenSCAP uses the concept of profiles to determine which checks it will run, and the profile can vary depending on mission requirements and OS.
For the purposes of maintaining a consistent benchmark for before and after assessments, we used the xccdf_mil.disa.stig_profile_MAC-2_Sensitive profile for RHEL8 and Ubuntu 20.04 operating systems, and stig-rhel7-disa on AL2.
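For readers unfamiliar with how a profile is passed to OpenSCAP, the sketch below assembles an `oscap xccdf eval` invocation for the profiles named above. The helper name, the benchmark file path, and the output file names are assumptions made for this example. The scripts in the repository may structure the call differently.

```python
# Profiles used in this post, keyed by operating system.
PROFILES = {
    "rhel8": "xccdf_mil.disa.stig_profile_MAC-2_Sensitive",
    "ubuntu2004": "xccdf_mil.disa.stig_profile_MAC-2_Sensitive",
    "al2": "stig-rhel7-disa",
}


def oscap_eval_command(
    os_name: str,
    benchmark_path: str,
    results_path: str = "results.xml",
    report_path: str = "report.html",
) -> list[str]:
    """Build the argv for an OpenSCAP XCCDF evaluation (nothing is executed).

    `benchmark_path` points at the XCCDF or SCAP data stream file for the
    OS being scanned; the path is supplied by the caller.
    """
    return [
        "oscap", "xccdf", "eval",
        "--profile", PROFILES[os_name],
        "--results", results_path,  # machine-readable XML results
        "--report", report_path,    # human-readable HTML report
        benchmark_path,
    ]
```

Using the same profile for the before and after scans is what keeps the two reports comparable.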
Each of the ‘Baseline’ AMIs in the screenshots that follow refers to the baseline ParallelCluster AMI. In other words, these are the AMIs you would find by typing the CLI command: pcluster list-official-images. Note that the baseline and subsequent STIG High AMI results may change in future ParallelCluster releases.
Running your own OpenSCAP scans
If you want to apply additional STIGs to ParallelCluster AMIs, you may want to run those images through the same OpenSCAP profiles used for this blog post.
We’ve stored the scripts for RHEL8, AL2, and Ubuntu 20.04 in our GitHub repo. These scripts do require Internet connectivity to run because they download a series of tools like the AWS CLI and OpenSCAP, and STIG benchmarks to the EC2 instance being evaluated.
You’ll need to create an S3 bucket, and update the name of the bucket inside the script where it saves the results of the evaluations. The scripts use EC2 instance metadata to dynamically name the output files in Amazon S3 after the instance, so they’re not overwritten as new instances are tested.
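The naming approach described above can be illustrated with the following Python sketch. It retrieves the instance ID through the Instance Metadata Service (IMDSv2) and derives a per-instance object key. The key layout and function names are assumptions for this example and are not taken from the repository scripts, which may use different tooling.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def fetch_instance_id(timeout: float = 2.0) -> str:
    """Fetch this instance's ID using IMDSv2 (only works on an EC2 instance)."""
    # Step 1: request a short-lived session token.
    token_request = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    # Step 2: use the token to read the instance ID.
    id_request = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(id_request, timeout=timeout) as response:
        return response.read().decode()


def report_key(instance_id: str, os_name: str, prefix: str = "openscap") -> str:
    """Build an S3 object key that is unique per instance (hypothetical layout)."""
    return f"{prefix}/{os_name}/{instance_id}-report.html"
```

Because the instance ID is part of the key, scanning a second instance writes a new object instead of replacing the first report.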
To run these scripts with minimal effort, you can run them as a user-data script upon launch and have the HTML results automatically sent to your S3 bucket. Inputting a user-data script follows the same logic as described under step 3 of the Ubuntu 20.04 section. For RHEL8 and Ubuntu 20.04 operating systems, it takes approximately 10 minutes from instance launch to see the results uploaded to your Amazon S3 bucket. AL2 takes approximately 20-25 minutes.
Using the resulting images
The STIG’d AMIs can be found in the EC2 section of the Management Console and referenced in a ParallelCluster configuration file. You can create clusters using the ParallelCluster CLI or the UI. For purposes of this post, we’ll show an example of placing the STIG’d AMI ID into the ParallelCluster configuration file for a cluster in GovCloud West.
Region: us-gov-west-1
Image:
  Os: rhel8
HeadNode:
  InstanceType: c5a.4xlarge
  Networking:
    SubnetId: {your-subnet-id}
  Ssh:
    KeyName: {your-keypair}
  Image:
    CustomAmi: {your-AMI-id}
SharedStorage:
  - MountDir: /fsx
    Name: FSxExtData
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
      DeploymentType: PERSISTENT_1
      PerUnitStorageThroughput: 50
      DeletionPolicy: Delete
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute
          Instances:
            - InstanceType: hpc7a.48xlarge
          MinCount: 1
          MaxCount: 10
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - {your-subnet-id}
        PlacementGroup:
          Enabled: true
Replace each item enclosed in {} with your own identifiers.
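Before launching, it can help to confirm that no placeholder was left behind in the configuration file. The following Python sketch scans the file text for any remaining `{your-...}` tokens. The function name and the `cluster-config.yml` file name are hypothetical, and the pattern assumes your placeholders follow the naming used in the example above.

```python
import re

# Matches placeholders such as {your-subnet-id} or {your-AMI-id}.
PLACEHOLDER = re.compile(r"\{your-[^{}]*\}")


def unresolved_placeholders(config_text: str) -> list[str]:
    """Return the distinct placeholders still present, sorted for stable output."""
    return sorted(set(PLACEHOLDER.findall(config_text)))


if __name__ == "__main__":
    # Hypothetical file name: point this at your own configuration file.
    with open("cluster-config.yml", encoding="utf-8") as handle:
        leftovers = unresolved_placeholders(handle.read())
    if leftovers:
        raise SystemExit(f"Replace these placeholders first: {', '.join(leftovers)}")
    print("No placeholders remaining.")
```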
Once you’ve created the file you can launch the cluster using the command:
pcluster create-cluster --cluster-name <name> --cluster-configuration <file-name>.yml
You should see validation warning messages because you are using a custom AMI. These messages can be ignored and will not affect the creation of the cluster. You can track the cluster creation status through the AWS CloudFormation console or by using the ParallelCluster CLI command:
pcluster list-clusters
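If you would rather wait on a single cluster than list all of them, one option is to poll `pcluster describe-cluster`, which prints JSON that includes a `clusterStatus` field. The sketch below takes that approach. It is an alternative to the command shown above, and the function names and the 30-second polling interval are arbitrary choices for this example.

```python
import json
import subprocess
import time

# Creation is finished once the cluster reaches one of these states.
TERMINAL_STATUSES = {"CREATE_COMPLETE", "CREATE_FAILED"}


def cluster_status(describe_output: str) -> str:
    """Extract the status from the JSON printed by `pcluster describe-cluster`."""
    return json.loads(describe_output)["clusterStatus"]


def wait_for_cluster(cluster_name: str, poll_seconds: int = 30) -> str:
    """Poll until cluster creation succeeds or fails, then return the status."""
    while True:
        completed = subprocess.run(
            ["pcluster", "describe-cluster", "--cluster-name", cluster_name],
            capture_output=True,
            text=True,
            check=True,
        )
        status = cluster_status(completed.stdout)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
```

If the returned status is `CREATE_FAILED`, the CloudFormation console events for the stack are usually the quickest place to find the cause.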
Conclusion
Verifying levels of compliance for compute resources is a requirement in some industries, and desired in others. Throughout this post, we’ve discussed several different cloud-native methods HPC customers with compliance requirements can choose from to accelerate their STIG process in AWS ParallelCluster depending on their Internet connectivity (or lack thereof) and operating system choice.
We recommend validating that your output images work with your application in a development environment prior to running in production.