AWS Big Data Blog
Build, deploy, and run Spark jobs on Amazon EMR with the open-source EMR CLI tool
Today, we’re pleased to introduce the Amazon EMR CLI, a new command line tool to package and deploy PySpark projects across different Amazon EMR environments. With the introduction of the EMR CLI, you now have a simple way to not only deploy a wide range of PySpark projects to remote EMR environments, but also integrate with your CI/CD solution of choice.
In this post, we show how you can use the EMR CLI to create a new PySpark project from scratch and deploy it to Amazon EMR Serverless in one command.
Overview of solution
The EMR CLI is an open-source tool to help improve the developer experience of developing and deploying jobs on Amazon EMR. When you’re just getting started with Apache Spark, there are a variety of options with respect to how to package, deploy, and run jobs that can be overwhelming or require deep domain expertise. The EMR CLI provides simple commands for these actions that remove the guesswork from deploying Spark jobs. You can use it to create new projects or alongside existing PySpark projects.
In this post, we walk through creating a new PySpark project that analyzes weather data from the NOAA Global Surface Summary of Day open dataset. We’ll use the EMR CLI to do the following:
- Initialize the project.
- Package the dependencies.
- Deploy the code and dependencies to Amazon Simple Storage Service (Amazon S3).
- Run the job on EMR Serverless.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An EMR Serverless application in the us-east-1 Region
- An S3 bucket for your code and logs in the us-east-1 Region
- An AWS Identity and Access Management (IAM) job role that can run EMR Serverless jobs and access S3 buckets
- Python version >= 3.7
- Docker
If you don’t already have an existing EMR Serverless application, you can use the following AWS CloudFormation template or use the emr bootstrap
command after you’ve installed the CLI.
Install the EMR CLI
You can find the source for the EMR CLI in the GitHub repo, but it’s also distributed via PyPI. It requires Python version >= 3.7 to run and is tested on macOS, Linux, and Windows. To install the latest version, use the following command:
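The CLI is published to PyPI as emr-cli, so a standard pip install works (using a virtual environment is optional but keeps the tool isolated):

```bash
# Optional: create and activate a virtual environment for the CLI
python3 -m venv .venv && source .venv/bin/activate

# Install the latest EMR CLI release from PyPI
pip install emr-cli
```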
You should now be able to run the emr --help
command and see the different subcommands you can use.
If you didn’t already create an EMR Serverless application, the bootstrap
command can create a sample environment for you and a configuration file with the relevant settings. Assuming you used the provided CloudFormation stack, set the following environment variables using the information on the Outputs tab of your stack. Set the Region in the terminal to us-east-1
and set a few other environment variables we’ll need along the way:
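A sketch of those exports, with placeholder values standing in for your own stack outputs:

```bash
# Region must match the NOAA GSOD dataset bucket
export AWS_REGION=us-east-1

# Values come from the Outputs tab of your CloudFormation stack (placeholders shown)
export APPLICATION_ID=00f1abcdexample
export JOB_ROLE_ARN=arn:aws:iam::123456789012:role/emr-serverless-job-role
export S3_BUCKET=amzn-s3-demo-bucket
```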
We use us-east-1
because that’s where the NOAA GSOD data bucket is. EMR Serverless can access S3 buckets and other AWS resources in the same Region by default. To access other services, configure EMR Serverless with VPC access.
Initialize a project
Next, we use the emr init
command to initialize a default PySpark project for us in the provided directory. The default templates create a standard Python project that uses pyproject.toml
to define its dependencies. In this case, we use Pandas and PyArrow in our script, so those are already pre-populated.
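For example, to scaffold a project into a new my-project directory (any directory name works):

```bash
# Create a default PySpark project in the my-project directory
emr init my-project
```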
After the project is initialized, you can run cd my-project
or open the my-project
directory in your code editor of choice. You should see the following set of files:
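The exact layout can vary with the CLI version; it looks something like the following, where the entry point and job module names are illustrative:

```
my-project
├── Dockerfile
├── entrypoint.py
├── jobs
│   └── extreme_weather.py
└── pyproject.toml
```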
Note that we also have a Dockerfile here. This is used by the package
command to ensure that our project dependencies are built on the right architecture and operating system for Amazon EMR.
If you use Poetry to manage your Python dependencies, you can also add a --project-type poetry
flag to the emr init
command to create a Poetry project.
If you already have an existing PySpark project, you can use emr init --dockerfile
to create the Dockerfile necessary to package things up.
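For reference, the two variants look something like this (assuming emr init accepts the target directory as its argument):

```bash
# Scaffold a brand-new Poetry-based project
emr init --project-type poetry my-poetry-project

# Add a Dockerfile to an existing PySpark project, run from the project root
emr init --dockerfile .
```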
Run the project
Now that we’ve got our sample project created, we need to package our dependencies, deploy the code to Amazon S3, and start a job on EMR Serverless. With the EMR CLI, you can do all of that in one command. Make sure to run the command from the my-project
directory:
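A sketch of that single command, assuming the environment variables exported earlier and the entrypoint.py created by emr init (the --wait flag name reflects the wait-for-completion behavior described below and may vary by CLI version):

```bash
emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/code/ \
  --build \
  --wait
```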
This command performs several actions:
- Auto-detects the type of Spark project in the current directory.
- Initiates a build for your project to package up dependencies.
- Copies your entry point and resulting build files to Amazon S3.
- Starts an EMR Serverless job.
- Waits for the job to finish, exiting with an error status if it fails.
You should see the job's status output in your terminal as it runs on EMR Serverless.
And that’s it! If you want to run the same code on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), you can replace --application-id
with --cluster-id j-11111111
. The CLI will take care of sending the right spark-submit
commands to your EMR cluster.
Now let’s walk through some of the other commands.
emr package
PySpark projects can be packaged in numerous ways, from a single .py file to a complex Poetry project with various dependencies. The EMR CLI can help consistently package your projects without having to worry about the details.
For example, if you have a single .py file in your project directory, the package
command doesn’t need to do anything. If, however, you have multiple .py files in a typical Python project style, the emr package
command will zip these files up as a package that can later be uploaded to Amazon S3 and provided to your PySpark job using the --py-files
option. If you have third-party dependencies defined in pyproject.toml, emr package will create a virtual environment archive and start your EMR job with the spark.archives option.
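For example, a typical invocation only needs the entry point; the packaging strategy is detected from the project layout (a sketch, assuming the entrypoint.py produced by emr init):

```bash
# Build the project and its dependencies into local artifacts ready to deploy
emr package --entry-point entrypoint.py
```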
The EMR CLI also supports Poetry for dependency management and packaging. If you have a Poetry project with a corresponding poetry.lock
file, there’s nothing else you need to do. The emr package
command will detect your poetry.lock
file and automatically build the project using the Poetry Bundle plugin. You can use a Poetry project in two ways:
- Create a project using the emr init command. The command takes a --project-type poetry option that creates a Poetry project for you.
- If you have a pre-existing project, you can use the emr init --dockerfile option, which creates a Dockerfile that is automatically used when you run emr package.
Finally, as noted earlier, the EMR CLI provides you a default Dockerfile based on Amazon Linux 2 that you can use to reliably build package artifacts that are compatible with different EMR environments.
emr deploy
The emr deploy
command takes care of copying the necessary artifacts for your project to Amazon S3, so you don’t have to worry about it. Regardless of how the project is packaged, emr deploy
will copy the resulting files to your Amazon S3 location of choice.
One use case for this is with CI/CD pipelines. Sometimes you want to deploy a specific version of code to Amazon S3 to be used in your data pipelines. With emr deploy
, this is as simple as changing the --s3-code-uri
parameter.
For example, let’s assume you’ve already packaged your project using the emr package
command. Most CI/CD pipelines allow you to access the git tag. You can use that as part of the emr deploy
command to deploy a new version of your artifacts. In GitHub Actions, this is github.ref_name
, and you can use this in an action to deploy a versioned artifact to Amazon S3. See the following code:
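A minimal sketch of such a step inside a GitHub Actions run command, after emr package has already produced the artifacts (the bucket name is a placeholder):

```bash
# ${{ github.ref_name }} resolves to the tag or branch that triggered the workflow
emr deploy \
  --entry-point entrypoint.py \
  --s3-code-uri s3://amzn-s3-demo-bucket/code/${{ github.ref_name }}/
```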
In your downstream jobs, you could then update the location of your entry point files to point to this new location when you’re ready, or you can use the emr run
command discussed in the next section.
emr run
Let’s take a quick look at the emr run
command. We’ve used it before to package, deploy, and run in one command, but you can also use it to run on already-deployed artifacts. Let’s look at the specific options:
If you want to run your code on EMR Serverless, the emr run command takes the --application-id and --job-role parameters. If you want to run on EMR on EC2, you only need the --cluster-id option.
Both targets require --entry-point and --s3-code-uri. --entry-point is the main script that Amazon EMR calls. If you have any dependencies, --s3-code-uri is where they get uploaded by the emr deploy command, and the EMR CLI builds the relevant spark-submit properties pointing to these artifacts.
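For example, to run artifacts that were already packaged and deployed to a versioned prefix, you can skip --build and point at that Amazon S3 location (a sketch; the version path is illustrative):

```bash
emr run \
  --entry-point entrypoint.py \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/code/v1.2.0/
```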
There are a few different ways to customize the job:
- --job-name – Allows you to specify the job or step name
- --job-args – Allows you to provide command line arguments to your script
- --spark-submit-opts – Allows you to add additional spark-submit options like --conf spark.jars or others
- --show-stdout – Currently only works with single-file .py jobs on EMR on EC2, but will display stdout in your terminal after the job is complete
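A sketch combining a few of these options on an EMR on EC2 cluster, where a single-file job lets --show-stdout stream the script's output (values are illustrative):

```bash
emr run \
  --entry-point entrypoint.py \
  --cluster-id j-11111111 \
  --s3-code-uri s3://${S3_BUCKET}/code/ \
  --job-name gsod-extreme-weather \
  --job-args "2022" \
  --spark-submit-opts "--conf spark.executor.memory=4g" \
  --show-stdout
```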
As we’ve seen before, --build
invokes both the package
and deploy
commands. This makes it easier to iterate on local development when your code still needs to run remotely. You can simply use the same emr run
command over and over again to build, deploy, and run your code in your environment of choice.
Future updates
The EMR CLI is under active development. Updates are currently in progress to support Amazon EMR on EKS and allow for the creation of local development environments to make local iteration of Spark jobs even easier. Feel free to contribute to the project in the GitHub repository.
Clean up
To avoid incurring future charges, stop or delete your EMR Serverless application. If you used the CloudFormation template, be sure to delete your stack.
Conclusion
With the release of the EMR CLI, we’ve made it easier for you to deploy and run Spark jobs on EMR Serverless. The utility is available as open source on GitHub. We’re planning a host of new functionalities; if there are specific requests you have, feel free to file an issue or open a pull request!
About the author
Damon is a Principal Developer Advocate on the EMR team at AWS. He's worked with data and analytics pipelines for over 10 years and splits his time between splitting service logs and stacking firewood.