Parallelizing S3 Workloads with s5cmd

This open source project comes from our customer community. It was developed by Peak Games to assist with their own S3 workflow, and includes features such as tab completion and built-in wildcard support for files in S3 commands. Enjoy!

– Deirdré

Background

Until now, working with multiple objects on Amazon S3 from the command line meant either invoking a command once per object or relying on the limited wildcard support that some tools offer. Each invocation costs another fork/exec at the system level, and that overhead adds up when you need to run a few hundred or more operations.

The Tool

s5cmd lets you run multiple operations (with wildcards or not) using a single executable invocation. For example, if you have to delete (or copy) a few million objects, you don’t have to invoke the CLI tool a few million times. By piping the commands into s5cmd, you invoke the tool just once and let it run a few hundred workers to do the given work.

Since s5cmd already has a worker pool, a wildcard operation uses a single worker for the ListObjects call (which can match further wildcards) and lets the other workers do the actual processing. It also supports shell autocompletion for bash and zsh, so if you’d like to use it as a more conventional CLI tool, you can just hit TAB and let it autocomplete options, buckets, and paths/objects for you.
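
For intuition, the one-lister, many-workers arrangement can be imitated in plain shell. This is an analogy only, not how s5cmd is implemented: the hypothetical list_keys function stands in for the single ListObjects worker, xargs -P (available in GNU and BSD xargs) stands in for the worker pool, and echo stands in for a real S3 operation.

```shell
#!/bin/sh
# Analogy only: one producer lists keys, several parallel workers consume them.
# list_keys is a hypothetical stand-in for the single ListObjects worker.
list_keys() {
    for day in 01 02 03 04 05 06 07 08; do
        echo "a/2018/02/${day}/reports.csv.gz"
    done
}

# process_all N: run up to N workers at a time, one key per worker.
# echo stands in for the real per-object operation (get, cp, rm, ...).
process_all() {
    list_keys | xargs -P "$1" -n 1 echo processed
}

process_all 4
```

The output order varies from run to run because the workers finish independently, which is also why per-object operations should not depend on each other.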

Installation

Install s5cmd on Mac OS X:

$ brew tap peakgames/s5cmd https://github.com/peakgames/s5cmd
$ brew install s5cmd

The tool is written in Go. On other platforms, you can compile and install it using:

$ go get -u github.com/peakgames/s5cmd

Set up credentials just as you would for the awscli tool: Use the ~/.aws/credentials file or environment variables, or a combination of both. (If you’re running on EC2, roles are also supported.)
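
As a minimal sketch of the two credential sources mentioned above: the [default] profile layout and the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY variable names are the standard ones used by the AWS CLI, while the helper name write_credentials, the scratch path, and the key values are placeholders invented for this example.

```shell
#!/bin/sh
# write_credentials FILE KEY_ID SECRET
# Writes a [default] profile in the format used by ~/.aws/credentials.
# Hypothetical helper for illustration; the key values below are placeholders.
write_credentials() {
    mkdir -p "$(dirname "$1")"
    (
        umask 077   # keep the file private to the current user
        cat > "$1" <<EOF
[default]
aws_access_key_id = $2
aws_secret_access_key = $3
EOF
    )
}

# Demo: write to a scratch location rather than the real ~/.aws/credentials.
demo_file="${TMPDIR:-/tmp}/s5cmd-demo/credentials"
write_credentials "$demo_file" "AKIAEXAMPLEKEYID" "example-secret-key"
cat "$demo_file"

# Alternative: supply the same placeholder values through the environment.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEYID"
export AWS_SECRET_ACCESS_KEY="example-secret-key"
```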

Usage

Commands are in “command [command options] argument1 [argument2]” format. s5cmd also takes options, which affect all commands run. To get the list of s5cmd options:

$ s5cmd -help

To get a list of available commands, run without arguments:

$ s5cmd

s5cmd in Action

Say we have a bucket named “reports-bkt”, and we have some files inside. First, let’s download one:

$ s5cmd get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz
                     # Downloading reports_19_13716285583145.csv.gz...
2018/03/21 11:46:05 +OK "get s3://reports-bkt/a/2018/03/14/reports_19_13716285583145.csv.gz ./reports_19_13716285583145.csv.gz"

Now, let’s scan all of last month’s CSV reports and match each day’s report with a wildcard:

$ s5cmd du -g -h s3://reports-bkt/a/2018/02/*/reports*csv.gz
                            + 10.7M bytes in 367 objects: s3://reports-bkt/a/2018/02/*/reports*csv.gz [STANDARD]
2018/03/21 11:46:24 +OK "du s3://reports-bkt/a/2018/02/*/reports*csv.gz" (1)

Looks like there are 367 reports for the whole month, taking up about 10.7 MB, all using standard storage. Let’s download them all:

$ s5cmd cp --parents s3://reports-bkt/a/2018/02/*/reports*csv.gz .

With the --parents option, each day of the month is downloaded to its own directory. (This option recreates the directory structure starting from the first wildcard specified.)
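
To make the "starting from the first wildcard" rule concrete, here is a small sketch that approximates the mapping from an object URL to a local path. It is an illustration of the behaviour as described above, not s5cmd's own code; the function name local_path and the February object name are hypothetical.

```shell
#!/bin/sh
# local_path PATTERN OBJECT_URL
# Approximates the --parents rule: keep the part of the object path that
# starts at the path component containing the first wildcard.
local_path() {
    prefix=${1%%\**}       # everything before the first '*'
    prefix=${prefix%/*}/   # trim back to the last '/' before that wildcard
    rest=${2#"$prefix"}    # drop the fixed prefix from the object URL
    printf '%s\n' "$rest"
}

local_path 's3://reports-bkt/a/2018/02/*/reports*csv.gz' \
           's3://reports-bkt/a/2018/02/14/reports_19_1.csv.gz'
# prints: 14/reports_19_1.csv.gz
```

If the first wildcard sits in the file name rather than in a directory name, the same rule leaves only the file name, so no per-day directories would be created.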

These examples are just the tip of the iceberg. You can also feed s5cmd its commands from a file or from another command’s output.

You might have noticed that our bucket has the usual “letter prefix” scheme. Let’s say you want to download all files for a given date, for all prefixes. This structure is something like:

- a/[yyyy]/[mm]/[dd]/object_unique_id.gz
- b/[yyyy]/[mm]/[dd]/object_unique_id.gz
- c/[yyyy]/[mm]/[dd]/object_unique_id.gz
... up to ...
- z/[yyyy]/[mm]/[dd]/object_unique_id.gz

If you have hundreds of days and billions of objects, specifying the wildcard at the first level won’t really work. Since you already know the range (letters a to z), you can generate commands for each of the prefixes. Invoke the tool just once, and let it do the work. Try this:

$ for X in {a..z}; do echo get -n s3://reports-bkt/${X}/2018/03/14/reports*csv.gz; done | s5cmd -f -
2018/03/21 11:48:03 # Stats: Total             379 281 ops/sec 1.350311978s

The for loop generates one “get” command per prefix, and the pipe passes them all to a single s5cmd invocation to do the work. Notice that we’ve used the “-n” (no-clobber) option to prevent overwriting in case the object names are not really unique; we can’t use the --parents option here because the wildcard is in the file name, not in a directory name. The stats line at the end shows how many operations were done and how long they took.
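
If you would rather review the generated commands before running them, the same idea can be split into two steps. The sketch below writes the commands to a file first; the helper name gen_get_cmds and the scratch file name are made up for this example, and the final s5cmd call is left as a comment so the snippet runs without the tool installed. It reads the file through standard input because “-f -” is the form shown above.

```shell
#!/bin/sh
# gen_get_cmds BUCKET DATE_PATH
# Prints one "get -n" command per letter prefix, mirroring the loop above.
# Hypothetical helper; the letters are spelled out so it runs in plain sh.
gen_get_cmds() {
    for x in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        echo "get -n s3://$1/${x}/$2/reports*csv.gz"
    done
}

cmd_file="${TMPDIR:-/tmp}/s5cmd-commands.txt"
gen_get_cmds reports-bkt 2018/03/14 > "$cmd_file"

head -n 2 "$cmd_file"   # inspect the first couple of generated commands
wc -l < "$cmd_file"     # one command per prefix

# When the file looks right, run everything in a single invocation:
#   s5cmd -f - < "$cmd_file"
```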

Contributing

All contributions to the project are welcome, and managed using the issue tracker at github.com/peakgames/s5cmd. If you are going to submit a PR, we suggest you open an issue first to discuss it with the team.

This is a guest post from Peak Games, which leverages S3 as part of a comprehensive pipeline that distills data into knowledge, further enhancing user experience of their world-class mobile games.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

AWS Open Source Blog
