AWS Startups Blog
Distributed Job Scheduling for AWS
Guest post by Rajat Bhargava, JumpCloud
Recently, Medium CTO Don Neufeld highlighted the need for a distributed job scheduling system on AWS. As more and more companies use AWS for their infrastructure, they need tools to execute tasks and schedule those tasks across the server infrastructure. A key component of that process is orchestrating server tasks across an AWS implementation. Historically, this has been done through scripting and cron.
While that may still work, a better system is needed. As more workloads move to the cloud and organizations focus on a one server, one task model enabling easy horizontal scaling, the need for a distributed scheduling system is more critical than ever. Unfortunately, executing a set of tasks across a wide-area server infrastructure is still not easy. Examples of issues that occur include managing access to devices, orchestrating a sequence of tasks, and building Boolean logic into task execution. Juggling these problems forces organizations into manual execution of tasks (or at minimum some manual involvement), which greatly diminishes the leverage that the cloud provides.
In this post, we’ll analyze three different options for executing a set of distributed jobs across an AWS infrastructure: scripts and cron, open source tools, and finally one commercial option.
Scripts and Cron
One way to handle a set of distributed tasks is to leverage cron to build a schedule of execution tasks. A tried and true tool since the 1970s, cron ends up being the core of scheduling in the *NIX world. Unfortunately, cron doesn’t have the concept of distributed job execution. Cron is built for executing tasks on one particular server. Also, visibility with cron leaves a lot to be desired. Unless you are willing to write more code around cron, you don’t get reports that tell you whether or not a job was completed successfully.
To the extent that you want to create a “distributed” scheduling process, you’ll need to write code or at least sequence your events properly. Coding is required if you want to chain events together and ensure that each was completed before the next step is kicked off. Admins end up mixing scripting and cron and go as far as they think is practical with scripting before turning to the good, old-fashioned manual execution.
Complex scenarios take time to code. Although it may not be pretty—most of the time you get a one-off solution that’s difficult to reuse—it does end up working. The challenge with this approach is that it is relatively fragile; if things don’t go just right, the system can’t adjust. Although this approach is the most accessible to admins, it’s no wonder that Dan wants something better.
Open Source Alternatives
Two open source approaches that can be leveraged to execute a set of distributed tasks are Chronos and Luigi. Chronos is a distributed execution system meant to replace cron. It is also fault tolerant and lives on top of Mesos, the Apache cluster manager. With Chronos, you can schedule a pipeline of tasks across your entire infrastructure, wherever it may live. The system is able to execute tasks based on previously completed ones, and includes a mechanism to notify Chronos of individual task failures. The blog post announcing Chronos touts other benefits:
“Chronos … allows you to schedule your jobs using ISO8601 repeating interval notation, which enables more flexibility in job scheduling. Chronos also supports the definition of jobs triggered by the completion of other jobs, and it also supports arbitrarily long dependency chains.”
Although Chronos is a significant step up over manual scripts or cron, it still requires some manual work to implement. Further, because Chronos requires Apache Mesos to manage communications and resource allocation, it requires the installation and configuration of Mesos throughout your network.
Another open source system that can handle a pipeline of batch jobs is Luigi. Luigi is Python-based and, like Chronos, can handle dependencies and errors. Like Chronos, the impetus for Luigi was to handle a complex set of database or data manipulation tasks. Luigi does have native support for some database tasks such as MapReduce jobs in Hadoop.
The JumpCloud Option
Commercial entities are beginning to recognize the critical problem of building task workflows in today’s cloud environments. Although some of the world’s largest software makers offer enterprise software for executing complex pipelines of tasks, we are going to focus on a SaaS-based solution that works closely with AWS called JumpCloud. JumpCloud, based on AWS infrastructure, syncs instance IDs to ensure that you know which EC2 instances you are working with when executing tasks across your infrastructure. JumpCloud is an AWS partner and Activate sponsor.
AWS customers are building complex infrastructures and then trying to automate the management and execution of infrastructure tasks. This is exactly the problem JumpCloud is trying to solve. We call it server management, although it just as easily could be called server orchestration, job scheduling, task and workflow automation, database / data manipulation, or any number of other things.
How JumpCloud Automates Complex Workflows with Server Orchestration Tool
With JumpCloud, you can easily build a complex workflow of tasks. You treat tasks like building blocks that you chain together or chain to multiple others. Webhooks can trigger events to start. You can also “join” or “split” tasks so that you can leverage distributed and scaled infrastructure. For example, if you need all your database servers to finish indexing a table before you can execute a report, the JumpCloud join feature ensures that all your database servers are done indexing before moving to the next step. JumpCloud can also let you execute n number of processes as one step.
For example, if you’ve got logs on multiple EC2 instances that you want to clean up each time right before you restart your web servers (running on a different set of EC2 instances), you can do so easily with JumpCloud. In the diagram below, you can see how you define a RotateLogs trigger, which executes against one set of servers. As that log rotate job completes across the EC2 instances, the next job, named “WebServerRestart” can start. JumpCloud takes care of waiting for all the “RotateLogs” jobs to complete before starting the next step. While this is a very simple and straightforward example, you can create workflows across your EC2 instances that are as complex as you need.
JumpCloud’s functionality is powerful and can help you automate a whole workflow quickly and easily. The benefit of JumpCloud is that you won’t have to write the plumbing and manage the execution of your tasks.
As an AWS Activate partner, you can try JumpCloud for free. If you are an Activate member, you can get a 60-day free trial. And if you are a Portfolio member of Activate, you get 90 days free.
Creating, managing, and executing a series of jobs across your AWS infrastructure can be a daunting task. Most admins have taken the approach of leveraging cron and perhaps scripting some code around that. Others prefer to implement open source alternatives. For those that are interested in a commercial alternative, take a look at AWS Activate partner JumpCloud.