AWS for M&E Blog
Accelerate Thinkbox Deadline by bursting to the cloud with Amazon File Cache
As the prevalence and complexity of computer-generated images (CGI) have increased in film, TV, and commercials over the years, so has the industry's demand for massive render farm compute to process and render CGI elements. Render farms dramatically reduce render times by dividing large renders into smaller tasks and distributing those tasks among a large number of compute instances called "render nodes" for parallel processing. The effectiveness of a render farm is directly linked to the number of nodes and their combined compute power. Building a render farm at the scale required by modern VFX and animation workloads can be very expensive and difficult to maintain on-premises. Therefore, many studios augment their compute resources by running renders in AWS.
AWS Thinkbox Deadline, one of the most popular tools for administering and orchestrating render farms, has built-in AWS integrations like the Spot Event Plugin, which makes managing Amazon Elastic Compute Cloud (Amazon EC2) compute resources easy. Deadline can be further extended with custom scripts to work with services like Amazon File Cache, which, as a previous blog post showed, can dramatically reduce render times by eliminating file transfer bottlenecks. In this blog post, we provide an overview of how to configure an existing Deadline-managed render farm to work seamlessly with an Amazon File Cache cache hydrated from an on-premises NFS file server. The cache makes sure that digital assets are available to render nodes in the cloud without disrupting existing workflows, giving studios the flexibility to seamlessly burst into the cloud to meet their rendering needs.
Solution overview
The solution uses a Deadline Event Plugin and a Deadline Task Script to automate the data hydration, data eviction, and data write-back. Data hydration is the syncing of data to the cache from a linked data repository, such as the on-premises NFS file server. Data eviction is the process of releasing stale data from the cache. And data write-back is the syncing of new or modified files from the cache to on-premises storage. Using these concepts correctly makes sure that renders include the latest revisions of digital assets residing on-premises, and that the final rendered image is saved back on-premises.
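On a Lustre-based cache such as Amazon File Cache, these three operations correspond to the lfs hsm commands that the scripts later in this post run against files on the cache mount. The paths below are placeholders:

sudo lfs hsm_restore /mnt/cache/ns1/projects/shot010/asset.abc     # hydration: pull the file from the data repository into the cache
sudo lfs hsm_release /mnt/cache/ns1/projects/shot010/asset.abc     # eviction: release the cached copy while keeping the archived one
sudo lfs hsm_archive /mnt/cache/ns1/renders/shot010/frame.0001.exr # write-back: archive the file back to the data repository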
Prerequisites
Complete the following prerequisites before continuing with this post.
- Make sure that your NFS network share is accessible from your VPC using AWS Direct Connect or VPN.
- Set up your Deadline render farm to use the Spot Event Plugin to provision Amazon EC2 Spot render nodes. Refer to the Deadline Documentation for instructions.
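Before moving on, it can help to confirm from inside the VPC that the NFS exports are reachable over the Direct Connect or VPN link. A minimal check from a Linux instance in the VPC might look like the following; the server hostname and export path are placeholders for your environment:

sudo yum install -y nfs-utils                                        # provides showmount and the NFS client
showmount -e onprem-nfs.example.com                                  # list the exports visible from the VPC
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs onprem-nfs.example.com:/projects /mnt/nfs-test     # optionally test-mount an export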
1. Create an Amazon File Cache
For guidance, follow the instructions in this blog post on linking Amazon File Cache to on-premises file systems.
2. Connect your worker nodes to the cache
In this section, we edit the User Data of our Spot Instances to mount the cache on startup, and configure Path Mapping in Deadline to automatically map file paths from the NFS file system to the cache.
- Navigate to the Caches section of the Amazon FSx Service in the AWS Management Console and select your cache.
- Select Attach.
- Note the prerequisites and copy the mount command.
- Edit your spot fleet's user data script to include the mount command (a minimal sketch follows this list).
- How you should edit the User Data depends on how the configuration is being managed. For example, if the configuration is being managed by an RFDK template, then the User Data should also be managed through RFDK.
- (Optional) Create a Deadline region to separate the Amazon EC2 spot nodes from the rest of the render farm.
- Navigate to Tools > Configure Repository Options > Region Settings, and select the Add button.
- Create a ruleset to apply the region to our Amazon EC2 spot nodes by navigating to Tools > Configure Repository Options > Auto Configuration, and selecting the Add button. Follow the documentation on rulesets to properly configure the ruleset.
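As referenced in the user data step above, a minimal User Data sketch that installs the Lustre client and mounts the cache at boot might look like the following. The cache DNS name, mount name, and mount point are placeholders; use the values shown in the Attach dialog for your own cache:

#!/bin/bash
# Install the Lustre client (Amazon Linux 2 shown; follow the Attach dialog's prerequisites for other distributions)
sudo amazon-linux-extras install -y lustre
# Create a mount point and mount the cache using the command copied from the console
sudo mkdir -p /mnt/cache
sudo mount -t lustre -o relatime,flock fc-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/abcdefgh /mnt/cache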
To make sure that the render nodes look for the files on the cache instead of their original location on the cache’s data repository, we must create new path mapping rules in Deadline with the following steps:
- Open Deadline Monitor.
- Select Tools > Super User Mode to enable Super User Mode if it isn’t already on.
- Select Tools > Configure Repository Options.
- Navigate to the Path Mapping section.
- Create a new rule that maps the path to the NFS file server on the submitting client to the path of the cache on the worker nodes. Make sure to include the data repository association's cache path (an illustrative rule follows this list).
- To support hybrid render farms, limit this path mapping to Amazon EC2 spot workers by specifying the correct region.
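As an illustration only, suppose the submitting workstations see the NFS share at /mnt/nfs/projects, the cache is mounted on the workers at /mnt/cache, and the data repository association uses the cache path /ns1. The rule would then map roughly as follows (field names vary slightly between Deadline versions; substitute your own paths):

Path to replace (submitting client):  /mnt/nfs/projects
Replacement path (spot workers):      /mnt/cache/ns1/projects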
3. Create a Deadline Event Plugin
Amazon File Cache automatically hydrates files missing from the cache and evicts old files as the cache fills up. If files in the data repository change between renders, then an event plugin can be used to evict and rehydrate stale files on demand.
To create a new Event Plugin to evict and rehydrate files before each render, follow these steps:
- Create a folder in the “events” folder of your Deadline Repository named “FileCacheHandler”.
- Create a parameters file inside the "FileCacheHandler" folder named "FileCacheHandler.param" that specifies the editable parameters of the event plugin. The following is a minimal sketch: the section names are the keys the script reads and the labels match the fields referenced in the Configure Events steps later in this post; adjust categories, defaults, and descriptions to suit your setup:
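[State]
Type=Enum
Items=Global Enabled;Opt-In;Disabled
Category=Options
CategoryOrder=0
Index=0
Label=State
Default=Disabled
Description=How this event plugin responds to events. Set to Global Enabled so every submitted job triggers the plugin.

[Paths]
Type=String
Category=Options
CategoryOrder=0
Index=1
Label=Root Paths
Default=
Description=Semicolon-separated list of root paths. Jobs whose output lands under one of these paths receive cache handling.

[PostTaskScript]
Type=Filename
Category=Options
CategoryOrder=0
Index=2
Label=Post Task Script
Default=
Description=Path, accessible to the workers, of the post task script that writes rendered frames back to the data repository.

[CacheEviction]
Type=Boolean
Category=Options
CategoryOrder=0
Index=3
Label=Cache Data Eviction
Default=True
Description=Release files under the matching root paths from the cache before the render starts.

[CacheHydration]
Type=Boolean
Category=Options
CategoryOrder=0
Index=4
Label=Cache Data Hydration
Default=True
Description=Restore files under the matching root paths to the cache before the render starts.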
- Create a Python script named "FileCacheHandler.py" inside the "FileCacheHandler" folder, like the following example. The script identifies jobs that output to the cache, attaches a post task script to them, and creates file eviction and hydration jobs as needed.
# Imports
import os
import tempfile
from datetime import datetime

from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import ClientUtils, RepositoryUtils


# Functions
def GetDeadlineEventListener():
    """Called automatically by Deadline to get an instance of the File Cache Handler Listener

    Returns:
        FileCacheHandlerListener: An instance of the File Cache Handler Listener
    """
    return FileCacheHandlerListener()


def CleanupDeadlineEventListener(eventListener):
    """Called automatically by Deadline to clean up the listener

    Args:
        eventListener (DeadlineEventListener): Should be the same instance created by
            GetDeadlineEventListener() earlier
    """
    eventListener.Cleanup()


# Classes
class FileCacheHandlerListener(DeadlineEventListener):
    """Create a new instance of the File Cache Handler Listener

    Args:
        DeadlineEventListener (DeadlineEventListener): The DeadlineEventListener base class that
            Deadline expects
    """

    def __init__(self):
        self.OnJobSubmittedCallback += self.OnJobSubmitted

    def Cleanup(self):
        del self.OnJobSubmittedCallback

    def OnJobSubmitted(self, job):
        self.LogInfo("File Cache Handler: OnJobSubmitted")
        paths = self.GetConfigEntryWithDefault("Paths", "")
        postTask = self.GetConfigEntryWithDefault("PostTaskScript", "")
        # Return early if we have empty paths or post scripts are already set
        if job.JobPostTaskScript or not paths or not postTask:
            return

        # Check for a match with one of our paths
        matchingPaths = []
        for p in paths.split(";"):
            # Skip empty entries before abspath() turns them into the current directory
            if not p.strip():
                continue
            path = os.path.abspath(p)
            self.LogInfo("File Cache Handler: Checking Path " + path)
            for d in job.JobOutputDirectories:
                outputDir = os.path.abspath(d)
                try:
                    if os.path.commonpath([path, outputDir]) == os.path.commonpath(
                        [path]
                    ):
                        matchingPaths.append(path.replace("\\", "/"))
                except ValueError:
                    # os.path.commonpath() raises an error if the inputs aren't
                    # part of the same root drive. We quietly ignore that case
                    # and let other errors bubble up.
                    pass
        if not matchingPaths:
            return

        postTask = postTask.replace("\\", "/")
        if self.GetBooleanConfigEntryWithDefault(
            "CacheEviction", True
        ) or self.GetBooleanConfigEntryWithDefault("CacheHydration", True):
            # If the job is not already part of a batch,
            # we make it part of its own batch
            if job.JobBatchName == "":
                job.JobBatchName = "{name} Batch {date}".format(
                    name=job.JobName, date=datetime.now().isoformat()
                )
            self.LogInfo(
                "File Cache Handler: Cache Data Eviction or Cache Data "
                "Hydration Enabled. Creating Cache Handling Job"
            )
            oldJobInfoFilename = os.path.join(
                ClientUtils.GetDeadlineTempPath(), "old_job_info.job"
            )
            oldPluginInfoFilename = os.path.join(
                ClientUtils.GetDeadlineTempPath(), "old_plugin_info.job"
            )
            self.LogInfo("File Cache Handler: Creating Cache Job Files")
            RepositoryUtils.CreateJobSubmissionFiles(
                job, oldJobInfoFilename, oldPluginInfoFilename
            )
            jobInfoFilename = ""
            pluginInfoFilename = ""
            with tempfile.NamedTemporaryFile(
                mode="w", dir=ClientUtils.GetDeadlineTempPath(), delete=False
            ) as jobWriter:
                # Put the plugin on the first line
                jobInfoFilename = jobWriter.name
                jobWriter.write("Plugin=CommandLine\n")
                with open(oldJobInfoFilename) as oldJobInfo:
                    self.LogInfo("File Cache Handler: Reading old cache files")
                    for line in oldJobInfo:
                        key = line.split(sep="=", maxsplit=1)[0]
                        if key in [
                            "Plugin",
                            "Frames",
                            "LimitGroups",
                            "OverrideJobFailureDetection",
                            "FailureDetectionJobErrors",
                        ] or key.startswith("Output"):
                            continue
                        else:
                            # Lines read from the file keep their trailing newline,
                            # so normalize it instead of appending a second one
                            jobWriter.write(line.rstrip("\n") + "\n")
                jobWriter.write("Frames=0\n")
                jobWriter.write("FailureDetectionJobErrors=1\n")
                jobWriter.write("OverrideJobFailureDetection=True\n")
            with tempfile.NamedTemporaryFile(
                mode="w", dir=ClientUtils.GetDeadlineTempPath(), delete=False
            ) as pluginWriter:
                pluginInfoFilename = pluginWriter.name
                pluginWriter.write("Arguments=")
                for matchingPath in matchingPaths:
                    # hsm_release and hsm_restore operate on regular files, so match files
                    if self.GetBooleanConfigEntryWithDefault("CacheEviction", True):
                        pluginWriter.write(
                            'nohup find "{path}" -type f -print0 | xargs -0 -n 1 {cmd} ; '.format(
                                path=matchingPath, cmd="sudo lfs hsm_release"
                            )
                        )
                    if self.GetBooleanConfigEntryWithDefault("CacheHydration", True):
                        pluginWriter.write(
                            'nohup find "{path}" -type f -print0 | xargs -0 -n 1 {cmd} ; '.format(
                                path=matchingPath, cmd="sudo lfs hsm_restore"
                            )
                        )
                pluginWriter.write("\n")
                pluginWriter.write("Executable=bash\n")
                pluginWriter.write("Shell=bash\n")
                pluginWriter.write("ShellExecute=True\n")
                pluginWriter.write("SingleFramesOnly=False\n")
                pluginWriter.write("StartupDictionary=\n")
            evictionJob = RepositoryUtils.SubmitJob(
                [jobInfoFilename, pluginInfoFilename]
            )
            evictionJob.JobName = job.JobName + " Cache Handling"
            RepositoryUtils.SaveJob(evictionJob)
            job.SetJobDependencyIDs([evictionJob.JobId])
            job.JobResumeOnCompleteDependencies = True
            job.JobResumeOnDeletedDependencies = True
            job.JobResumeOnFailedDependencies = True
            RepositoryUtils.PendJob(job)
        else:
            self.LogInfo(
                "File Cache Handler: Cache Data Eviction and Hydration "
                "Disabled. Skipping to next step"
            )
        self.LogInfo(
            "File Cache Handler: Attaching Post Task Script {script} to {job}".format(
                script=postTask, job=job.JobName
            )
        )
        job.JobPostTaskScript = postTask
        RepositoryUtils.SaveJob(job)
- Create the post task script in a path accessible to the render workers, such as on the Amazon File Cache itself. The following example, named "WriteBackPostScript.py", uses Deadline's path mapping to locate each rendered frame on the cache and archives it back to the data repository:
# Imports
import os
import subprocess

from Deadline.Scripting import FrameUtils, RepositoryUtils


# Main
def __main__(deadlinePlugin, *args):
    task = deadlinePlugin.GetCurrentTask()
    job = deadlinePlugin.GetJob()
    outputDirectories = job.OutputDirectories
    outputFilenames = job.OutputFileNames
    for i in range(0, len(outputDirectories)):
        outputDirectory = outputDirectories[i]
        outputFilename = outputFilenames[i]
        for frameNum in task.TaskFrameList:
            # Build the output path and normalize the separators
            outputPath = os.path.join(outputDirectory, outputFilename)
            outputPath = outputPath.replace("//", "/")
            outputPath = outputPath.replace("\\", "/")
            # Apply Deadline path mapping so the path points at the cache mount
            mappedOutputPath = RepositoryUtils.CheckPathMapping(outputPath)
            deadlinePlugin.LogInfo(
                "Mapping Path: {path} to {mappedPath}".format(
                    path=outputPath, mappedPath=mappedOutputPath
                )
            )
            deadlinePlugin.LogInfo("Frame: {frameNum}".format(frameNum=frameNum))
            # Swap the frame padding in the filename for the rendered frame number
            mappedOutputPath = FrameUtils.ReplacePaddingWithFrameNumber(
                mappedOutputPath, frameNum
            )
            deadlinePlugin.LogInfo(
                "Writing back file: {path}".format(path=mappedOutputPath)
            )
            # Archive the frame so the cache writes it back to the data repository
            subprocess.run(["sudo", "lfs", "hsm_archive", mappedOutputPath])
- Select Tools > Synchronize Monitor Scripts and Plugins.
- Select Tools > Configure Events.
- Select the FileCacheHandler Event.
- Set the State to “Global Enabled”.
- Enter a semicolon-separated list of directories in Root Paths (the event plugin script splits this value on ";"). All files and subdirectories of root paths that match the render job are modified on cache data eviction and hydration, so each path should only encapsulate the files that could change between render submissions.
- Enable both Cache Data Eviction and Cache Data Hydration by setting both to True.
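With the event plugin enabled, any submission whose output lands under one of the Root Paths is pended behind a dependent "Cache Handling" CommandLine job. For a hypothetical root path of /mnt/cache/ns1/projects/showA, the generated Arguments line looks roughly like this:

nohup find "/mnt/cache/ns1/projects/showA" -type f -print0 | xargs -0 -n 1 sudo lfs hsm_release ; nohup find "/mnt/cache/ns1/projects/showA" -type f -print0 | xargs -0 -n 1 sudo lfs hsm_restore ;

When that job completes, the render job resumes, and the attached post task script archives each rendered frame back to the on-premises file server.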
Clean up
In this section, we execute the steps necessary to delete a cache without losing any data.
- Connect to the terminal of a computer that has the Amazon File Cache cache mounted.
- Check the status of the files on the cache with the following command. It returns the number of files on the cache with changes that haven't been written back to the data repository:
nohup find <path/to/cache> -type f -print0 | xargs -0 -n 1 sudo lfs hsm_state | awk '!/\<archived\>/ || /\<dirty\>/' | wc -l
- If the check shows that some files are dirty or unarchived, run the following command to archive them individually:
sudo lfs hsm_archive <path/to/unarchived/file>
- To archive an entire folder, run the following command:
nohup find <path/to/folder> -type f -print0 | xargs -0 -n 1 sudo lfs hsm_archive &
- Repeat the check and archive steps until no files have unsaved changes. This may take some time depending on how much data must be written back.
- Navigate to the Caches section of the Amazon FSx Service in the Console.
- Select your cache.
- Select Actions > Delete cache.
- Type the name of your cache and select Delete.
- Remove the lines pertaining to the cache from the user data of your spot workers.
- Select Tools > Configure Events.
- Select the FileCacheHandler Event.
- Disable the Event Plugin by setting the State to “Disabled” and select OK.
- Select Tools > Configure Repository Options in the Deadline Monitor.
- Select Mapped Paths.
- Select any rules pertaining to your cache, and select Remove.
Conclusion
In this post, we explored a solution that uses Amazon File Cache and Deadline scripts to automatically sync digital assets between on-premises storage and a cloud cache. This frictionless integration between AWS and on-premises hardware lets studios preserve their current workflows while retaining the ability to tap into AWS compute whenever their rendering needs exceed on-premises capacity. We hope that this solution and solutions like it empower our customers to continue to push the boundaries of what's possible in the world of computer-generated graphics.