AWS Cloud Operations Blog

Automating routine cloud operations with AWS Systems Manager and MontyCloud

IT administrators and DevOps engineers often perform routine operations to manage their cloud infrastructure and modern cloud workloads. Such tasks are known as Day-2 tasks because they produce routine outcomes for the organization. Customers often use Python scripts to perform them. Creating and managing the compute environment these scripts require, along with the ongoing administrative overhead of keeping them secure and traceable, is a growing challenge.

This blog post describes how LeadSquared, a born-in-the-cloud customer, has simplified their cloud operations with MontyCloud, powered by AWS Systems Manager.

 

Overview

With the recently launched ScriptExecution action in AWS Systems Manager Automation, customers can easily refer to any Python script within their accounts and execute it. Python scripts can be embedded in Automation documents and executed without the need for additional compute resources. Customers need to grant the required permissions and make the documents available in each AWS account and Region they would like to manage.
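To make the idea of embedding a script in an Automation document concrete, here is a minimal sketch. In Automation documents the script action is written as `aws:executeScript`. The script body, the `Name` parameter, the step name, and the runtime value are illustrative assumptions rather than anything taken from LeadSquared or MontyCloud, and the supported runtimes should be checked against the current Systems Manager documentation.

```python
import json

# Hypothetical script body; the handler name and payload key are illustrative.
SCRIPT = '''
def handler(events, context):
    # "events" carries the InputPayload values defined in the step below.
    return {"greeting": "Hello, " + events["name"]}
'''


def build_automation_document(script, handler="handler", runtime="python3.8"):
    """Return an Automation document (schema 0.3) with one aws:executeScript step."""
    return {
        "schemaVersion": "0.3",
        "description": "Runs an embedded Python script (illustrative sketch).",
        "parameters": {
            "Name": {"type": "String", "default": "world"},
        },
        "mainSteps": [
            {
                "name": "runScript",
                "action": "aws:executeScript",
                "inputs": {
                    "Runtime": runtime,  # assumption: verify supported runtimes
                    "Handler": handler,
                    "Script": script,
                    "InputPayload": {"name": "{{ Name }}"},
                },
                "outputs": [
                    {
                        "Name": "greeting",
                        "Selector": "$.Payload.greeting",
                        "Type": "String",
                    }
                ],
            }
        ],
    }


def register_document(name, document):
    """Register the document in the current account and Region.

    Requires boto3 and AWS credentials; not called in this sketch.
    """
    import boto3  # imported lazily so the builder works without AWS access

    ssm = boto3.client("ssm")
    return ssm.create_document(
        Content=json.dumps(document),
        Name=name,
        DocumentType="Automation",
        DocumentFormat="JSON",
    )


document = build_automation_document(SCRIPT)
print(json.dumps(document, indent=2))
```

Because a document registered this way exists only in one account and Region, the same registration would have to be repeated for every account and Region under management, which is the overhead the paragraph above refers to.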

MontyCloud’s Cloud Management Platform has extended this capability and made the functionality readily available across multiple AWS accounts and Regions, and in the context of customer IT departments and cloud applications. Customers can easily upload and manage a catalog of Python scripts and convert scripts into reusable tasks. Both routine operations and occasional break-glass tasks for managing unplanned operations can be easily automated with MontyCloud’s user interface and extensible APIs. MontyCloud has also enabled a simple Role-Based Access Control model to securely enable self-service capabilities. In just a few clicks, administrators can resolve issues and deliver efficient Day-2 operations.

Customers like LeadSquared demand a high degree of automation in cloud operations and better efficiency in enabling modern DevOps. Before the launch of this feature, LeadSquared was creating and managing several Lambda functions to accomplish their goals. LeadSquared also had to perform additional overhead tasks to centrally manage, provision, and secure custom Python scripts. To enable access to various users for self-service capabilities, LeadSquared had to use a hands-on and cumbersome ticket-based approach that wasn’t scalable.

To better manage their growing AWS Cloud infrastructure and mission-critical applications, LeadSquared was looking to improve the efficiency of their Day-2 operations. Typical Day-2 tasks performed by LeadSquared can be classified into the following three categories:

  • Routine recurring tasks
  • On-demand Day-2 operations
  • Tasks required for critical break-glass scenarios

MontyCloud has empowered cloud IT teams in LeadSquared to perform operations in all these categories. The platform is powered by native AWS services such as AWS Systems Manager, AWS Lambda, AWS CloudFormation, and Amazon SQS and, in particular, by the ScriptExecution action in AWS Systems Manager.

 

A routine task: fetching Amazon CloudWatch metrics for a mission-critical cloud application

LeadSquared uses Amazon SQS queues in their distributed application to decouple tasks and execute units of work. Their application uses over 100 queues and is provisioned to dynamically and horizontally scale.

Tracking every queue individually and analyzing its usage and performance throughout the day was a routine task that took over two hours to complete. To optimize this operation, LeadSquared developed a Python script that queries Amazon CloudWatch for multiple queues and aggregates the data. The script also analyzes the aggregated data and reports critical metrics that help assess the performance of both individual queues and the overall application.
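LeadSquared's actual script is not shown in this post, but the pattern it describes can be sketched roughly as follows. The choice of metric (`NumberOfMessagesSent`), the 24-hour window, the report fields, and the `queues` payload key are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def queue_message_totals(cloudwatch, queue_names, hours=24,
                         metric="NumberOfMessagesSent"):
    """Sum one SQS metric per queue over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    totals = {}
    for name in queue_names:
        response = cloudwatch.get_metric_statistics(
            Namespace="AWS/SQS",
            MetricName=metric,
            Dimensions=[{"Name": "QueueName", "Value": name}],
            StartTime=start,
            EndTime=end,
            Period=3600,  # one datapoint per hour
            Statistics=["Sum"],
        )
        totals[name] = sum(point["Sum"] for point in response["Datapoints"])
    return totals


def summarize(totals, top_n=5):
    """Aggregate per-queue totals into an application-level report."""
    busiest = sorted(totals, key=totals.get, reverse=True)[:top_n]
    return {
        "queues": len(totals),
        "total_messages": sum(totals.values()),
        "busiest": busiest,
        "idle": sorted(name for name, value in totals.items() if value == 0),
    }


def handler(events, context):
    """Entry point in the (events, context) shape used by Automation scripts.

    `events["queues"]` is a hypothetical payload key holding queue names.
    """
    import boto3  # available in the Automation script runtime

    totals = queue_message_totals(boto3.client("cloudwatch"), events["queues"])
    return summarize(totals)
```

Passing the CloudWatch client in as an argument keeps the aggregation logic testable without AWS access. With over 100 queues, a batched call such as `get_metric_data` may be a better fit than one request per queue.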

The script was originally configured as an AWS Lambda function triggered by an Amazon CloudWatch Events rule scheduled to run daily. LeadSquared used MontyCloud to convert this script into a centrally managed task and made it available to multiple users in their environment.

 

An on-demand task: cloning an Amazon Aurora database

LeadSquared uses Amazon Aurora databases to power several critical cloud applications. When an application experiences performance issues, they create a clone of their Aurora DB on the AWS Management Console to perform analysis and develop fixes. With the introduction of the new AWS Systems Manager feature, LeadSquared has automated this operation by designing a Python script that creates a clone of their Aurora DB and has exposed it as an on-demand task via MontyCloud.
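A cloning script of this kind might look roughly like the sketch below, which uses the Amazon RDS API's copy-on-write restore to create the clone. The identifiers, the instance class, and the instance naming scheme are assumptions for illustration, not LeadSquared's actual values.

```python
def clone_aurora_cluster(rds, source_cluster_id, clone_cluster_id,
                         instance_class="db.r5.large"):
    """Create a copy-on-write clone of an Aurora cluster and add one instance."""
    cluster = rds.restore_db_cluster_to_point_in_time(
        DBClusterIdentifier=clone_cluster_id,
        SourceDBClusterIdentifier=source_cluster_id,
        RestoreType="copy-on-write",  # clone instead of a full copy
        UseLatestRestorableTime=True,
    )["DBCluster"]

    # The cloned cluster has no compute until an instance is added to it.
    instance_id = clone_cluster_id + "-instance-1"  # hypothetical naming scheme
    rds.create_db_instance(
        DBInstanceIdentifier=instance_id,
        DBClusterIdentifier=clone_cluster_id,
        DBInstanceClass=instance_class,  # assumption: size to your workload
        Engine=cluster["Engine"],
    )
    return {"cluster": clone_cluster_id, "instance": instance_id}


def handler(events, context):
    """Automation entry point; the payload keys are hypothetical."""
    import boto3  # available in the Automation script runtime

    return clone_aurora_cluster(
        boto3.client("rds"), events["source"], events["clone"]
    )
```

The clone shares storage with the source cluster until either side writes new data, so it is typically faster and cheaper to create than a snapshot restore, which suits short-lived analysis environments.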

 

A break-glass scenario: managing a firewall

LeadSquared uses a third-party Web Application Firewall (WAF), behind which over 15 of their web applications and RESTful web services are configured. Traffic first traverses the firewall and then reaches their network. A few months ago, an inadvertent configuration change caused the firewall to malfunction. The impact on LeadSquared was significant, as their critical cloud applications were inaccessible. It took LeadSquared over 20 minutes to reroute traffic away from the firewall and directly to their network because this task had to be performed manually for each application.

Since that incident, LeadSquared has designed a Python script that easily redirects traffic away from the firewall and directly to their network by modifying the DNS entries in Route 53.
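The post does not include that script, but a minimal sketch of the approach could look like this. It assumes each application is fronted by a CNAME record that can be repointed from the firewall to a direct origin. The hostnames, the hosted zone ID, the record type, and the TTL are all hypothetical.

```python
def build_bypass_changes(records, ttl=60):
    """Build UPSERT changes that repoint each hostname to its direct origin."""
    return [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": hostname,
                "Type": "CNAME",  # assumption: applications use CNAME records
                "TTL": ttl,
                "ResourceRecords": [{"Value": origin}],
            },
        }
        for hostname, origin in sorted(records.items())
    ]


def bypass_firewall(route53, hosted_zone_id, records):
    """Apply all record changes in a single Route 53 change batch."""
    response = route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Break-glass: route traffic around the WAF",
            "Changes": build_bypass_changes(records),
        },
    )
    return response["ChangeInfo"]["Id"]


# Hypothetical mapping of public hostnames to direct origins.
EXAMPLE_RECORDS = {
    "app1.example.com": "app1-origin.example.com",
    "api.example.com": "api-origin.example.com",
}
```

Submitting every record in one change batch means the applications are rerouted together instead of one at a time, which is what made the manual procedure slow. A short TTL helps the change propagate quickly, at the cost of more DNS queries.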

Before this feature was available from Systems Manager, custom Python scripts were configured as AWS Lambda functions accessed through API gateways. Due to the critical nature of some tasks, IT administrators had to protect the details of API endpoints and were unable to decentralize tasks. With this feature, customers no longer have to manage custom Lambda functions or API gateways. Instead, they simply upload their scripts and turn them into highly reusable automations.

 

Conclusion

With MontyCloud’s platform, LeadSquared automated the operations in all three cases without provisioning compute environments or writing additional custom code. The DevOps team routinely creates and uploads custom Python scripts into MontyCloud’s platform interface, as seen in the following screenshot.

 

They save several hours each week with this feature while avoiding creating and managing compute environments or additional IAM roles. LeadSquared has also enabled self-service remediations for their internal users without giving away control.

Additionally, MontyCloud customers can easily perform routine operations such as de-registering AMIs, scheduling DynamoDB backups, and copying objects between S3 buckets, as well as any other Day-2 operations. To learn more about this feature and about MontyCloud’s intelligent Cloud Management Platform, please go to https://www.montycloud.com.

 

About the Author

Varsha Mallya is a Senior Platform Engineer at MontyCloud. Varsha enjoys identifying customer value proposition, defining end-to-end user stories, and developing customer-centric UX. In her spare time, Varsha enjoys exploring different genres of music and learning new technologies.

MontyCloud is a Seattle, WA-based intelligent Cloud Management Platform company. MontyCloud’s Cloud Management Platform helps traditional IT departments transform into cloud powerhouses.

The content and opinions in this blog are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.