Optimizing Amazon Elastic Container Service for cost using scheduled scaling

Elasticity and cost are major factors in an organization's operational efficiency, which in turn drives business transformation and agility. Elasticity is the ability of the infrastructure (including the application) to scale out and scale in seamlessly based on load. This is also called auto scaling. When the scale out or scale in happens on a schedule, it is called scheduled auto scaling. This is critical for customers who spin up resources at the start of an activity and spin them down at the end of it. It helps manage the extra load on the system during peak times, and it directly reduces cost because the extra infrastructure is scaled in when not in use. This post combines Amazon Elastic Container Service (Amazon ECS) scheduled scaling with capacity providers and Spot integration into a simple strategy for cost optimization.

What is an ECS capacity provider?

Amazon ECS capacity providers enable you to manage the infrastructure that the tasks in your clusters use. Each cluster can have one or more capacity providers and an optional default capacity provider strategy. The capacity provider strategy determines how the tasks are spread across the capacity providers in a cluster. When you run a task or create a service, you may use the cluster’s default capacity provider strategy or specify a capacity provider strategy that overrides the cluster’s default strategy.

ECS capacity providers consist of two components: the capacity provider and the capacity provider strategy.

A capacity provider is used in association with a cluster to determine the infrastructure that a task runs on. For Amazon ECS on Fargate users, the FARGATE and FARGATE_SPOT capacity providers are provided automatically. For more information, see using AWS Fargate capacity providers. For Amazon ECS on Amazon EC2 users, a capacity provider consists of a name, an Auto Scaling group, and the settings for managed scaling and managed termination protection. This type of capacity provider is used in cluster auto scaling. For more information, see Auto Scaling group capacity providers. One or more capacity providers are specified in a capacity provider strategy, which is then associated with a cluster as well as a service.

A capacity provider strategy gives you control over how your tasks use one or more capacity providers. When you run a task or create a service, you specify a capacity provider strategy. A capacity provider strategy consists of one or more capacity providers with an optional base and weight specified for each provider. The base value designates how many tasks, at a minimum, to run on the specified capacity provider. Only one capacity provider in a capacity provider strategy can have a base defined. The weight value designates the relative percentage of the total number of launched tasks that should use the specified capacity provider. For example, if you have a strategy that contains two capacity providers, and both have a weight of 1, then after the base is satisfied, the tasks will be split evenly across the two capacity providers. Using that same logic, if you specify a weight of 1 for capacityProviderA and a weight of 4 for capacityProviderB, then for every one task that is run using capacityProviderA, four tasks would use capacityProviderB.
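The base and weight rules above can be sketched as a small simulation. This is a simplified model for building intuition, not the placement logic Amazon ECS actually runs: the function name split_tasks, the one-task-at-a-time distribution, and the tie-breaking rule (the provider listed first wins) are all assumptions made for illustration.

```python
def split_tasks(total_tasks, strategy):
    """Return {provider_name: task_count} under a simplified base/weight model.

    strategy is a list of dicts with the keys "capacityProvider", "weight",
    and an optional "base", mirroring the CLI shorthand used in this post.
    This is an illustrative model, not the real ECS scheduler.
    """
    names = [item["capacityProvider"] for item in strategy]
    weights = {item["capacityProvider"]: item.get("weight", 0) for item in strategy}
    counts = dict.fromkeys(names, 0)
    remaining = total_tasks

    # Step 1: satisfy the base before any weights are considered.
    for item in strategy:
        placed = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += placed
        remaining -= placed

    # Step 2: hand out the remaining tasks one at a time to the provider that
    # is furthest below its weighted share. Assumption: ties go to the
    # provider listed first in the strategy.
    weighted = dict.fromkeys(names, 0)
    eligible = [name for name in names if weights[name] > 0]
    if remaining and not eligible:
        raise ValueError("at least one provider needs a weight above 0")
    for _ in range(remaining):
        name = min(eligible, key=lambda n: weighted[n] / weights[n])
        weighted[name] += 1
        counts[name] += 1
    return counts
```

With a weight of 1 for capacityProviderA and a weight of 4 for capacityProviderB, split_tasks(10, ...) places 2 tasks on capacityProviderA and 8 on capacityProviderB in this model. Adding a base of 2 to one of two equally weighted providers gives that provider 6 of 10 tasks: 2 from the base plus half of the remaining 8.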

A default capacity provider strategy is associated with each Amazon ECS cluster. This determines the capacity provider strategy the cluster will use if no other capacity provider strategy or launch type is specified when running a task or creating a service.

Solution Overview

Infrastructure setup
[Note: All commands below are run in us-east-2. Please update the region accordingly as per your specific requirements]

1. Save the CloudFormation template below in a file called ecs-cp-infra.yaml.

AWSTemplateFormatVersion: 2010-09-09
Description: This template creates an empty ECS cluster along with a Spot and OnDemand Capacity provider 
Parameters: 
  InstanceType:
    Type: String
    Default: t2.small
    AllowedValues: 
      - t2.micro
      - t2.small
      - m4.large
    Description: Enter t2.micro, t2.small, or m4.large. Default is t2.small

  ECSAMI:
    Description: AMI ID
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id

  ClusterName:
    Description: Cluster Name
    Type: String
    Default: SchTestCluster

Resources: 
  myVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsSupport: true
      EnableDnsHostnames: true

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: Test

  InternetGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref myVPC

  mySubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId:
        Ref: myVPC
      CidrBlock: 10.0.0.0/24
      MapPublicIpOnLaunch: true

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref myVPC

  DefaultPublicRoute:
    Type: AWS::EC2::Route
    DependsOn: InternetGatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  mySubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref mySubnet1

  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
        GroupDescription: Allow http and https to client host
        VpcId:
           Ref: myVPC
        SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

  EcsServiceLinkedRole:
    Type: 'AWS::IAM::ServiceLinkedRole'
    Properties:
      AWSServiceName: ecs.amazonaws.com
      Description: "Role to enable Amazon ECS to manage your cluster."

  ecsInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ec2.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: "/"

  RolePolicies:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: ecsInstance
      PolicyDocument:
        Statement:
        - Effect: Allow
          Action:
          - ec2:DescribeTags
          - ecs:CreateCluster
          - ecs:DeregisterContainerInstance
          - ecs:DiscoverPollEndpoint
          - ecs:Poll
          - ecs:RegisterContainerInstance
          - ecs:StartTelemetrySession
          - ecs:UpdateContainerInstancesState
          - ecs:Submit*
          - ecr:GetAuthorizationToken
          - ecr:BatchCheckLayerAvailability
          - ecr:GetDownloadUrlForLayer
          - ecr:BatchGetImage
          - logs:CreateLogStream
          - logs:PutLogEvents
          Resource: "*"
      Roles:
      - Ref: ecsInstanceRole

  ecsInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
      - Ref: ecsInstanceRole

  ecsTaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: schTestEcsTaskExecRole
      AssumeRolePolicyDocument:
        Statement:
        - Effect: Allow
          Principal:
            Service: [ecs-tasks.amazonaws.com]
          Action: ['sts:AssumeRole']
      Path: /
      Policies:
        - PolicyName: AmazonECSTaskExecutionRolePolicy
          PolicyDocument:
            Statement:
            - Effect: Allow
              Action:
                # Allow the ECS Tasks to download images from ECR
                - 'ecr:GetAuthorizationToken'
                - 'ecr:BatchCheckLayerAvailability'
                - 'ecr:GetDownloadUrlForLayer'
                - 'ecr:BatchGetImage'

                # Allow the ECS tasks to upload logs to CloudWatch
                - 'logs:CreateLogStream'
                - 'logs:PutLogEvents'
              Resource: '*'

  MyCluster:
    Type: 'AWS::ECS::Cluster'
    Properties:
      ClusterName:
        Ref: "ClusterName"

  OnDemandConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: 
        Ref: "ECSAMI"
      SecurityGroups:
        - Ref: "InstanceSecurityGroup"
      IamInstanceProfile:
        Ref: "ecsInstanceProfile"
      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash
            echo ECS_CLUSTER=${ClusterName} >> /etc/ecs/ecs.config
      InstanceType:
        Ref: "InstanceType"
      BlockDeviceMappings:
      - DeviceName: "/dev/sdk"
        Ebs:
          VolumeSize: '50'
      - DeviceName: "/dev/sdc"
        VirtualName: ephemeral0

  OnDemandServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - Ref: "mySubnet1"
      LaunchConfigurationName:
        Ref: OnDemandConfig
      MinSize: '1'
      MaxSize: '3'

  SpotConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: 
        Ref: "ECSAMI"
      SecurityGroups:
        - Ref: "InstanceSecurityGroup"
      IamInstanceProfile:
        Ref: "ecsInstanceProfile"
      UserData:
        Fn::Base64:
          !Sub |
            #!/bin/bash
            echo ECS_CLUSTER=${ClusterName} >> /etc/ecs/ecs.config
      InstanceType:
        Ref: "InstanceType"
      SpotPrice: "0.05"

  SpotServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - Ref: "mySubnet1"
      LaunchConfigurationName:
        Ref: SpotConfig
      MinSize: '1'
      MaxSize: '3'

  OnDemandCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
        Name: OnDemandCapProvider
        AutoScalingGroupProvider:
            AutoScalingGroupArn:
              Ref: OnDemandServerGroup
            ManagedScaling:
                MaximumScalingStepSize: 10
                MinimumScalingStepSize: 1
                Status: ENABLED
                TargetCapacity: 100
        Tags:
            - Key: environment
              Value: test

  SpotCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
        Name: SpotCapProvider
        AutoScalingGroupProvider:
            AutoScalingGroupArn:
              Ref: SpotServerGroup
            ManagedScaling:
                MaximumScalingStepSize: 10
                MinimumScalingStepSize: 1
                Status: ENABLED
                TargetCapacity: 100
        Tags:
            - Key: environment
              Value: test

2. Run the CloudFormation template ecs-cp-infra.yaml [Check the deployment status in the CloudFormation Console]

aws cloudformation create-stack --stack-name cp-cap-provider-stack \
--template-body file://ecs-cp-infra.yaml \
--parameters ParameterKey=InstanceType,ParameterValue=t2.small \
ParameterKey=ECSAMI,ParameterValue=/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
ParameterKey=ClusterName,ParameterValue=SchTestCluster \
--capabilities CAPABILITY_NAMED_IAM --region us-east-2

The template does the following:

Creates a VPC with a public subnet
Creates an empty ECS cluster
Creates an On-Demand Auto Scaling group (ASG)
Creates a Spot ASG
Creates an On-Demand capacity provider
Creates a Spot capacity provider

3. Associate the Spot and On-Demand capacity providers with the ECS cluster and set a default capacity provider strategy on the ECS cluster

aws ecs put-cluster-capacity-providers \
--cluster SchTestCluster \
--capacity-providers SpotCapProvider OnDemandCapProvider \
--default-capacity-provider-strategy capacityProvider=SpotCapProvider,weight=1,base=2 capacityProvider=OnDemandCapProvider,weight=1 \
--region us-east-2

This sets the following default capacity provider strategy:

A base of 2 and a weight of 1 for the Spot capacity provider [Provider1]
A base of 0 and a weight of 1 for the On-Demand capacity provider [Provider2]

Note:

A base value of 2 ensures that the first 2 tasks are always started on Spot instances. A weight of 1 on both providers distributes the remaining tasks equally between the Spot and On-Demand capacity providers.
You can also use a custom strategy that reverses the roles: the On-Demand capacity provider [Provider1] with a base of 2 and a weight of 1, and the Spot capacity provider [Provider2] with a weight of 1.

4. [Optional: only required if the task definition does not exist] Save the JSON below in a file called demo-sleep-taskdef.json, then register the task definition.

{
    "family": "demo-sleep-taskdef",
    "containerDefinitions": [
        {
            "name": "sleep",
            "image": "amazonlinux:2",
            "memory": 20,
            "essential": true,
            "command": [
                "sh",
                "-c",
                "sleep infinity"] 
        }],
    "requiresCompatibilities": [
        "EC2"] 
}

aws ecs register-task-definition --cli-input-json file://demo-sleep-taskdef.json \
--region us-east-2

5. Create a service [it will be created with the Cluster’s Default Capacity Provider Strategy]

aws ecs create-service \
--cluster SchTestCluster \
--service-name SchTestService \
--task-definition demo-sleep-taskdef \
--desired-count 1 \
--region us-east-2

The default capacity provider strategy set in step 3 (Spot with a base of 2) uses Spot instances as steady state for your workloads, with On-Demand instances sharing the burst traffic. This option is more aggressive in terms of cost savings but has a higher risk profile. It is best suited for applications that can tolerate the downtime caused by Spot instance interruptions.

The custom capacity provider strategy described in the note under step 3 (On-Demand with a base of 2) uses On-Demand instances as steady state for your workloads, with Spot instances sharing the burst traffic. This option has smaller cost savings but also a lower risk profile. It is best suited for applications that have to run 24/7 and cannot afford any downtime.

ECS scheduled scaling

To use scheduled scaling, create scheduled actions, which tell Application Auto Scaling to perform scaling activities at specific times. When you create a scheduled action, you specify the scalable target, when the scaling activity should occur, and the minimum and maximum capacity. At the specified time, Application Auto Scaling scales based on the new capacity values.

Before you can create a scheduled action, you must register the scalable target. Use the register-scalable-target command to register a new scalable target. The following command registers the ECS service with Application Auto Scaling so that the number of tasks (the service's desired count) can scale between a minimum of 1 task and a maximum of 10 tasks.

aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/SchTestCluster/SchTestService \
--min-capacity 1 --max-capacity 10 \
--region us-east-2

[Note: Please update the date and times below as per your specific requirements]
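The schedule expressions in the commands below are written in UTC, so local start and end times have to be converted first. The following is a minimal sketch of that conversion, assuming US Eastern time with a fixed UTC-4 offset (daylight time, which applies on the August date used in this post). The helper names at_expression and daily_cron_expression are hypothetical and not part of any AWS tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset (US Eastern Daylight Time). Use UTC-5 for
# Eastern Standard Time, or a time zone database for automatic DST handling.
EDT = timezone(timedelta(hours=-4), "EDT")


def at_expression(local_time):
    """Build a one-time at(...) schedule expression in UTC."""
    utc_time = local_time.astimezone(timezone.utc)
    return "at(" + utc_time.strftime("%Y-%m-%dT%H:%M:%S") + ")"


def daily_cron_expression(hour, minute, tz):
    """Build a daily cron(...) schedule expression in UTC."""
    # Any date works here; only the converted hour and minute are used.
    local_time = datetime(2020, 8, 30, hour, minute, tzinfo=tz)
    utc_time = local_time.astimezone(timezone.utc)
    return f"cron({utc_time.minute} {utc_time.hour} * * ? *)"
```

For example, at_expression(datetime(2020, 8, 30, 15, 0, tzinfo=EDT)) yields at(2020-08-30T19:00:00), and daily_cron_expression(8, 0, EDT) yields cron(0 12 * * ? *), matching the values passed to --schedule in this post. A fixed offset keeps the sketch dependency-free, but it means a daily cron schedule shifts by one hour in local time when daylight saving time ends.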

To scale out one time to 10 tasks at 3:00 PM EDT (7:00 PM UTC)

aws application-autoscaling put-scheduled-action --service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/SchTestCluster/SchTestService \
--scheduled-action-name single-scaleout-action \
--schedule "at(2020-08-30T19:00:00)" \
--scalable-target-action MinCapacity=10,MaxCapacity=10 \
--region us-east-2

To scale in one time to 1 task at 4:00 PM EDT (8:00 PM UTC)

aws application-autoscaling put-scheduled-action --service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/SchTestCluster/SchTestService \
--scheduled-action-name single-scalein-action \
--schedule "at(2020-08-30T20:00:00)" \
--scalable-target-action MinCapacity=1,MaxCapacity=1 \
--region us-east-2

To scale out to 10 tasks every day at 8:00 AM EDT (12:00 PM UTC)

aws application-autoscaling put-scheduled-action --service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/SchTestCluster/SchTestService \
--scheduled-action-name cron-scaleout-action \
--schedule "cron(0 12 * * ? *)" \
--scalable-target-action MinCapacity=10,MaxCapacity=10 \
--region us-east-2

To scale in to 1 task every day at 6:00 PM EDT (10:00 PM UTC)

aws application-autoscaling put-scheduled-action --service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/SchTestCluster/SchTestService \
--scheduled-action-name cron-scalein-action \
--schedule "cron(0 22 * * ? *)" \
--scalable-target-action MinCapacity=1,MaxCapacity=1 \
--region us-east-2

At the date and time specified for --schedule, if the value specified for MaxCapacity is below the current capacity, Application Auto Scaling scales in to MaxCapacity. If the value specified for MinCapacity is above the current capacity, Application Auto Scaling scales out to MinCapacity.
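The rule above amounts to clamping the current capacity into the new [MinCapacity, MaxCapacity] range. A minimal sketch follows; the function name capacity_after_scheduled_action is hypothetical, and the function only models the resulting task count, not the service itself.

```python
def capacity_after_scheduled_action(current, min_capacity, max_capacity):
    """Model the task count after a scheduled action sets new bounds."""
    if min_capacity > max_capacity:
        raise ValueError("MinCapacity must not exceed MaxCapacity")
    if current < min_capacity:
        # Below the new minimum: scale out to MinCapacity.
        return min_capacity
    if current > max_capacity:
        # Above the new maximum: scale in to MaxCapacity.
        return max_capacity
    # Already inside the new range: the scheduled action changes nothing.
    return current
```

This is why the scale-out actions in this post set MinCapacity=10,MaxCapacity=10 and the scale-in actions set MinCapacity=1,MaxCapacity=1: pinning both bounds to the same value forces the service to exactly that task count, whatever it was running before.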

Conclusion

In this blog post, I have shown how to set up a scheduled scaling policy for an ECS service using Spot Capacity Provider as the primary provider to reduce cost. Consider using Reserved Instances in your On-Demand Capacity Provider [ASG] to further reduce your costs.

Reference Blogs/Documentation

https://docs.thinkwithwp.com/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html
https://thinkwithwp.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
https://docs.thinkwithwp.com/AmazonECS/latest/developerguide/scheduling_tasks.html
https://docs.thinkwithwp.com/AmazonECS/latest/developerguide/cluster-capacity-providers.html
