AWS Official Blog

How to Securely Run Operating System Scripts in a Cross-Account Environment

1        Introduction

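When large enterprises move to the cloud, they usually set up multiple accounts, because business units are organized differently and have different governance needs, and then design and implement a cloud landing zone on top of those accounts. Under the cloud shared responsibility model, enterprise IT operations staff are themselves responsible for operating the operating systems inside their cloud instances. The resulting challenge is how to build a secure, workable architecture that meets the day-to-day need to run large numbers of operating system scripts across accounts. This post therefore proposes a solution design, built on Amazon Web Services cloud-native services, for securely running operating system scripts across accounts.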

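Compared with conventional configuration management solutions based on tools such as Puppet or Ansible, this cloud-native approach on Amazon Web Services has three advantages. First, the execution module is based on the cloud-native service Amazon Systems Manager, which works out of the box and, combined with Quick Setup, greatly simplifies configuration and use. Second, cross-account scenarios do not rely on direct network connectivity, which removes the impact of overlapping VPC address ranges. Third, execution results are collected in S3, which makes it easy to connect services such as Amazon Athena and Amazon EventBridge to build complex analysis scenarios and workflows.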

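This solution is suited to:

1. Complex operating system and software configuration scans

2. Security compliance, system information analysis, and similar areas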

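2        Solution Design

Overall architecture:

In a multi-account, cross-Region environment, run system shell scripts at scale on instances managed by Amazon Systems Manager, and store and query the output centrally.

Key technical points:

1. Solution authorization: use Amazon CloudFormation to push the policy required to access the S3 bucket, and use Terraform to attach that policy in bulk to the instance profile roles of the managed EC2 instances. The bucket policy must grant access to the accounts that contain the EC2 instances and to the instance profile roles.

2. Cross-account execution authorization: an AutomationExecutionRole must be set up.

3. Script execution path: managed EC2 instances must be able to reach the SSM service, through an IGW (Public IP), NAT, or an endpoint.

4. Log storage path: managed EC2 instances must be able to reach the S3 bucket that stores the results, through an IGW (Public IP), NAT, or a Gateway endpoint. (Note: an S3 Gateway endpoint does not support cross-Region access, so in that scenario you need to create an S3 bucket in each Region and synchronize the data.)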

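2.1       Prerequisites

1. The Organization structure is configured, and the target accounts that contain the EC2 instances to be managed are assigned to the appropriate OUs, because SSM execution relies on OUs to specify targets.

2. The basic CloudFormation StackSet configuration for the Organization is complete. It is used to push the required Role, Policy, Command Document, and other configuration to the target accounts. See https://docs.thinkwithwp.com/AmazonCloudFormation/latest/UserGuide/stacksets-prereqs.html

3. All target EC2 instances in all target accounts are configured as Amazon Systems Manager managed instances (through Quick Setup or a regular Role configuration; see the Amazon Systems Manager documentation), and every managed instance is healthy, with a status of "Online".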
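
The "Online" check in prerequisite 3 can be scripted. The following is a minimal sketch, assuming the instance records have already been fetched (for example with `aws ssm describe-instance-information`) and carry the `InstanceId` and `PingStatus` fields of that API; the function name and the sample data are hypothetical.

```python
from typing import Dict, List


def find_unhealthy_instances(instance_info: List[Dict[str, str]]) -> List[str]:
    """Return the IDs of managed instances whose PingStatus is not 'Online'."""
    return [
        item["InstanceId"]
        for item in instance_info
        if item.get("PingStatus") != "Online"
    ]


# Hypothetical sample shaped like the InstanceInformationList field of
# the DescribeInstanceInformation response.
sample = [
    {"InstanceId": "i-0aaa1111bbbb2222c", "PingStatus": "Online"},
    {"InstanceId": "i-0ddd3333eeee4444f", "PingStatus": "ConnectionLost"},
]

print(find_unhealthy_instances(sample))
```

An empty result means every instance in the input reported "Online"; anything else should be investigated before running an Automation against the fleet.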

2.2       Configuring Amazon Systems Manager Cross-Account Automation Execution

1. In the Systems Manager Administrator Account, deploy the AutomationAdministrationRole with CloudFormation: Amazon-SystemsManager-AutomationAdministrationRole.json

{
    "Description": "Configure the Amazon-SystemsManager-AutomationAdministrationRole to enable use of Amazon Systems Manager Cross Account/Region Automation execution.",
    "Resources": {
        "MasterAccountRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "RoleName": "Amazon-SystemsManager-AutomationAdministrationRole",
                "AssumeRolePolicyDocument": {
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": "ssm.amazonaws.com"
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        }
                    ]
                },
                "Path": "/",
                "Policies": [
                    {
                        "PolicyName": "AssumeRole-AmazonSystemsManagerAutomationExecutionRole",
                        "PolicyDocument": {
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "sts:AssumeRole"
                                    ],
                                    "Resource": {
                                        "Fn::Sub": "arn:${AWS::Partition}:iam::*:role/Amazon-SystemsManager-AutomationExecutionRole"
                                    }
                                },
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "organizations:ListAccountsForParent"
                                    ],
                                    "Resource": [
                                        "*"
                                    ]
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}

2. In all target accounts, deploy the AutomationExecutionRole with a CloudFormation StackSet: Amazon-SystemsManager-AutomationExecutionRole-CN.json

{
    "Parameters": {
        "MasterAccountId": {
            "Type": "String",
            "Description": "Amazon Account ID of the primary account (the account from which Amazon Systems Manager Automation will be initiated).",
            "MaxLength": 12,
            "MinLength": 12
        }
    },
    "Resources": {
        "AmazonSystemsManagerAutomationExecutionRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "RoleName": "Amazon-SystemsManager-AutomationExecutionRole",
                "AssumeRolePolicyDocument": {
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "AWS": {
                                    "Ref": "MasterAccountId"
                                }
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        },
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": "ssm.amazonaws.com"
                            },
                            "Action": [
                                "sts:AssumeRole"
                            ]
                        }
                    ]
                },
                "ManagedPolicyArns": [
                    "arn:aws-cn:iam::aws:policy/service-role/AmazonSSMAutomationRole",
                    "arn:aws-cn:iam::aws:policy/AmazonEC2ReadOnlyAccess"
                ],
                "Path": "/",
                "Policies": [
                    {
                        "PolicyName": "ExecutionPolicy",
                        "PolicyDocument": {
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "resource-groups:ListGroupResources",
                                        "tag:GetResources"
                                    ],
                                    "Resource": "*"
                                },
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "iam:PassRole"
                                    ],
                                    "Resource": {
                                        "Fn::Sub": "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/Amazon-SystemsManager-AutomationExecutionRole"
                                    }
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}

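For the detailed steps, see: https://docs.thinkwithwp.com/systems-manager/latest/userguide/systems-manager-automation-multiple-accounts-and-regions.html

When deploying with a CloudFormation StackSet, note the following:

  • The StackSet that creates the Role resources only needs to run in one Region, because IAM is a global service rather than a regional service.
  • The CloudFormation template for Amazon-SystemsManager-AutomationExecutionRole cannot be used as-is in the China Regions: the partition in the policy ARNs under ManagedPolicyArns must be changed to "aws-cn".
  • AmazonSSMAutomationRole may lack some permissions that an Automation Document needs. Supplement and adjust it according to the Automation Documents you intend to run.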
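
The partition change described above is mechanical, so it can be applied programmatically when the same template has to serve several partitions. The following is a minimal sketch; the helper names `partition_for_region` and `localize_arn` are hypothetical, and the region-prefix rule covers only the `aws`, `aws-cn`, and `aws-us-gov` partitions.

```python
def partition_for_region(region: str) -> str:
    """Map a Region name to its ARN partition (aws, aws-cn, aws-us-gov only)."""
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def localize_arn(arn: str, region: str) -> str:
    """Rewrite the partition field of an ARN to match the given Region."""
    parts = arn.split(":", 2)  # ["arn", "<partition>", "<rest of the ARN>"]
    if len(parts) != 3 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return ":".join(["arn", partition_for_region(region), parts[2]])


print(localize_arn(
    "arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole",
    "cn-north-1",
))
```

Inside a CloudFormation template, the same effect is usually achieved by building the ARN with `Fn::Sub` and the `${AWS::Partition}` pseudo parameter instead of hard-coding the partition.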

Troubleshooting:

  • Use CloudTrail to check for permission problems with the relevant Roles

2.3       Deploying the Automation Document That Runs RunShellScript Across Accounts

1. Option 1: In the Systems Manager Administrator account, deploy RunShellScript-AutomationDoc with Amazon CloudFormation: RunShellScript-AutomationDoc.yaml

Description: >
  SSM Automation Document run a custom SSM Command Document against a fleet of
  target instances.
Parameters:
  AutomationDocumentName:
    Type: String
    Description: Name of created SSM Automation Document
    Default: RunShellScriptAutomation
Resources:
  AutomationDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref AutomationDocumentName
      DocumentType: Automation
      Content:
        description: Run custom Command Document
        schemaVersion: '0.3'
        assumeRole: '{{AutomationAssumeRole}}'
        parameters:
          AutomationAssumeRole:
            type: String
            default: ''
            description: >-
              (Optional) The ARN of the role that allows Automation to perform
              the actions on your behalf.
          InstanceId:
            type: StringList
            description: "(Required) EC2 Instance(s) to run command"
          commands:
            type: StringList
            description: "(Required) Specify a shell script or a command to run."
            minItems: 1
            displayType: textarea
          workingDirectory:
            type: String
            default: ""
            description: "(Optional) The path to the working directory on your instance."
            maxChars: 4096
          executionTimeout:
            type: String
            default: "3600"
            description: "(Optional) The time in seconds for a command to complete before it is considered to have failed. Default is 3600 (1 hour). Maximum is 172800 (48 hours)."
            allowedPattern: "([1-9][0-9]{0,4})|(1[0-6][0-9]{4})|(17[0-1][0-9]{3})|(172[0-7][0-9]{2})|(172800)"
          OutputS3BucketName:
            type: String
            description: Name of Output S3 Bucket
            default: ''
          OutputS3KeyPrefix:
            type: String
            description: Output S3 Key Prefix
            default: ''
        mainSteps:
          - name: RunCommand
            action: 'aws:runCommand'
            inputs:
              DocumentName: AWS-RunShellScript
              InstanceIds:
                - '{{InstanceId}}'
              Parameters:
                commands: '{{commands}}'
                workingDirectory: '{{workingDirectory}}'
                executionTimeout: '{{executionTimeout}}'
              OutputS3BucketName: '{{OutputS3BucketName}}'
              OutputS3KeyPrefix: '{{OutputS3KeyPrefix}}'

2. Option 2: In the Systems Manager Administrator account, use CloudFormation to deploy a Command Document that wraps the shell commands. Reference template: CommandDocument-Test2.yml

Description: |
  Test RunCommand and Output to Shared S3 Bucket
Parameters:
  CommandDocumentName:
    Type: String
    Description: Name of created SSM Command Document
    Default: UnnamedCommandDocument
Resources:
  CommandDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref CommandDocumentName
      DocumentType: Command
      Content:
        schemaVersion: '2.2'
        description: Test RunCommand and Output to Shared S3 Bucket
        mainSteps:
          - precondition:
              StringEquals:
                - platformType
                - Linux
            action: 'aws:runShellScript'
            name: runShellScript
            inputs:
              runCommand:
                - 'TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")'
                - 'curl http://169.254.169.254/latest/meta-data/instance-id -H "X-aws-ec2-metadata-token: $TOKEN"'

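Additional notes:

  • If the target instance runs Windows, call the AWS-RunPowerShellScript Document instead

Troubleshooting:

  • To diagnose Document execution failures, check the following SSM Agent log files on the target instance
    /var/log/amazon/ssm/errors.log
    /var/log/amazon/ssm/amazon-ssm-agent.log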

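2.4       Configuring Network Connectivity from Target Instances to Amazon Systems Manager and Amazon S3

Choose the appropriate option below according to the network in which the instance resides.

Instance location              | Assign Public IP/EIP | Configure NAT Gateway | Configure Systems Manager/S3 endpoints
Instances in a public subnet   | Optional             | Optional              | Preferred
Instances in a private subnet  | N/A                  | Optional              | Preferred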

References for configuring Amazon Systems Manager endpoints:

https://thinkwithwp.com/premiumsupport/knowledge-center/ec2-systems-manager-vpc-endpoints/

https://thinkwithwp.com/cn/blogs/mt/how-to-patch-windows-ec2-instances-in-private-subnets-using-aws-systems-manager/

Reference for configuring S3 endpoints:

https://docs.thinkwithwp.com/AmazonS3/latest/userguide/privatelink-interface-endpoints.html

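Additional notes:

  • An S3 Gateway endpoint does not support cross-Region access, so in that scenario you need to create a separate S3 bucket in each Region and synchronize the data
  • Pay attention to the security group configuration of the endpoints

Troubleshooting:

  • Enable VPC Flow Logs to help diagnose problems when EC2 instances access Systems Manager and S3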

2.5       Creating the S3 Bucket That Stores Command Output

Bucket Policy template:
{
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws-cn:iam::<account-id>:root"
                ]
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "<s3-bucket-arn>/*"
        }
    ]
}

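Reference: https://thinkwithwp.com/premiumsupport/knowledge-center/ssm-output-s3-other-account/

Additional notes:

  • Method 1: the general approach for granting organization-wide access to Amazon S3: https://thinkwithwp.com/blogs/security/control-access-to-aws-resources-by-using-the-aws-organization-of-iam-principals/
  • Method 2: if the approach above does not work, possibly because the SSM Agent code that calls the S3 API does not pass the PrincipalOrgId information, list the accounts to collect from as an alternative. Terraform data sources can fetch the organization's account list for this purpose.
  • Script execution results can also be collected with the Systems Manager Custom Inventory feature.
  • Troubleshooting: enable CloudTrail Data Events to help diagnose permission problems when SSM Agent accesses the S3 bucket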
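
For Method 2, the bucket policy has to enumerate one principal per account. The following is a minimal sketch of generating such a policy from an account list, following the shape of the Bucket Policy template above; the function name, bucket ARN, and account IDs are hypothetical, and in practice the account list would come from AWS Organizations (for example through the Terraform data source mentioned above).

```python
import json
from typing import Dict, List


def build_bucket_policy(bucket_arn: str, account_ids: List[str],
                        partition: str = "aws-cn") -> Dict:
    """Build a bucket policy that lets each listed account write command output."""
    principals = [
        f"arn:{partition}:iam::{account_id}:root"
        for account_id in sorted(set(account_ids))  # de-duplicate, stable order
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": ["s3:GetObject", "s3:PutObject", "s3:PutObjectAcl"],
                "Resource": f"{bucket_arn}/*",
            }
        ],
    }


# Hypothetical bucket and account IDs, for illustration only.
policy = build_bucket_policy(
    "arn:aws-cn:s3:::example-ssm-output",
    ["222222222222", "111111111111", "111111111111"],
)
print(json.dumps(policy, indent=2))
```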

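2.6       Configuring the IAM Policy for Cross-Account Access to the S3 Bucket

To configure cross-account access to the Amazon S3 bucket, follow the steps in this link:

https://thinkwithwp.com/premiumsupport/knowledge-center/ssm-output-s3-other-account/

Note that in addition to the S3 bucket policy, the EC2 instances also need the corresponding authorization.

In all target accounts, deploy the required SSMRunCommandOutputPolicy with a CloudFormation StackSet: SSMRunCommandOutputPolicy.template. Note that the template is not optimized: the value of the TargetS3BucketArn parameter must end with "/*".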

{
  "Description": "Policy attached to SSM Managed Instance's profile, which allow SSM RunCommand output to organization central S3 bucket",
  "Parameters": {
    "TargetS3BucketArn": {
      "Type": "String",
      "Description": "SSM RunCommand Output S3 Bucket ARN"
    }
  },
  "Resources": {
    "SSMRunCommandOutputPolicy": {
      "Type": "AWS::IAM::ManagedPolicy",
      "Properties": {
        "ManagedPolicyName": "SSMRunCommandOutputPolicy",
        "PolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:PutObjectAcl"
              ],
              "Resource": { "Ref": "TargetS3BucketArn" }
            }
          ]
        }
      }
    }
  }
}
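
Because the template passes TargetS3BucketArn straight into the policy's Resource field, a value without the trailing "/*" would produce a policy that matches no objects. The following is a minimal sketch of normalizing the value before handing it to the StackSet; the helper name `as_object_arn` and the bucket name are hypothetical.

```python
def as_object_arn(bucket_arn: str) -> str:
    """Return the ARN that matches every object in the bucket (ends with '/*')."""
    arn = bucket_arn.rstrip("/")  # tolerate a trailing slash
    if arn.endswith("/*"):
        return arn
    return arn + "/*"


print(as_object_arn("arn:aws-cn:s3:::example-ssm-output"))
```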

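Use Terraform data sources to attach SSMRunCommandOutputPolicy dynamically and in bulk to the instance profile roles of the target EC2 instances: batchupdaterole.zip. Note that Workspaces are needed to isolate the tfstate of different accounts.

Troubleshooting:

  • Enable CloudTrail Data Events to help diagnose permission problems when SSM Agent accesses the S3 bucket.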

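2.7       Writing a Custom Command Document with a Fixed Script (Optional)

Scripts that are reused heavily can be packaged as a separate custom Command Document so that they are easy to invoke. See the template above: CommandDocument-Test2.yml

Deployment options:

  • Method 1 (recommended): deploy the Command Document with CloudFormation in the SSM Administrator Account, then share it with the other accounts (note that sharing only works within a Region)
  • Method 2: deploy the custom Command Document to all target accounts with a CloudFormation StackSet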
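2.8       Deploying the Runbook (Automation Document)

A Runbook (Automation Document) that runs across accounts and Regions must be deployed to, or shared with, every target account. The templates below can be deployed independently in each account, or pushed centrally to all accounts with a CloudFormation StackSet:

  • Method 1 (recommended): a reference Runbook template that calls the managed Command Document AWS-RunShellScript: RunShellScript-AutomationDoc.yaml. The script content can be specified ad hoc when the Automation is executed.
  • Method 2: a reference Runbook template that calls a custom Command Document and runs the specified script: yaml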
Description: >
  SSM Automation Document run a custom SSM Command Document against a fleet of
  target instances.
Parameters:
  AutomationDocumentName:
    Type: String
    Description: Name of created SSM Automation Document
    Default: MyAutomation
  CommandDocumentName:
    Type: String
    Description: Name of SSM Command Document to run
    Default: MyCommand
  OutputS3BucketName:
    Type: String
    Description: Name of Output S3 Bucket
    Default: ''
  OutputS3KeyPrefix:
    Type: String
    Description: Output S3 Key Prefix
    Default: ''
Conditions:
  # True when no bucket name was supplied
  NoS3Name:
    !Equals [!Ref OutputS3BucketName, '']
  OutputToS3:
    !Not [Condition: NoS3Name]
  # True when no key prefix was supplied
  NoS3Prefix:
    !Equals [!Ref OutputS3KeyPrefix, '']
  AddS3Prefix:
    !Not [Condition: NoS3Prefix]
Resources:
  AutomationDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref AutomationDocumentName
      DocumentType: Automation
      Content:
        description: Run custom Command Document
        schemaVersion: '0.3'
        assumeRole: '{{AutomationAssumeRole}}'
        parameters:
          InstanceId:
            type: StringList
            description: "(Required) EC2 Instance(s) to run command"
          AutomationAssumeRole:
            type: String
            default: ''
            description: >-
              (Optional) The ARN of the role that allows Automation to perform
              the actions on your behalf.
        mainSteps:
          - name: RunCommand
            action: 'aws:runCommand'
            inputs:
              DocumentName: !Ref CommandDocumentName
              InstanceIds:
                - '{{InstanceId}}'
              OutputS3BucketName:
                !If
                - OutputToS3
                - !Ref OutputS3BucketName
                - !Ref "AWS::NoValue"
              OutputS3KeyPrefix:
                !If
                - AddS3Prefix
                - !Ref OutputS3KeyPrefix
                - !Ref "AWS::NoValue"

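2.9       Creating the Athena Database and Table

We recommend using the Athena wizard, with the S3 bucket created earlier as the data source, to create the Athena database and table definitions. For the detailed steps, see: https://docs.thinkwithwp.com/zh_cn/zh_cn/athena/latest/ug/tables-location-format.html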

3        Usage

Once the resources are configured, using the solution takes only two steps:

1. Run the deployed Runbook from the console or the aws cli:

Example command:

aws ssm start-automation-execution \
  --document-name "RunShellAutomation" \
  --document-version "\$DEFAULT" \
  --parameters '{"AutomationAssumeRole":[""],"commands":["<shell command to run>"]}' \
  --target-parameter-name InstanceId \
  --targets '[{"Key":"AWS::EC2::Instance","Values":["*"]}]' \
  --max-errors "100%" \
  --max-concurrency "100%" \
  --target-locations '[{"Accounts":["<Target account Id or OU>"],"Regions":["cn-north-1","cn-northwest-1"],"TargetLocationMaxErrors":"100%","TargetLocationMaxConcurrency":"100%"},{"Accounts":["<Target account Id or OU>"],"Regions":["cn-north-1","cn-northwest-1"],"TargetLocationMaxErrors":"100%","TargetLocationMaxConcurrency":"100%"}]' \
  --region cn-northwest-1
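
The JSON arguments of this command are easy to get wrong when quoted by hand. The following is a minimal sketch that assembles the same argument list in Python and prints a shell-safe command line; the function name and its parameters are hypothetical, and it only builds the command, it does not call AWS.

```python
import json
import shlex
from typing import List


def build_start_automation_command(document_name: str, commands: List[str],
                                   accounts: List[str], regions: List[str],
                                   home_region: str) -> List[str]:
    """Assemble the argv for `aws ssm start-automation-execution`."""
    parameters = {"AutomationAssumeRole": [""], "commands": commands}
    targets = [{"Key": "AWS::EC2::Instance", "Values": ["*"]}]
    target_locations = [{
        "Accounts": accounts,  # account IDs or OU IDs
        "Regions": regions,
        "TargetLocationMaxErrors": "100%",
        "TargetLocationMaxConcurrency": "100%",
    }]
    return [
        "aws", "ssm", "start-automation-execution",
        "--document-name", document_name,
        "--document-version", "$DEFAULT",
        "--parameters", json.dumps(parameters),
        "--target-parameter-name", "InstanceId",
        "--targets", json.dumps(targets),
        "--max-errors", "100%",
        "--max-concurrency", "100%",
        "--target-locations", json.dumps(target_locations),
        "--region", home_region,
    ]


# Hypothetical account ID and command, for illustration only.
argv = build_start_automation_command(
    "RunShellAutomation", ["uname -a"],
    ["111111111111"], ["cn-north-1", "cn-northwest-1"], "cn-northwest-1",
)
print(shlex.join(argv))  # quoted so it can be pasted into a shell
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting altogether; `shlex.join` is used here only to show the equivalent command line.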

2. Query and export a summary of the script execution results with Athena. Reference query template:

SELECT split("$path",'/')[4] AS outputprefix, split("$path",'/')[5] AS commandid, split("$path",'/')[6] AS instanceid, split("$path",'/')[9] AS device, output FROM "ssm"."runcommandoutput" WHERE split("$path",'/')[4]='runshelloutput' AND split("$path",'/')[9]='stdout'
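
The query relies on the position of each segment in the object's S3 path. The following is a minimal Python sketch of the same extraction, useful for checking the indexes against a real object key. It assumes the path layout implied by the query (bucket, output prefix, command ID, instance ID, two further segments, then the stream name); the sample path and its step-name segment are hypothetical.

```python
from typing import Dict


def parse_output_path(path: str) -> Dict[str, str]:
    """Split an S3 object path the same way the Athena query does."""
    parts = path.split("/")
    if len(parts) < 9:
        raise ValueError(f"unexpected path layout: {path!r}")
    # Athena arrays are 1-based, Python lists are 0-based:
    # split("$path",'/')[4] in the query corresponds to parts[3] here.
    return {
        "outputprefix": parts[3],
        "commandid": parts[4],
        "instanceid": parts[5],
        "device": parts[8],
    }


# Hypothetical object path, for illustration only.
sample_path = (
    "s3://example-ssm-output/runshelloutput/"
    "11111111-2222-3333-4444-555555555555/i-0aaa1111bbbb2222c/"
    "awsrunShellScript/RunCommand/stdout"
)
print(parse_output_path(sample_path))
```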

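4        Extensions

Building on the architecture above, the solution can be extended further with other services and features, including but not limited to:

  • Using Amazon Systems Manager Maintenance Windows and State Manager to run the scripts automatically on a schedule
  • Using Step Functions to implement complex distribution and alerting logic for the Athena query results
  • Using Custom Inventory to turn script execution results into a Systems Manager Compliance Report, forming a complete compliance solution for operating system configuration scanning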

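5        Conclusion

We have now built a solution framework for running operating system shell scripts across accounts at scale, with the execution results saved automatically and available for later queries. To recap the advantages of this solution:

1. It uses Amazon Web Services cloud-native services to deploy and configure the environment quickly.

2. It supports cross-account, large-scale parallel execution, which meets the needs of complex enterprise cloud scenarios.

3. It integrates the open-source IaC framework Terraform and supports a multi-layer nested script execution framework, so it can be extended flexibly.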

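About the Authors

Huo Yanfeng (霍延峰)

He has more than 18 years of IT experience, in roles including ERP front-end and back-end developer, OA system and infrastructure operations, enterprise software architect, database administrator and business intelligence engineer, virtualization and private cloud engineer, and public cloud solutions architect. He is passionate about using new technology to solve customers' real problems and has distinctive insights into cloud architecture, application modernization, DevOps, data science, and other fields. Since 2016 he has been helping ECCOM customers that use Amazon, has accompanied dozens of customers on their Amazon migration journeys, and has taken part in ECCOM's own transformation from a traditional network systems integrator into a modern MSP for Amazon.

Chen Qi (陈琪)

Partner Solutions Architect at Amazon Web Services, with more than ten years of IT experience spanning product development, solution consulting, and product management. He has long worked on enterprise IT architecture design and solution consulting, has extensive hands-on experience with enterprise digital transformation, and has deep knowledge and practice in security, data analytics, and networking in particular. He is currently responsible for solution design and capability building for Amazon Web Services partners.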