1 Introduction
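When large enterprises move to the cloud, their business units differ in how they are organized and governed, so the enterprise typically runs its workloads across multiple accounts and designs and implements a cloud landing zone on top of them. Under the shared responsibility model for cloud security, the enterprise's IT operations staff remain responsible for operating the operating systems inside their cloud instances. The resulting challenge is how to build a secure, workable architecture that meets the day-to-day need to run large numbers of operating-system scripts across accounts. This article therefore proposes a solution design, built on Amazon Web Services cloud-native services, for securely running operating-system scripts across accounts.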
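Compared with conventional configuration-management solutions such as Puppet or Ansible, this cloud-native solution on Amazon Web Services offers three advantages. First, the execution module is built on the cloud-native service Amazon Systems Manager, which works out of the box and, combined with Quick Setup, greatly simplifies configuration and use. Second, cross-account scenarios do not depend on direct network connectivity, which removes the impact of overlapping VPC address ranges. Third, execution results are collected and stored in S3, which makes it easy to integrate with services such as Amazon Athena and Amazon EventBridge to implement complex analysis scenarios and workflows.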
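Applicable scenarios:
1. Complex operating-system and software configuration scans
2. Security compliance, system information analysis, and similar areas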
2 Solution design
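Overall architecture:
In a multi-account, multi-Region environment, run system shell scripts at scale on instances managed by Amazon Systems Manager, and store and query the output centrally.
Key technical points:
1. Solution authorization: use Amazon CloudFormation to push the policy required to access the S3 bucket, and use Terraform to attach that policy in bulk to the instance profile roles of the managed EC2 instances. The bucket policy must grant access to the accounts that host the EC2 instances and to their instance profile roles.
2. Cross-account execution authorization: an AutomationExecutionRole must be set up.
3. Script execution path: managed EC2 instances must be able to reach the SSM service through an IGW (public IP), NAT, or an endpoint.
4. Log storage path: managed EC2 instances must be able to reach the S3 bucket that stores the results through an IGW (public IP), NAT, or a Gateway endpoint. (Note: an S3 Gateway endpoint does not support cross-Region access, so in that scenario an S3 bucket must be created in each Region and the data synchronized.)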
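2.1 Prerequisites
1. The Organization structure is configured, and the target accounts that host the EC2 instances to be managed are placed in the appropriate OUs, because SSM execution relies on OUs to specify targets.
2. The basic CloudFormation StackSet configuration for the Organization is complete, so that the required Role, Policy, Command Document, and other resources can be pushed to the target accounts. See https://docs.thinkwithwp.com/AmazonCloudFormation/latest/UserGuide/stacksets-prereqs.html
3. All target EC2 instances in all target accounts are set up as Amazon Systems Manager managed instances (through Quick Setup or regular Role configuration; see the Amazon Systems Manager documentation), and every managed instance is healthy with a status of "Online".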
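The "Online" check in prerequisite 3 can be scripted. The following is a minimal sketch, assuming the instance records have the shape of the `InstanceInformationList` returned by `aws ssm describe-instance-information` (fields `InstanceId` and `PingStatus`). The function name and the sample IDs are hypothetical.

```python
from typing import Iterable


def find_unhealthy_instances(instance_info: Iterable[dict]) -> list[str]:
    """Return the IDs of managed instances whose PingStatus is not 'Online'."""
    return [
        item["InstanceId"]
        for item in instance_info
        if item.get("PingStatus") != "Online"
    ]


# Sample records shaped like InstanceInformationList entries; IDs are made up.
sample = [
    {"InstanceId": "i-0aaa", "PingStatus": "Online"},
    {"InstanceId": "i-0bbb", "PingStatus": "ConnectionLost"},
    {"InstanceId": "i-0ccc", "PingStatus": "Inactive"},
]

print(find_unhealthy_instances(sample))
```

An empty result suggests every managed instance in that account and Region is ready to receive commands.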
2.2 Configuring Amazon Systems Manager cross-account Automation execution
1. In the Systems Manager administrator account, deploy the AutomationAdministrationRole through CloudFormation: Amazon-SystemsManager-AutomationAdministrationRole.json
{
"Description": "Configure the Amazon-SystemsManager-AutomationAdministrationRole to enable use of Amazon Systems Manager Cross Account/Region Automation execution.",
"Resources": {
"MasterAccountRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"RoleName": "Amazon-SystemsManager-AutomationAdministrationRole",
"AssumeRolePolicyDocument": {
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ssm.amazonaws.com"
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"Path": "/",
"Policies": [
{
"PolicyName": "AssumeRole-AmazonSystemsManagerAutomationExecutionRole",
"PolicyDocument": {
"Statement": [
{
"Effect": "Allow",
"Action": [
"sts:AssumeRole"
],
"Resource": {
"Fn::Sub": "arn:${AWS::Partition}:iam::*:role/Amazon-SystemsManager-AutomationExecutionRole"
}
},
{
"Effect": "Allow",
"Action": [
"organizations:ListAccountsForParent"
],
"Resource": [
"*"
]
}
]
}
}
]
}
}
}
}
2. In all target accounts, deploy the AutomationExecutionRole through CloudFormation StackSet: Amazon-SystemsManager-AutomationExecutionRole-CN.json
{
"Parameters": {
"MasterAccountId": {
"Type": "String",
"Description": "Amazon Account ID of the primary account (the account from which Amazon Systems Manager Automation will be initiated).",
"MaxLength": 12,
"MinLength": 12
}
},
"Resources": {
"AmazonSystemsManagerAutomationExecutionRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"RoleName": "Amazon-SystemsManager-AutomationExecutionRole",
"AssumeRolePolicyDocument": {
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": {
"Ref": "MasterAccountId"
}
},
"Action": [
"sts:AssumeRole"
]
},
{
"Effect": "Allow",
"Principal": {
"Service": "ssm.amazonaws.com"
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"ManagedPolicyArns": [
"arn:aws-cn:iam::aws:policy/service-role/AmazonSSMAutomationRole",
"arn:aws-cn:iam::aws:policy/AmazonEC2ReadOnlyAccess"
],
"Path": "/",
"Policies": [
{
"PolicyName": "ExecutionPolicy",
"PolicyDocument": {
"Statement": [
{
"Effect": "Allow",
"Action": [
"resource-groups:ListGroupResources",
"tag:GetResources"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": {
"Fn::Sub": "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/Amazon-SystemsManager-AutomationExecutionRole"
}
}
]
}
}
]
}
}
}
}
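For detailed steps, see: https://docs.thinkwithwp.com/systems-manager/latest/userguide/systems-manager-automation-multiple-accounts-and-regions.html
Keep the following in mind when deploying with CloudFormation StackSet:
- A StackSet that creates Role resources only needs to run in one Region, because IAM is a global service rather than a Regional one.
- The CloudFormation template for Amazon-SystemsManager-AutomationExecutionRole cannot be used as-is in the China Regions; the partition in the policy ARNs under ManagedPolicyArns must be changed to "aws-cn".
- AmazonSSMAutomationRole may lack some permissions that a given Automation Document requires; supplement and adjust it for the Automation Documents you need to run.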
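The partition note above can be illustrated with a small helper that derives the managed policy ARN from the Region. This is only a sketch: the function names are hypothetical, and the prefix rule covers just the partitions named in the code.

```python
def partition_for_region(region: str) -> str:
    """Map a Region name to its ARN partition (China, GovCloud, or standard)."""
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def managed_policy_arn(region: str, policy_path: str) -> str:
    """Build the ARN of an AWS managed policy for the partition of `region`."""
    return f"arn:{partition_for_region(region)}:iam::aws:policy/{policy_path}"


print(managed_policy_arn("cn-north-1", "service-role/AmazonSSMAutomationRole"))
print(managed_policy_arn("cn-northwest-1", "AmazonEC2ReadOnlyAccess"))
```

Inside a template, CloudFormation's `AWS::Partition` pseudo parameter used with `Fn::Sub` gives the same result without hard-coding the partition.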
Troubleshooting:
- Use CloudTrail to investigate permission problems with the relevant roles.
2.3 Deploying the Automation Document that runs RunShellScript across accounts
1. Option 1: in the Systems Manager administrator account, deploy RunShellScript-AutomationDoc through Amazon CloudFormation: RunShellScript-AutomationDoc.yaml
Description: >
  SSM Automation Document that runs a custom SSM Command Document against a fleet of
  target instances.
Parameters:
  AutomationDocumentName:
    Type: String
    Description: Name of created SSM Automation Document
    Default: RunShellScriptAutomation
Resources:
  AutomationDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref AutomationDocumentName
      DocumentType: Automation
      Content:
        description: Run custom Command Document
        schemaVersion: '0.3'
        assumeRole: '{{AutomationAssumeRole}}'
        parameters:
          AutomationAssumeRole:
            type: String
            default: ''
            description: >-
              (Optional) The ARN of the role that allows Automation to perform
              the actions on your behalf.
          InstanceId:
            type: StringList
            description: "(Required) EC2 Instance(s) to run command"
          commands:
            type: StringList
            description: "(Required) Specify a shell script or a command to run."
            minItems: 1
            displayType: textarea
          workingDirectory:
            type: String
            default: ""
            description: "(Optional) The path to the working directory on your instance."
            maxChars: 4096
          executionTimeout:
            type: String
            default: "3600"
            description: "(Optional) The time in seconds for a command to complete before it is considered to have failed. Default is 3600 (1 hour). Maximum is 172800 (48 hours)."
            allowedPattern: "([1-9][0-9]{0,4})|(1[0-6][0-9]{4})|(17[0-1][0-9]{3})|(172[0-7][0-9]{2})|(172800)"
          OutputS3BucketName:
            type: String
            description: Name of Output S3 Bucket
            default: ''
          OutputS3KeyPrefix:
            type: String
            description: Output S3 Key Prefix
            default: ''
        mainSteps:
          - name: RunCommand
            action: 'aws:runCommand'
            inputs:
              DocumentName: AWS-RunShellScript
              InstanceIds:
                - '{{InstanceId}}'
              # Parameters are passed through to the AWS-RunShellScript document
              Parameters:
                commands: '{{commands}}'
                workingDirectory: '{{workingDirectory}}'
                executionTimeout: '{{executionTimeout}}'
              # S3 output settings are inputs of aws:runCommand itself
              OutputS3BucketName: '{{OutputS3BucketName}}'
              OutputS3KeyPrefix: '{{OutputS3KeyPrefix}}'
2. Option 2: in the Systems Manager administrator account, deploy through CloudFormation a Command Document that wraps the shell commands. Reference template: CommandDocument-Test2.yml
Description: |
  Test RunCommand and Output to Shared S3 Bucket
Parameters:
  CommandDocumentName:
    Type: String
    Description: Name of created SSM Command Document
    Default: UnnamedCommandDocument
Resources:
  CommandDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref CommandDocumentName
      DocumentType: Command
      Content:
        schemaVersion: '2.2'
        description: Test RunCommand and Output to Shared S3 Bucket
        mainSteps:
          # Run only on Linux targets
          - precondition:
              StringEquals:
                - platformType
                - Linux
            action: 'aws:runShellScript'
            name: runShellScript
            inputs:
              runCommand:
                # Request an IMDSv2 session token, then print the instance ID
                - 'TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")'
                - 'curl http://169.254.169.254/latest/meta-data/instance-id -H "X-aws-ec2-metadata-token: $TOKEN"'
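Additional notes:
- If the target instances run Windows, call the AWS-RunPowerShellScript document instead.
Troubleshooting:
- To troubleshoot Document execution failures, check the following SSM Agent log files on the target instance: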
/var/log/amazon/ssm/errors.log
/var/log/amazon/ssm/amazon-ssm-agent.log
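2.4 Configuring network connectivity from target instances to Amazon Systems Manager and Amazon S3
Depending on the network the instance sits in, choose a suitable option from the following: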
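| | Assign a public IP or EIP | Configure a NAT Gateway or NAT instance | Configure endpoints for Systems Manager and S3 |
| --- | --- | --- | --- |
| Instances in a public subnet | Optional | Optional | Preferred |
| Instances in a private subnet | N/A | Optional | Preferred |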
References for configuring Amazon Systems Manager endpoints:
https://thinkwithwp.com/premiumsupport/knowledge-center/ec2-systems-manager-vpc-endpoints/
https://thinkwithwp.com/cn/blogs/mt/how-to-patch-windows-ec2-instances-in-private-subnets-using-aws-systems-manager/
Reference for configuring S3 endpoints:
https://docs.thinkwithwp.com/AmazonS3/latest/userguide/privatelink-interface-endpoints.html
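Additional notes:
- An S3 Gateway endpoint does not support cross-Region access, so in that scenario create a separate S3 bucket in each Region and synchronize the data.
- Pay attention to the endpoint's security group configuration.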
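With one bucket per Region, each execution has to write to the bucket in the instance's own Region. The sketch below shows one way to pick it. The function name, the mapping, and the bucket names are hypothetical and not part of the original solution.

```python
def output_bucket_for(region: str, buckets_by_region: dict[str, str]) -> str:
    """Return the output bucket that lives in the same Region as the instance."""
    try:
        return buckets_by_region[region]
    except KeyError:
        # Fail loudly instead of silently writing across Regions.
        raise LookupError(f"no output bucket configured for Region {region!r}") from None


# Hypothetical per-Region buckets; the names are placeholders.
buckets = {
    "cn-north-1": "example-ssm-output-cn-north-1",
    "cn-northwest-1": "example-ssm-output-cn-northwest-1",
}

print(output_bucket_for("cn-northwest-1", buckets))
```

The returned name would then be passed as OutputS3BucketName for executions in that Region.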
Troubleshooting:
- Enable VPC Flow Logs to help troubleshoot EC2 access to Systems Manager and S3.
2.5 Creating the S3 bucket that stores command output
Bucket policy template:
{
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws-cn:iam::<account-id>:root"
]
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:PutObjectAcl"
],
"Resource": "<s3-bucket-arn>/*"
}
]
}
Reference: https://thinkwithwp.com/premiumsupport/knowledge-center/ssm-output-s3-other-account/
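When many target accounts need access, the bucket policy can be generated rather than edited by hand. The following is a minimal sketch that reproduces the statement from the template for a list of account IDs. The function name, bucket ARN, and account IDs are placeholders, and the "Version" element is an addition not present in the template.

```python
import json


def build_bucket_policy(bucket_arn: str, account_ids: list[str],
                        partition: str = "aws-cn") -> dict:
    """Build a bucket policy granting the listed accounts object access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    # One account-root principal per target account
                    "AWS": [f"arn:{partition}:iam::{acct}:root" for acct in account_ids]
                },
                "Action": ["s3:GetObject", "s3:PutObject", "s3:PutObjectAcl"],
                # Object-level actions apply to objects, hence the /* suffix
                "Resource": f"{bucket_arn}/*",
            }
        ],
    }


policy = build_bucket_policy(
    "arn:aws-cn:s3:::example-ssm-output",  # placeholder bucket ARN
    ["111111111111", "222222222222"],      # placeholder account IDs
)
print(json.dumps(policy, indent=2))
```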
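2.6 Configuring the IAM policy for cross-account access to the S3 bucket
To configure cross-account access to the Amazon S3 bucket, follow the steps at:
https://thinkwithwp.com/premiumsupport/knowledge-center/ssm-output-s3-other-account/
Note that, in addition to the S3 bucket policy, the EC2 instances also need to be granted the corresponding permissions.
In all target accounts, deploy the required SSMRunCommandOutputPolicy through CloudFormation StackSet: SSMRunCommandOutputPolicy.template. Note that the template has not been optimized, so the TargetS3BucketArn parameter must end with "/*".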
{
"Description": "Policy attached to SSM Managed Instance's profile, which allow SSM RunCommand output to organization central S3 bucket",
"Parameters": {
"TargetS3BucketArn": {
"Type": "String",
"Description": "SSM RunCommand Output S3 Bucket ARN"
}
},
"Resources": {
"SSMRunCommandOutputPolicy": {
"Type": "AWS::IAM::ManagedPolicy",
"Properties": {
"ManagedPolicyName": "SSMRunCommandOutputPolicy",
"PolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:PutObjectAcl"
],
"Resource": { "Ref": "TargetS3BucketArn" }
}
]
}
}
}
}
}
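Because the template passes TargetS3BucketArn straight into the policy's Resource, the value must already cover the bucket's objects. A small guard such as the one below can normalize the value before it is handed to the StackSet. The function name and the bucket ARN are hypothetical.

```python
def object_arn(bucket_arn: str) -> str:
    """Return an ARN pattern that covers every object in the bucket."""
    if bucket_arn.endswith("/*"):
        return bucket_arn  # already in the form the template expects
    return bucket_arn.rstrip("/") + "/*"


print(object_arn("arn:aws-cn:s3:::example-ssm-output"))
print(object_arn("arn:aws-cn:s3:::example-ssm-output/*"))
```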
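Use Terraform data sources to attach SSMRunCommandOutputPolicy dynamically and in bulk to the instance profile roles of the target EC2 instances: batchupdaterole.zip. Note that Terraform workspaces are needed to isolate the tfstate of each account.
Troubleshooting:
- Enable CloudTrail data events to help troubleshoot permission problems when SSM Agent accesses the S3 bucket.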
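2.7 Writing a custom Command Document with a fixed script (optional)
For scripts that are reused heavily, consider packaging them as a dedicated custom Command Document so they are easy to invoke; see the template above: CommandDocument-Test2.yml
Deployment options:
- Method 1 (recommended): deploy the Command Document with CloudFormation in the SSM administrator account, then share it with the other accounts (note that sharing works only within a Region).
- Method 2: deploy the custom Command Document to all target accounts through CloudFormation StackSet.
2.8 Deploying the Runbook (Automation Document)
A Runbook (Automation Document) that runs across accounts and Regions must be deployed or shared to every target account. The following templates can be deployed independently in each account or pushed centrally to all accounts through CloudFormation StackSet:
- Method 1 (recommended): a reference Runbook template that calls the managed Command Document AWS-RunShellScript: RunShellScript-AutomationDoc.yaml. The script content can be specified ad hoc when the Automation is run.
- Method 2: a reference Runbook template that calls a custom Command Document: yaml. It runs the specified script.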
Description: >
  SSM Automation Document that runs a custom SSM Command Document against a fleet of
  target instances.
Parameters:
  AutomationDocumentName:
    Type: String
    Description: Name of created SSM Automation Document
    Default: MyAutomation
  CommandDocumentName:
    Type: String
    Description: Name of SSM Command Document to run
    Default: MyCommand
  OutputS3BucketName:
    Type: String
    Description: Name of Output S3 Bucket
    Default: ''
  OutputS3KeyPrefix:
    Type: String
    Description: Output S3 Key Prefix
    Default: ''
Conditions:
  # HasS3Name and HasS3Prefix are true when the parameter is EMPTY;
  # OutputToS3 and AddS3Prefix negate them.
  HasS3Name:
    !Equals [!Ref OutputS3BucketName, '']
  OutputToS3:
    !Not [Condition: HasS3Name]
  HasS3Prefix:
    !Equals [!Ref OutputS3KeyPrefix, '']
  AddS3Prefix:
    !Not [Condition: HasS3Prefix]
Resources:
  AutomationDocument:
    Type: 'AWS::SSM::Document'
    Properties:
      Name: !Ref AutomationDocumentName
      DocumentType: Automation
      Content:
        description: Run custom Command Document
        schemaVersion: '0.3'
        assumeRole: '{{AutomationAssumeRole}}'
        parameters:
          InstanceId:
            type: StringList
            description: "(Required) EC2 Instance(s) to run command"
          AutomationAssumeRole:
            type: String
            default: ''
            description: >-
              (Optional) The ARN of the role that allows Automation to perform
              the actions on your behalf.
        mainSteps:
          - name: RunCommand
            action: 'aws:runCommand'
            inputs:
              DocumentName: !Ref CommandDocumentName
              InstanceIds:
                - '{{InstanceId}}'
              # Omit the S3 settings entirely when the parameters are empty
              OutputS3BucketName:
                !If
                  - OutputToS3
                  - !Ref OutputS3BucketName
                  - !Ref "AWS::NoValue"
              OutputS3KeyPrefix:
                !If
                  - AddS3Prefix
                  - !Ref OutputS3KeyPrefix
                  - !Ref "AWS::NoValue"
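2.9 Creating the Athena database and table
We recommend using the Athena wizard, with the S3 bucket created earlier as the data source, to create the Athena database and table definitions. For detailed steps, see: https://docs.thinkwithwp.com/zh_cn/zh_cn/athena/latest/ug/tables-location-format.html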
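When defining the table, it helps to know how the object path breaks down, because the query template in section 3 reads the key prefix, command ID, instance ID, and output stream from fixed positions in `$path`. The sketch below mirrors those positions in Python. It assumes a single-segment key prefix, and the plugin directory names in the sample path are an assumption about how Run Command lays out its output. The class name, function name, and sample values are hypothetical.

```python
from typing import NamedTuple


class OutputLocation(NamedTuple):
    bucket: str
    prefix: str
    command_id: str
    instance_id: str
    stream: str


def parse_output_path(path: str) -> OutputLocation:
    """Split an s3:// output path into the fields used by the Athena query.

    Athena arrays are 1-based, so split("$path", '/')[4] in SQL corresponds
    to parts[3] here, [5] to parts[4], [6] to parts[5], and [9] to parts[8].
    """
    parts = path.split("/")
    if len(parts) < 9 or parts[0] != "s3:":
        raise ValueError(f"unexpected output path: {path!r}")
    return OutputLocation(
        bucket=parts[2],
        prefix=parts[3],
        command_id=parts[4],
        instance_id=parts[5],
        stream=parts[8],
    )


# Sample path; bucket, IDs, and the two plugin directories are illustrative.
sample_path = (
    "s3://example-ssm-output/runshelloutput/"
    "11111111-2222-3333-4444-555555555555/i-0123456789abcdef0/"
    "awsrunShellScript/0.awsrunShellScript/stdout"
)
print(parse_output_path(sample_path))
```

If the key prefix contains additional slashes, every index shifts by the number of extra segments, in both this sketch and the SQL.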
3 Usage
Once the resources are configured, using the solution takes only two steps:
1. Run the deployed Runbook through the console or the aws cli:
Example command: aws ssm start-automation-execution --document-name "RunShellAutomation" --document-version "\$DEFAULT" --parameters '{"AutomationAssumeRole":[""],"commands":["<shell command to run>"]}' --target-parameter-name InstanceId --targets '[{"Key":"AWS::EC2::Instance","Values":["*"]}]' --max-errors "100%" --max-concurrency "100%" --target-locations '[{"Accounts":["<Target account Id or OU>"],"Regions":["cn-north-1","cn-northwest-1"],"TargetLocationMaxErrors":"100%","TargetLocationMaxConcurrency":"100%"},{"Accounts":["<Target account Id or OU>"],"Regions":["cn-north-1","cn-northwest-1"],"TargetLocationMaxErrors":"100%","TargetLocationMaxConcurrency":"100%"}]' --region cn-northwest-1
2. Use Athena to query and export a summary of the script results. Reference query template:
SELECT split("$path",'/')[4] AS outputprefix, split("$path",'/')[5] AS commandid, split("$path",'/')[6] AS instanceid, split("$path",'/')[9] AS device, output FROM "ssm"."runcommandoutput" WHERE split("$path",'/')[4]='runshelloutput' AND split("$path",'/')[9]='stdout'
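The JSON arguments of the command in step 1 are easy to get wrong when quoted by hand. The following sketch assembles the same `aws ssm start-automation-execution` invocation from Python data structures and lets `json.dumps` and `shlex.join` handle the quoting. The function name and the sample account ID are hypothetical, and the sketch only builds and prints the command without running it.

```python
import json
import shlex


def build_start_automation_command(document_name, commands, accounts, regions,
                                   home_region, assume_role_arn=""):
    """Return the argv list for a multi-account, multi-Region Automation run."""
    parameters = {
        "AutomationAssumeRole": [assume_role_arn],
        "commands": list(commands),
    }
    # Target every managed EC2 instance in each account and Region
    targets = [{"Key": "AWS::EC2::Instance", "Values": ["*"]}]
    target_locations = [{
        "Accounts": list(accounts),  # account IDs or OU IDs
        "Regions": list(regions),
        "TargetLocationMaxErrors": "100%",
        "TargetLocationMaxConcurrency": "100%",
    }]
    return [
        "aws", "ssm", "start-automation-execution",
        "--document-name", document_name,
        "--document-version", "$DEFAULT",
        "--parameters", json.dumps(parameters),
        "--target-parameter-name", "InstanceId",
        "--targets", json.dumps(targets),
        "--max-errors", "100%",
        "--max-concurrency", "100%",
        "--target-locations", json.dumps(target_locations),
        "--region", home_region,
    ]


argv = build_start_automation_command(
    document_name="RunShellAutomation",
    commands=["uname -a"],
    accounts=["111111111111"],  # placeholder account ID
    regions=["cn-north-1", "cn-northwest-1"],
    home_region="cn-northwest-1",
)
print(shlex.join(argv))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to `subprocess.run(argv)` instead of printing it would avoid shell quoting altogether.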
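4 Extensions
Building on this architecture, the solution can be extended further with other services, including but not limited to:
- Use Amazon Systems Manager Maintenance Windows and State Manager to run scripts automatically on a schedule
- Use Step Functions to implement complex distribution and alerting logic for Athena query results
- Use Custom Inventory to turn script results into Systems Manager Compliance reports, forming a complete operating-system configuration scanning and compliance solution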
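5 Conclusion
At this point we have built a solution framework for running operating-system shell scripts at scale across accounts, saving the script results automatically and making them available for later queries. To recap the advantages of this solution:
1. It uses Amazon Web Services cloud-native services to deploy and configure the environment quickly.
2. It supports cross-account, large-scale parallel execution, meeting the needs of complex enterprise workloads in the cloud.
3. It integrates the open-source IaC framework Terraform and supports a multi-layer nested script execution framework that can be extended flexibly.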
About the author