AWS Partner Network (APN) Blog
Accelerate Clinical Research with Real-World Data Using AWS Data Exchange for Amazon Redshift
By Thyge Sullivan Knuhtsen, Solutions Architect – AWS
By Vinodh Krishnamoorthy, Solutions Architect – AWS
By Yana Nikitina, VP Engineering – Verana Health
By Rohit Bordia, Platform Architect – Verana Health
Verana Health is an AWS Partner and AWS Marketplace Seller revolutionizing patient care and clinical research by unlocking the potential of real-world data (RWD).
Verana Health has an exclusive RWD network of 90 million de-identified patients from more than 20,000 clinicians, stemming from its exclusive data partnerships with the American Academy of Ophthalmology, American Academy of Neurology, and American Urological Association.
Using its clinician-informed and artificial intelligence (AI)-enhanced VeraQ population health data engine, Verana Health transforms structured and unstructured data into high-quality, curated, disease-specific data modules, called Qdata.
Verana’s Qdata helps power analytics solutions and software-as-a-service (SaaS) products for real-world evidence generation, clinical trials enablement, clinician quality reporting, and medical registry data management. Verana’s high-quality data and insights help drive progress in medicine to enhance the quality of care and quality of life for patients.
Through AWS Data Exchange, enabled by Amazon Redshift, Verana offers its life sciences customers an easy, convenient, and secure method of accessing high-quality clinical RWD for research use.
In this post, we will show how data providers can list Amazon Redshift data products on AWS Data Exchange.
Customer Use Case
A leading global, research-driven biopharmaceutical company was interested in accessing high-quality, curated ophthalmology data in five disease areas: diabetic retinopathy, diabetic macular edema, diabetic macular ischemia, neovascular age-related macular degeneration, and geographic atrophy.
Real-world data would be used to help enable research in the following use cases: natural history of disease, patient population characterization, and treatment patterns and outcomes.
The biopharma company had a definitive requirement to subscribe to the dataset on a multi-year basis and access the data via AWS Data Exchange. Other key requirements included the size of the patient data cohort, depth of analyzable variables to support multiple use cases, and ability to refresh the data sources via the cloud on a quarterly basis.
Solution Overview
Verana Health provided the customer with sample datasets in the five disease areas of interest through AWS Data Exchange. This allowed the customer to evaluate whether the quality of Verana Health’s data and direct data access met its needs. The datasets allow the customer to analyze data and conduct research to understand the characteristics of patient populations impacted by these retinal diseases and identify appropriate treatments.
Patients with retinal diseases experience visual impairment and blindness, which can have a profound and devastating impact. Retinal disease research can help lead to new and improved treatments, identify which treatments work best, and increase an understanding of when and how to treat patients in order to prevent, slow, or restore vision loss.
Through the streamlined AWS Data Exchange process, the customer was able to quickly and easily access Verana Health’s curated, quality data modules, Qdata, that include defined schema and documentation on how to use the data to solve business use cases. The evaluation of the dataset led to the execution of a subscription agreement for Qdata to be delivered on a quarterly basis for the duration of two years.
The key benefit for the biopharma company was seamless access to high-quality, disease-specific datasets, which enabled it to rapidly deliver a significant number of real-world evidence (RWE) studies with limited overhead, compared to the traditional project-by-project approach to RWE.
Technical Deployment
For this post, we’ll focus on the Data Provider role. The following architecture overview highlights what we’ll cover in the steps that follow.
Figure 1 – Architecture overview workflow.
In this example, a data provider leverages Amazon Simple Storage Service (Amazon S3) for flat-file storage of real-world evidence. Flat files are stored as Parquet (a column-oriented data file format), and an open-source Python library, psycopg2, is deployed via AWS Lambda to read the flat files from S3 and COPY them into the Amazon Redshift Serverless namespace via the Redshift Serverless workgroup.
Staging tables hold all CRUD operations before changes are pushed to the production (target) tables. From the production table, a Redshift datashare is created with selected schemas, tables, and views as the data product. That data product, the Redshift datashare, is then commercialized via AWS Data Exchange.
Prerequisites
Ensure the AWS Command Line Interface (AWS CLI) is configured and ready to work with AWS. For additional information, refer to the documentation.
AWS allows you to configure credentials via environment variables, AWS IAM Identity Center, or a configuration file. For more information, please see the documentation.
For this guide, we’ll assume the local CLI session is configured to work with AWS; for example, the following environment variables are set:
export AWS_ACCESS_KEY_ID=<AWS Access Key ID>
export AWS_SECRET_ACCESS_KEY=<AWS Secret Access Key>
export AWS_DEFAULT_REGION=<AWS Region>
Ensure the user or role used in the CLI session has sufficient permissions to execute the actions outlined in this guide.
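As a quick sanity check before running the commands below, you can confirm which account and Region the session resolves to. Here is a minimal sketch using boto3, which walks the same credential chain as the AWS CLI (installing boto3 is assumed):
# Sanity check: print the identity and Region the configured credentials resolve to.
import boto3

session = boto3.Session()
identity = session.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])
print("Region:", session.region_name)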
Sample Code
Specifically, this guide deploys the following architecture. Sample code can be found in this GitHub repository.
Figure 2 – Deployed sample code.
Create Amazon S3 Bucket
To create an Amazon S3 bucket as the source, execute the following commands:
aws cloudformation create-stack --stack-name <STACK_NAME> \
--template-body file://01-s3-bucket/template.yml \
--parameters ParameterKey=BucketName,ParameterValue='<S3_BUCKET_NAME>' \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name <STACK_NAME>
aws cloudformation describe-stacks --stack-name <STACK_NAME> --query 'Stacks[0].Outputs' --output table
The script above deploys an AWS CloudFormation stack with the name <STACK_NAME> using the [template](./01-s3-bucket/template.yml) and returns outputs from the stack deployed.
The template defines an S3 bucket with server-side encryption. This bucket will be used as the source of the data (.parquet) and for Lambda deployment packages. The BucketName parameter is required; replace <S3_BUCKET_NAME> with the name of the bucket to create.
In the example below, <STACK_NAME> is replaced with club-innocent-s3 and <S3_BUCKET_NAME> is replaced with club-innocent-us-east-1:
aws cloudformation create-stack --stack-name club-innocent-s3 \
--template-body file://01-s3-bucket/template.yml \
--parameters ParameterKey=BucketName,ParameterValue='club-innocent-us-east-1' \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name club-innocent-s3
aws cloudformation describe-stacks --stack-name club-innocent-s3 --query 'Stacks[0].Outputs' --output table
-----------------------------------------------------------------------------------------------
| DescribeStacks |
+------------------------------------+-------------+------------------------------------------+
| Description | OutputKey | OutputValue |
+------------------------------------+-------------+------------------------------------------+
| The ARN of the created S3 bucket | BucketArn | arn:aws:s3:::club-innocent-us-east-1 |
| The name of the created S3 bucket | BucketName | club-innocent-us-east-1 |
+------------------------------------+-------------+------------------------------------------+
Upload .parquet File to Created S3 Bucket
To upload a file to an S3 bucket, use the aws s3 cp command:
aws s3 cp <PATH_TO_FILE> s3://<BUCKET_NAME>/
In the example below, the .parquet file is located in the current working directory and is named titanic.parquet; the S3 bucket name is taken from the previous step:
aws s3 cp ./titanic.parquet s3://club-innocent-us-east-1/data/
upload: ./titanic.parquet to s3://club-innocent-us-east-1/data/titanic.parquet
Create VPC (Optional)
If you already have a deployed virtual private cloud (with subnets and security groups), please proceed to the next step.
To deploy a VPC with high availability and fault tolerance, execute the following command:
aws cloudformation create-stack --stack-name <STACK_NAME> \
--template-body file://02-vpc/template.yml \
--parameters ParameterKey=CIDRBlock,ParameterValue='<VPC_CIDR_BLOCK_VALUE>' \
ParameterKey=DeployNATGateway1,ParameterValue='<DEPLOY_NAT_GATEWAY_1>' \
ParameterKey=DeployNATGateway2,ParameterValue='<DEPLOY_NAT_GATEWAY_2>' \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name <STACK_NAME>
aws cloudformation describe-stacks --stack-name <STACK_NAME> --query 'Stacks[0].Outputs' --output table
The script above deploys a CloudFormation stack with the name <STACK_NAME> using the [template](./02-vpc/template.yml) and returns outputs from the stack deployed. The template defines a VPC with three public and three private subnets in three AWS Availability Zones (AZs) to achieve high availability and fault tolerance.
Additionally, the VPC is equipped with a dedicated internet gateway. By default, network address translation (NAT) gateways are not deployed, but they can optionally be added to provide outgoing internet connectivity for resources in the private subnets.
The template establishes two security groups: one for public resources and another for private resources. Only connections originating from the public security group are permitted to access the private security group.
In the example below, <STACK_NAME> is replaced with club-innocent-vpc, and default values for parameters are used:
aws cloudformation create-stack --stack-name club-innocent-vpc \
--template-body file://02-vpc/template.yml \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name club-innocent-vpc
aws cloudformation describe-stacks --stack-name club-innocent-vpc --query 'Stacks[0].Outputs' --output table
--------------------------------------------------------------------------------------------------------
| DescribeStacks |
+----------------------------------------------+--------------------------+----------------------------+
| Description | OutputKey | OutputValue |
+----------------------------------------------+--------------------------+----------------------------+
| The ID of the created VPC | VpcID | vpc-02c12c124663789op |
| The ID of the created Public Subnet 1 | PublicSubnet1Id | subnet-0c82be60af536c872 |
| The ID of the created Public Subnet 2 | PublicSubnet2Id | subnet-0efdf272744c33sse |
| The ID of the created Public Subnet 3 | PublicSubnet3Id | subnet-009d191fec1018eg2 |
| The ID of the created Private Subnet 3 | PrivateSubnet3Id | subnet-052a9561552fbe73c |
| The ID of the created Private Subnet 2 | PrivateSubnet2Id | subnet-08122d03450de78ed |
| The ID of the created Private Subnet 1 | PrivateSubnet1Id | subnet-0754793cb34114ea7 |
| The ID of the created Public Security Group | PublicSecurityGroupId | sg-04aa035df3ff378ed |
| The ID of the created Private Security Group| PrivateSecurityGroupId | sg-08d8c4c444e78el19 |
+----------------------------------------------+--------------------------+----------------------------+
Deploy Amazon Redshift Serverless
To deploy Amazon Redshift Serverless, execute the following command:
aws cloudformation create-stack --stack-name <STACK_NAME> \
--template-body file://03-redshift-serverless/template.yml \
--parameters ParameterKey=NamespaceName,ParameterValue='<NAMESPACE_NAME>' \
ParameterKey=WorkgroupName,ParameterValue='<WORKGROUP_NAME>' \
ParameterKey=DatabaseName,ParameterValue='<DATABASE_NAME>' \
ParameterKey=AdminUsername,ParameterValue='<ADMIN_USERNAME>' \
ParameterKey=AdminUserPassword,ParameterValue='<ADMIN_USER_PASSWORD>' \
ParameterKey=BaseRPU,ParameterValue='<BASE_RPU>' \
ParameterKey=PubliclyAccessible,ParameterValue='<PUBLICLY_ACCESSIBLE>' \
ParameterKey=SubnetIds,ParameterValue='<SUBNET_IDS>' \
ParameterKey=SecurityGroupIds,ParameterValue='<SECURITY_GROUP_IDS>' \
ParameterKey=LogExportsList,ParameterValue='<LOG_EXPORTS_LIST>' \
ParameterKey=EnhancedVpcRouting,ParameterValue='<ENHANCED_VPC_ROUTING>' \
--capabilities CAPABILITY_NAMED_IAM \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name <STACK_NAME>
aws cloudformation describe-stacks --stack-name <STACK_NAME> --query 'Stacks[0].Outputs' --output table
The script above deploys a CloudFormation stack with the name <STACK_NAME> using the [template](./03-redshift-serverless/template.yml) and returns outputs from the stack deployed. The template defines a Redshift Serverless Workgroup and Namespace, deployed into three subnets for high availability. If you don’t have a deployed VPC with subnets and security groups, refer to the previous step.
In the example below, <STACK_NAME> is replaced with club-innocent-redshift-serverless, <NAMESPACE_NAME> is replaced with titanic, and <WORKGROUP_NAME> is replaced with passengers. The private subnet and private security group IDs are taken from the previous example, and default values are used for the remaining parameters:
aws cloudformation create-stack --stack-name club-innocent-redshift-serverless \
--template-body file://03-redshift-serverless/template.yml \
--parameters ParameterKey=NamespaceName,ParameterValue='titanic' \
ParameterKey=AdminUserPassword,ParameterValue='tsbHH8KUisPtXxje' \
ParameterKey=WorkgroupName,ParameterValue='passengers' \
ParameterKey=SubnetIds,ParameterValue='subnet-052a9561552fbe73c\,subnet-08122d03450de78ed\,subnet-0754793cb34114ea7' \
ParameterKey=SecurityGroupIds,ParameterValue='sg-08d8c4c444e78el19' \
--capabilities CAPABILITY_NAMED_IAM \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name club-innocent-redshift-serverless
aws cloudformation describe-stacks --stack-name club-innocent-redshift-serverless --query 'Stacks[0].Outputs' --output table
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| DescribeStacks |
+------------------------------------------------------------------------------------+--------------------------+--------------------------------------------------------------------------+
| Description | OutputKey | OutputValue |
+------------------------------------------------------------------------------------+--------------------------+--------------------------------------------------------------------------+
| The name of the created Redshift Serverless Workgroup | WorkgroupName | passengers |
| The database name of the created Redshift Serverless Workgroup | WorkgroupDatabase | dev |
| The ARN of the created S3 Read-Only Access Role for Redshift Serverless Namespace | S3ReadOnlyAccessRoleArn | arn:aws:iam::************:role/redshift-s3-access-titanic-us-east-1 |
| The host of the created Redshift Serverless Workgroup | WorkgroupHost | club-innocentpassengers.************.us-east-1.redshift-serverless.amazonaws.com |
| The port of the created Redshift Serverless Workgroup | WorkgroupPort | 5439 |
+------------------------------------------------------------------------------------+--------------------------+--------------------------------------------------------------------------+
Deploy AWS Lambda Function
To deploy the AWS Lambda function, the [source code](./04-lambda/src) should be zipped and uploaded to the designated S3 bucket. To do so, execute the following statements:
cd ./04-lambda/src && zip -r ../deploy.zip . && cd ../../
aws s3 cp ./04-lambda/deploy.zip s3://<BUCKET_NAME>/lambda/copy-s3-to-redshift/deploy.zip
In the example below, the bucket created earlier is used:
cd ./04-lambda/src && zip -r ../deploy.zip . && cd ../../
aws s3 cp ./04-lambda/deploy.zip s3://club-innocent-us-east-1/lambda/copy-s3-to-redshift/deploy.zip
updating: psycopg2/ (stored 0%)
updating: psycopg2/_json.py (deflated 63%)
updating: psycopg2/extras.py (deflated 73%)
updating: psycopg2/_psycopg.cpython-39-x86_64-linux-gnu.so (deflated 66%)
updating: psycopg2/errorcodes.py (deflated 65%)
updating: psycopg2/tz.py (deflated 62%)
updating: psycopg2/_range.py (deflated 72%)
updating: psycopg2/_ipaddress.py (deflated 56%)
updating: psycopg2/__init__.py (deflated 56%)
updating: psycopg2/extensions.py (deflated 58%)
updating: psycopg2/errors.py (deflated 46%)
updating: psycopg2/sql.py (deflated 71%)
updating: psycopg2/pool.py (deflated 68%)
updating: lambda_function.py (deflated 54%)
upload: 04-lambda/deploy.zip to s3://club-innocent-us-east-1/lambda/copy-s3-to-redshift/deploy.zip
To deploy the Lambda function, execute the following command:
aws cloudformation create-stack --stack-name <STACK_NAME> \
--template-body file://04-lambda/template.yml \
--parameters ParameterKey=CodeS3Bucket,ParameterValue='<S3_BUCKET_NAME>' \
ParameterKey=CodeS3Key,ParameterValue='lambda/copy-s3-to-redshift/deploy.zip' \
ParameterKey=SubnetIds,ParameterValue='<SUBNET_IDS>' \
ParameterKey=SecurityGroupIds,ParameterValue='<SECURITY_GROUP_IDS>' \
ParameterKey=RedshiftIAMRole,ParameterValue='<REDSHIFT_IAM_ROLE>' \
ParameterKey=RedshiftHost,ParameterValue='<REDSHIFT_HOST>' \
ParameterKey=RedshiftPort,ParameterValue='<REDSHIFT_PORT>' \
ParameterKey=RedshiftUser,ParameterValue='<REDSHIFT_USER>' \
ParameterKey=RedshiftPassword,ParameterValue='<REDSHIFT_PASSWORD>' \
ParameterKey=RedshiftDatabase,ParameterValue='<REDSHIFT_DATABASE>' \
--capabilities CAPABILITY_NAMED_IAM \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name <STACK_NAME>
aws cloudformation describe-stacks --stack-name <STACK_NAME> --query 'Stacks[0].Outputs' --output table
The script above deploys a CloudFormation stack named <STACK_NAME> using the provided [template](./04-lambda/template.yml) and returns outputs from the deployed stack. The template defines a Python Lambda function that copies data from an S3 file to a Redshift table.
The parameters of the template can be split into three groups:
- The first group (CodeS3Bucket, CodeS3Key) determines the location where the zip archive with the function source code is stored.
- The second group (SubnetIds, SecurityGroupIds) configures the VPC settings for the Lambda function. Ensure that network connections are possible between Redshift Serverless and the chosen subnets and security groups. Attaching the Lambda function to the VPC allows it to connect to and execute statements on Redshift Serverless without making the database publicly accessible.
- The third group of parameters (RedshiftHost, RedshiftPort, RedshiftUser, RedshiftPassword, RedshiftDatabase) specifies the connection parameters for Redshift Serverless. The RedshiftIAMRole parameter determines the AWS Identity and Access Management (IAM) role used by Redshift Serverless to read data from the S3 bucket. For implementation details, refer to the function’s [handler](./04-lambda/src/lambda_function.py).
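For orientation, below is a minimal sketch of what such a handler could look like. It is not the repository’s exact code: the environment variable names are assumptions for illustration, while the event keys (table, s3_file_url) match the payload used later in this guide.
# Illustrative sketch of a copy-from-S3-to-Redshift handler; the environment
# variable names are assumptions, not necessarily those used in the repository.
import os
import psycopg2

def lambda_handler(event, context):
    table = event["table"]              # target Redshift table, e.g. "staging"
    s3_file_url = event["s3_file_url"]  # e.g. "s3://<bucket>/data/titanic.parquet"

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=os.environ["REDSHIFT_PORT"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
        dbname=os.environ["REDSHIFT_DATABASE"],
    )
    try:
        with conn.cursor() as cur:
            # COPY the Parquet file from S3 into the target table using the IAM role
            # attached to the Redshift Serverless namespace.
            cur.execute(
                f"COPY {table} FROM '{s3_file_url}' "
                f"IAM_ROLE '{os.environ['REDSHIFT_IAM_ROLE']}' "
                "FORMAT AS PARQUET;"
            )
        conn.commit()
    finally:
        conn.close()

    return {"statusCode": 200, "body": f"Copied {s3_file_url} into {table}"}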
In the example below, the Lambda function is deployed into the public subnets with the public security group. The Redshift connection parameters are taken from the deployment above:
aws cloudformation create-stack --stack-name club-innocent-s3-to-redshift-lambda \
--template-body file://04-lambda/template.yml \
--parameters ParameterKey=CodeS3Bucket,ParameterValue='club-innocent-us-east-1' \
ParameterKey=CodeS3Key,ParameterValue='lambda/copy-s3-to-redshift/deploy.zip' \
ParameterKey=SubnetIds,ParameterValue='subnet-0c82be60af536c872\,subnet-0efdf272744c33sse\,subnet-009d191fec1018eg2' \
ParameterKey=SecurityGroupIds,ParameterValue='sg-04aa035df3ff378ed' \
ParameterKey=RedshiftIAMRole,ParameterValue='arn:aws:iam::************:role/redshift-s3-access-titanic-us-east-1' \
ParameterKey=RedshiftHost,ParameterValue='club-innocentpassengers.************.us-east-1.redshift-serverless.amazonaws.com' \
ParameterKey=RedshiftPort,ParameterValue='5439' \
ParameterKey=RedshiftUser,ParameterValue='admin' \
ParameterKey=RedshiftPassword,ParameterValue='tsbHH8KUisPtXxje' \
ParameterKey=RedshiftDatabase,ParameterValue='dev' \
--capabilities CAPABILITY_NAMED_IAM \
--query 'StackId' --output text
aws cloudformation wait stack-create-complete --stack-name club-innocent-s3-to-redshift-lambda
aws cloudformation describe-stacks --stack-name club-innocent-s3-to-redshift-lambda --query 'Stacks[0].Outputs' --output table
-------------------------------------------------------------------------------------------------------------------------------------------
| DescribeStacks |
+------------------------------------------+---------------+------------------------------------------------------------------------------+
| Description | OutputKey | OutputValue |
+------------------------------------------+---------------+------------------------------------------------------------------------------+
| The name of the created Lambda Function | FunctionName | copy-from-s3-to-redshift |
| The ARN of the created Lambda Function | FunctionArn | arn:aws:lambda:eu-central-1:************:function:copy-from-s3-to-redshift |
+------------------------------------------+---------------+------------------------------------------------------------------------------+
Create Amazon Redshift Tables
To create tables (staging and production) in Redshift Serverless, the Redshift Data API can be used:
aws redshift-data execute-statement \
--workgroup-name '<WORKGROUP_NAME>' \
--database '<DATABASE_NAME>' \
--sql " \
CREATE TABLE <TABLE_NAME> ( \
<TABLE_STRUCTURE>
);" \
--query 'Id' --output text
The script above starts the creation of table <TABLE_NAME> and returns the statement execution ID. Ensure that appropriate permissions are configured for the Redshift Data API to execute the script seamlessly. The execution ID can be used to check the metadata about the execution or fetch the result of the execution:
aws redshift-data describe-statement \
--id <STATEMENT_ID> \
--query 'Status' --output text
aws redshift-data get-statement-result \
--id <STATEMENT_ID> \
--query 'Records' --output text
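When scripting these steps end to end, the status check is usually wrapped in a small polling loop. Below is a minimal sketch using boto3’s redshift-data client; the poll interval and error handling are illustrative assumptions, not part of the sample repository:
# Minimal polling sketch for Redshift Data API statements (illustrative only).
import time
import boto3

redshift_data = boto3.client("redshift-data")

def wait_for_statement(statement_id, poll_seconds=2):
    """Poll describe_statement until the statement reaches a terminal state."""
    while True:
        response = redshift_data.describe_statement(Id=statement_id)
        status = response["Status"]
        if status == "FINISHED":
            return response
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                f"Statement {statement_id} ended with {status}: {response.get('Error')}"
            )
        time.sleep(poll_seconds)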
In the example below, two tables (staging and production) are created to support the [titanic.parquet](./titanic.parquet) file in the workgroup (passengers) and database (dev) created before:
aws redshift-data execute-statement \
--workgroup-name passengers \
--database dev \
--sql " \
CREATE TABLE staging ( \
passengerid bigint, \
survived bigint, \
pclass bigint, \
name character varying(256), \
sex character varying(256), \
age double precision, \
sibsp bigint, \
parch bigint, \
ticket character varying(256), \
fare double precision, \
cabin character varying(256), \
embarked character varying(256) \
);" \
--query 'Id' --output text
aws redshift-data execute-statement \
--workgroup-name passengers \
--database dev \
--sql " \
CREATE TABLE production ( \
passengerid bigint, \
survived bigint, \
pclass bigint, \
name character varying(256), \
sex character varying(256), \
age double precision, \
sibsp bigint, \
parch bigint, \
ticket character varying(256), \
fare double precision, \
cabin character varying(256), \
embarked character varying(256) \
);" \
--query 'Id' --output text
Following the execution, the newly created tables can be found in the Redshift Query Editor:
Figure 3 – Redshift tables – created.
Trigger Lambda to Copy .parquet File from S3 into the Table
To invoke the Lambda function that copies the .parquet file uploaded to Amazon S3, execute the following statement:
aws lambda invoke \
--function-name <FUNCTION_NAME> \
--payload '{"table": "<TABLE_NAME>", "s3_file_url": "<S3_FILE_URL>"}' \
--cli-binary-format raw-in-base64-out \
<RESPONSE_FILE_PATH>
The script above invokes the Lambda function. The function expects a JSON input containing information about the source S3 .parquet file and the target table in Redshift Serverless. For more information, see the function’s [source code](./04-lambda/src/lambda_function.py).
In the example below, data from the previously uploaded .parquet file (s3://club-innocent-us-east-1/data/titanic.parquet) is copied into the staging table:
aws lambda invoke \
--function-name arn:aws:lambda:eu-central-1:************:function:copy-from-s3-to-redshift \
--payload '{"table": "staging", "s3_file_url": "s3://club-innocent-us-east-1/data/titanic.parquet"}' \
--cli-binary-format raw-in-base64-out \
./response.json
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
After the Lambda function’s successful invocation, the staging table is populated with data:
Figure 4 – Redshift tables – populated.
Push Data from Staging to Production Tables
To transfer data between tables, the Redshift Data API can be leveraged:
aws redshift-data execute-statement \
--workgroup-name '<WORKGROUP_NAME>' \
--database '<DATABASE_NAME>' \
--sql " \
INSERT INTO <DESTINATION_TABLE> (<COLUMN_1>, <COLUMN_2>, ...) \
SELECT <COLUMN_1>, <COLUMN_2>, ... \
FROM <SOURCE_TABLE>;" \
--query 'Id' --output text
The script above executes the INSERT INTO … SELECT … FROM SQL statement and obtains the execution ID. It can be used to retrieve the execution status:
aws redshift-data describe-statement \
--id <STATEMENT_ID> \
--query 'Status' --output text
In the example below, data from the staging table is copied into the production table:
aws redshift-data execute-statement \
--workgroup-name 'passengers' \
--database 'dev' \
--sql " \
INSERT INTO production ( \
passengerid, \
survived, \
pclass, \
name, \
sex, \
age, \
sibsp, \
parch, \
ticket, \
fare, \
cabin, \
embarked \
) \
SELECT \
passengerid, \
survived, \
pclass, \
name, \
sex, \
age, \
sibsp, \
parch, \
ticket, \
fare, \
cabin, \
embarked \
FROM staging;" \
--query 'Id' --output text
Following the CLI command execution, the production table is populated with data from the staging table:
Figure 5 – Redshift tables – staging to production.
Create a Datashare for the Table
To create an AWS Data Exchange datashare, several steps should be executed:
- Create a datashare with access granted on AWS Data Exchange.
- Add objects to the created datashare.
- Authorize AWS Data Exchange to access the datashare.
To create a datashare, execute the following command:
aws redshift-data execute-statement \
--workgroup-name '<WORKGROUP_NAME>' \
--database '<DATABASE_NAME>' \
--sql "CREATE DATASHARE <DATASHARE_NAME> MANAGEDBY ADX;" \
--query 'Id' --output text
In the example below, a datashare named passengers is created:
aws redshift-data execute-statement \
--workgroup-name 'passengers' \
--database 'dev' \
--sql "CREATE DATASHARE passengers MANAGEDBY ADX;" \
--query 'Id' --output text
To add objects to the created datashare, execute the following command:
aws redshift-data execute-statement \
--workgroup-name '<WORKGROUP_NAME>' \
--database '<DATABASE_NAME>' \
--sql " \
ALTER DATASHARE <DATASHARE_NAME> ADD SCHEMA <SCHEMA_NAME>; \
ALTER DATASHARE <DATASHARE_NAME> ADD TABLE <TABLE_NAME>; \
ALTER DATASHARE <DATASHARE_NAME> ADD ALL TABLES IN SCHEMA <SCHEMA_NAME>;" \
--query 'Id' --output text
The command above alters the existing datashare by adding various database objects to it. For more information about the types of objects that can be added, refer to the documentation.
In the example below, the table production is added to the datashare:
aws redshift-data execute-statement \
--workgroup-name 'passengers' \
--database 'dev' \
--sql "\
ALTER DATASHARE passengers ADD SCHEMA public; \
ALTER DATASHARE passengers ADD TABLE public.production;" \
--query 'Id' --output text
To authorize AWS Data Exchange to access the datashare and return the association status, execute the following CLI command:
aws redshift authorize-data-share \
--data-share-arn '<DATA_SHARE_ARN>' \
--consumer-identifier ADX
To obtain the datashare’s Amazon Resource Name (ARN), you can execute the following statement:
aws redshift describe-data-shares \
--query 'DataShares[].DataShareArn' --output text
In the example below, the created datashare passengers is authorized for AWS Data Exchange:
aws redshift describe-data-shares \
--query 'DataShares[].DataShareArn' --output text
arn:aws:redshift:eu-central-1:************:datashare:bd003563-827b-4abb-aca9-0aff032fd3ba/passengers
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:eu-central-1:************:datashare:bd003563-827b-4abb-aca9-0aff032fd3ba/passengers' \
--consumer-identifier ADX \
--query 'DataShareAssociations[].Status' --output text
AUTHORIZED
To confirm the state of the datashare, the AWS Management Console can be used:
Figure 6 – Redshift – Data Share in AWS console.
The Amazon Redshift datashare is now ready to be listed as a product for sale on AWS Data Exchange. Learn more in the AWS documentation.
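The listing itself is typically completed in the AWS Data Exchange console, but as a rough sketch of the underlying API flow, the authorized datashare can be imported into a data set revision roughly as follows; the data set name, description, and datashare ARN below are placeholders:
# Rough sketch of importing an authorized Redshift datashare into an AWS Data
# Exchange data set revision; the name, description, and ARN are placeholders.
import boto3

adx = boto3.client("dataexchange")
datashare_arn = "arn:aws:redshift:<region>:<account-id>:datashare:<namespace-guid>/passengers"

data_set = adx.create_data_set(
    AssetType="REDSHIFT_DATA_SHARE",
    Name="Passengers datashare (sample)",
    Description="Sample Amazon Redshift datashare delivered via AWS Data Exchange",
)
revision = adx.create_revision(DataSetId=data_set["Id"])

# Import the authorized datashare into the revision as an asset.
job = adx.create_job(
    Type="IMPORT_ASSETS_FROM_REDSHIFT_DATA_SHARES",
    Details={
        "ImportAssetsFromRedshiftDataShares": {
            "DataSetId": data_set["Id"],
            "RevisionId": revision["Id"],
            "AssetSources": [{"DataShareArn": datashare_arn}],
        }
    },
)
adx.start_job(JobId=job["Id"])

# After the import job completes, finalize the revision so it can be published
# as part of an AWS Data Exchange product.
adx.update_revision(DataSetId=data_set["Id"], RevisionId=revision["Id"], Finalized=True)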
Conclusion
Healthcare and life sciences companies are seeking data products with rapid time to value so they can glean insights quickly. With AWS Data Exchange for Amazon Redshift, data subscribers get live, queryable data that’s transactionally consistent, without having to do any of the ETL lifting.
In this post, we shared how a data provider can stand up an Amazon Redshift Serverless Namespace and Workgroup via AWS CloudFormation. A data provider can then load data from an Amazon S3 bucket into Redshift via the open-source database adapter psycopg2, deployed as a Python function in AWS Lambda.
We showed how data providers can load data into a staging table and then push final products to a production table. Finally, to commercialize a Redshift data product, we showed how to create a Redshift datashare for AWS Data Exchange.
Verana Health – AWS Partner Spotlight
Verana Health is an AWS Partner revolutionizing patient care and clinical research by unlocking the potential of real-world data.