AWS for Industries
How to set up a fully automated data pipeline from AWS Data Exchange to Amazon FinSpace
In previous posts, we’ve presented scenarios where Amazon FinSpace data analysis capabilities are used to address different use cases. For these analyses, we used data available on AWS Data Exchange and from third-party data sources. Examples include what-if scenarios of trading strategies, ESG portfolio optimization, and analyzing petabytes of trade and quote data.
One key aspect in these analyses is data availability in FinSpace. There are different options to access data with FinSpace: you can access data in place from FinSpace, or you can ingest data into FinSpace. In the latter case, you can do it either manually, using the FinSpace web interface, or programmatically, using the FinSpace API or FinSpace Jupyter notebooks.
In a previous post, we’ve explained how to ingest data into FinSpace programmatically and manually.
In this post, we show how to quickly create a pipeline that ingests datasets from AWS Data Exchange into FinSpace. This pipeline also gives you the option to manage periodic updates, meaning that as soon as a data update is available in AWS Data Exchange, it is immediately pushed into FinSpace.
Additionally, this integration is available with an Infrastructure-as-Code (IaC) approach, so you can one-click deploy the integration with the dataset of your choice.
Architecture
The following image shows the high-level functional workflow of the data pipeline described and implemented in this post.
Figure 1, Functional view of the automated data pipeline to ingest data into FinSpace
The following describes each step:
1 Subscribe to AWS Data Exchange dataset: With this step you subscribe to one of the data products available on AWS Data Exchange.
Data providers can choose different options to deliver their data products via AWS Data Exchange. For this post, we’ll consider a dataset delivered to an Amazon Simple Storage Service (Amazon S3) bucket. This means that when you subscribe to an AWS Data Exchange data product, it will be delivered to an S3 bucket in your AWS account.
2 Persist dataset (raw data): This is the step of persisting the AWS Data Exchange data product, which you subscribed to in the previous step, to your S3 bucket.
3 Start input process: This automated step triggers the data ingestion process into FinSpace.
In some cases, like dataset corrections or one-time loading, you may want to start this process manually instead of automatically. This is possible and we’ll provide more details later in this post.
4 Data preprocessing: FinSpace supports different file formats and data types for ingestion. In this step, we apply any data transformation required to insert the dataset into FinSpace. For example, suppose that the dataset you get from AWS Data Exchange is a compressed (for example, zipped) CSV file updated daily. In this case, data preprocessing will extract the CSV file. Since CSV is one of the FinSpace supported file formats, no other transformations are required, and you can pass it to the next step.
Note that you may also want to implement additional logic in this step. For example, you may want to apply your own data validation logic, such as checking data quality (percentage of null values, unbalanced data, and so on) before deciding to insert the dataset into FinSpace.
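As an illustration, here is a minimal sketch of such a quality check in Python, assuming a headered CSV file available locally; the file name and the 5% threshold are hypothetical:

```python
import csv

def null_ratio_ok(csv_path, max_null_ratio=0.05):
    """Return True if the share of empty cells in the CSV is below the threshold."""
    total_cells = 0
    empty_cells = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for value in row.values():
                total_cells += 1
                if value is None or value.strip() == "":
                    empty_cells += 1
    if total_cells == 0:
        return False  # an empty file should not be ingested
    return (empty_cells / total_cells) <= max_null_ratio

# Only hand the file over to the ingestion step if it looks healthy.
if null_ratio_ok("daily_treasury_maturities.csv"):
    print("Quality check passed, proceed with FinSpace ingestion")
else:
    print("Quality check failed, skip ingestion and alert")
```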
5 Get metadata and ingest into FinSpace: Before actually inserting the data into FinSpace, the integration retrieves the dataset metadata.
If this is a new dataset, then the metadata will include the dataset title, description, column descriptions, and the FinSpace categories and attributes that semantically qualify the dataset content. This makes it easier for FinSpace business users to find the dataset in the FinSpace data catalog.
If the dataset is an update for an existing dataset, then the metadata will contain the dataset ID, append or replace mode, and other information useful for executing the update.
6 FinSpace Data Catalog: Once this step is completed, the dataset is successfully inserted into FinSpace. It can be searched by FinSpace users in the FinSpace data catalog and analyzed by analysts and engineers using FinSpace managed Jupyter notebooks.
7 Data post-processing: This step executes any processing required after the dataset is inserted into FinSpace. For example, referring to the compressed CSV file example in Step 4, after you insert the dataset into FinSpace, you may want to delete the extracted CSV file (you can extract it again anytime) and move the compressed file to another S3 bucket dedicated to historical files.
Note that you can also have different post-processing logic, for example for successful and unsuccessful dataset inserts.
Figure 2, High-level architecture view of the automated data pipeline to ingest data into FinSpace
The previous figure maps the steps described above, and shown in Figure 1, to the high-level technical architecture implementing the integration.
Steps 1 and 2 can be implemented with different options: one is covered in this post, which describes and deploys an automation to retrieve new dataset revisions to Amazon S3 automatically. Another option is to use an AWS Data Exchange Auto-export job, which is described in this post.
For the purposes of this post, we’ll show you how to deploy the complete automated data pipeline using AWS Data Exchange Auto-export job.
Data preprocessing options
When you want to integrate a dataset into FinSpace, there are two possible scenarios:
First, as shown in the following figure, the dataset that you want to insert into FinSpace is ready to be inserted, meaning that you don’t need any data transformation to prepare it. In this case, you can leave the dataset in the “raw data” bucket and it will be inserted as-is. After the insertion into FinSpace, the data post-processing step can move this dataset to a historical/backup bucket.
Figure 3, Architecture view of the data pipeline, with no data transformation required on the dataset to insert it into FinSpace
Second, as shown in the following figure, the dataset that you want to insert into FinSpace isn’t ready to be inserted, meaning that you need some data transformation to prepare it, such as unzipping the file, converting the file type, and/or merging multiple source files into one. These data transformations are done by the “Data preprocessing” step, which saves the results into the “Curated data” bucket. After the insertion into FinSpace, the “Data post-processing” step applies the required logic, for example removing the files in the “Curated data” bucket and moving the files from the “raw data” bucket into a historical/backup bucket.
Figure 4, Architecture view of the data pipeline, with data transformation required on the dataset to insert it into FinSpace
Let’s review Steps 3-7 from a technical standpoint:
Step 3: The workflow starts from an Amazon EventBridge event that can be generated in different ways. First, you can manually create an EventBridge event that will trigger an AWS Step Functions workflow execution; this is useful for test purposes. Second, the event can be generated on a schedule using the Amazon EventBridge scheduling functionality. Third, you can configure an S3 bucket so that, as soon as a new file is persisted on Amazon S3, an event is emitted by Amazon S3 and handled by EventBridge, which triggers a Step Functions workflow.
Step 4: The data preprocessing AWS Lambda function gets the file(s) content and applies any required data transformation and/or any additional logic (such as data quality checks) that you may need. If the source data is transformed, then the resulting dataset is saved in the “Curated data” bucket.
Step 5: The Lambda function connects to Amazon DynamoDB to get the dataset metadata and calls the FinSpace API to insert the dataset. If the dataset is new, then a new FinSpace dataset and data view will be created, and any required FinSpace categories and attributes will be added to the dataset.
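For example, a manual trigger can be generated with the boto3 EventBridge client, as in the minimal sketch below. The Source and DetailType values are hypothetical placeholders and must match the event pattern of the rule deployed by the integration templates:

```python
import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Hypothetical Source/DetailType: they must match the event pattern of the
# EventBridge rule deployed by the integration templates.
response = events.put_events(
    Entries=[
        {
            "Source": "custom.adx-finspace-integration",
            "DetailType": "ManualIngestionTrigger",
            "Detail": json.dumps({"reason": "one-time load or test run"}),
        }
    ]
)
print("Failed entries:", response["FailedEntryCount"])
```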
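As a sketch of what the FinSpace API call can look like, the following uses the boto3 “finspace-data” client. The dataset ID, the S3 path, and the exact key names inside sourceParams and formatParams are assumptions for illustration; check the FinSpace data API reference and the Lambda code in the repository for the values used by the integration:

```python
import boto3

finspace = boto3.client("finspace-data", region_name="us-east-1")

# Hypothetical values: the dataset ID comes from the DynamoDB metadata item,
# and the S3 path points at the raw or curated file to ingest.
dataset_id = "example1234datasetid"
source_path = "s3://curated-data-bucket/daily_treasury_maturities.csv"

# The key names inside sourceParams/formatParams are assumptions based on the
# FinSpace data API conventions; verify them against the API reference.
response = finspace.create_changeset(
    datasetId=dataset_id,
    changeType="APPEND",  # or "REPLACE", driven by the metadata stored in DynamoDB
    sourceParams={"s3SourcePath": source_path},
    formatParams={"formatType": "CSV", "withHeader": "true", "separator": ","},
)
print(response)
```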
Step 6: FinSpace will perform the operation requested, creating a new changeset and performing any additional operation required (data view, attributes, and categories).
Step 7: The “Data post-processing” Lambda function will move files from the “raw data” bucket into a historical/backup bucket and, if the dataset was transformed during preprocessing, it will remove the files in the “Curated data” bucket.
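A minimal sketch of such a post-processing step with the boto3 S3 client follows; the bucket names are examples and should match the buckets deployed by your templates:

```python
import boto3

s3 = boto3.client("s3")

# Example bucket names: adjust them to the buckets deployed by your templates.
RAW_BUCKET = "adx-integration-rearc-dataset"
HISTORY_BUCKET = "adx-integration-rearc-dataset-history"

def archive_raw_files(prefix=""):
    """Copy every object from the raw bucket to the history bucket, then delete the original."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=RAW_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=HISTORY_BUCKET,
                Key=key,
                CopySource={"Bucket": RAW_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=RAW_BUCKET, Key=key)

archive_raw_files()
```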
Dataset used
We’ll use the following datasets to explain how to deploy the integration:
Daily Treasury Maturities | Federal Reserve Board from Rearc
20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap from Alpha Vantage.
Deploy AWS Data Exchange – FinSpace integration
This section describes how to deploy AWS Data Exchange – FinSpace integration. Integration code and configurations, described and used in this blog, are available on this GitHub repository.
To deploy the integration, we use CloudFormation templates. Specifically, we provide the following:
- One “core” AWS CloudFormation template (or core template), which deploys the integration components common to all dataset integrations, such as the Step Functions workflow, the DynamoDB table, and the Lambda function interacting with FinSpace and DynamoDB (number 5 in Figure 4).
- One “dataset” CloudFormation template (or dataset template), which adds, on top of the “core” template, the integration components required for a specific dataset. This includes the preprocessing and post-processing Lambda functions and the dataset metadata on DynamoDB.
In this post, we’ll show how to deploy the integration for two different datasets (Alpha Vantage and Rearc). Therefore, we’ll deploy the core template and two different dataset templates, one for Alpha Vantage and one for Rearc.
The two dataset templates are completely independent, meaning that you can deploy the first, the second, or both in any order (the same applies when removing them), with no impact on each other.
Therefore, you can implement and deploy as many dataset templates as required and you’ll be able to see, on the AWS Management Console, the list of datasets that you’ve deployed, and update/undeploy them independently.
In reference to Figure 4, Step 3, the Alpha Vantage dataset integration workflow is started by a scheduled EventBridge event, and the Rearc dataset integration workflow is started by an Amazon S3-generated EventBridge event.
The following workflow diagram provides a holistic view of the deployment process:
Figure 5, Automated data pipeline deployment workflow
The following shows how the four points in Figure 5 map to the chapters of this post.
Point 1 is explained in “Create FinSpace environment” and “Create a FinSpace user for the integration”.
Point 2 is explained in “Deploy integration core CloudFormation template”.
For the Rearc dataset:
Point 3 is explained in task a) of “Deploy Daily Treasury Maturities | Federal Reserve Board Rearc integration”.
Point 4 is explained in task b) of “Deploy Daily Treasury Maturities | Federal Reserve Board Rearc integration”.
For the Alpha Vantage dataset:
Point 3 is explained in task a) of “Deploy 20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap integration”.
Point 4 is explained in task b) of “Deploy 20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap integration”.
Create FinSpace environment
To deploy the integration, you must have a FinSpace environment in the us-east-1 region. If you don’t have one, then follow the procedure in this chapter to create it.
- Follow this guide to create a new FinSpace environment.
At Step 1, make sure to select the us-east-1 region.
At Step 6 (authentication methods selection), you can select “Email and password”. It’s also possible to configure Single Sign-On (SSO), but for the purposes of this post we’ll use “Email and password”.
After the environment is created, an “Environment domain” URL will be generated which is the sign-in URL for your FinSpace web application.
- Note the sign-in URL of the FinSpace web application and the credentials to sign in.
Create a FinSpace user for the integration
- Log in to the FinSpace user interface using the “Environment domain” URL.
- Select the top-right gear icon and then select “Users and Groups”, as follows:
Figure 6: Figure showing how to select “Users and Groups” menu item
- Select Add User.
- Fill the user fields as follows:
“First Name” = “adx_api_access”
“Last Name” = “adx_api_access”
“Email” = your email. This doesn’t need to be the email associated with your AWS account
“Superuser” = Yes
“Programmatic Access” = Yes
“IAM Principal ARN” = arn:aws:iam::<YOUR ACCOUNT ID>:role/adx-finspace-integration-importdataset-lambda-role
You must replace <YOUR ACCOUNT ID> with the 12-digit AWS account ID used to create the FinSpace environment.
Figure 7: Figure showing how to configure the FinSpace user used by the integration
- Select the top-right gear icon and then select “Users and Groups”, as in Figure 6.
- Select the name of the user just created, as in Figure 8.
Figure 8: The integration will use this user
- Select “Add User to a Group” and then select “FinSpace Administrators” as shown in Figure 9.
Figure 9: FinSpace user, used by the integration, is added to FinSpace Administrators Group
- Note the FinSpace Administrators Group ID. We’ll use it later when deploying the dataset integration CloudFormation template.
Figure 10: On this page we can see the Group ID of the programmatic user used by the integration
Deploy integration core CloudFormation template
- Select the following button to deploy the integration core template (CloudFormation stack name is “FinSpace-Core”).
When the deployment is complete, you’ll see the green “CREATE_COMPLETE” message on the CloudFormation console: now you can proceed to deploy the dataset templates.
Deploy “Daily Treasury Maturities | Federal Reserve Board Rearc” integration
Once the integration core template is successfully deployed, you can deploy the integration for “Daily Treasury Maturities | Federal Reserve Board” from Rearc.
There are two main tasks to be executed:
- Deploy the CloudFormation template for dataset integration, covering points 3-7 of Figure 2.
- Subscribe to “Daily Treasury Maturities | Federal Reserve Board” dataset and configure auto-export job destination, covering points 1 and 2 of Figure 2.
Let’s start with task a).
- Select the following button to deploy the integration for “Daily Treasury Maturities | Federal Reserve Board” dataset.
You’ll get to the CloudFormation “Quick create stack” web page.
- Fill the CloudFormation Parameters as follows:
- CoreStackName: This is the name of the integration core CloudFormation stack, deployed in the previous chapter. This field is automatically filled with “FinSpace-core”, so leave it as it is, as this field is case sensitive.
- EnvironmentInfrastructureAccount: Fill this field with your FinSpace environment infrastructure account. To find this value, go to the FinSpace console, select the “Environment name” that you’re using, and copy the “Environment infrastructure account” value, as shown in the following figure.
Figure 11: FinSpace Environment infrastructure account is available in the environment Summary.
- FinspaceDatasetOwnerEmail, FinspaceDatasetOwnerName, and FinspaceDatasetOwnerPhone: Insert the contact details of the FinSpace owner of the dataset. These contact details will be shown in the FinSpace Dataset details page. It’s also possible to change these values later. The phone number must be at least 10 digits.
- FinspaceDomainId: If your FinSpace URL is https://wekd36dsnwmsapd6u5imb.us-east-1.amazonfinspace.com, then the value of FinspaceDomainId is “wekd36dsnwmsapd6u5imb”.
- finspaceGroupId: This is the ID of the “FinSpace Administrators” group. We noted this value at the end of the “Create a FinSpace user for the integration” chapter, Figure 10; use it to fill this field.
- finspaceRegion: This is automatically filled with “us-east-1”, leave it as is.
- S3RearcDataFilesBucket: This is the S3 Bucket that will receive Rearc data files, including periodical updates. The CloudFormation template will create a new S3 bucket with the name that you specify in this field. Make sure to use an S3 Bucket name that isn’t already in use, otherwise the bucket creation will fail, thus blocking the deployment.
- Check the box “I acknowledge that CloudFormation might create IAM resources with custom names.”, then select the button “Create stack”.
When the deployment is complete, you’ll see the green “CREATE_COMPLETE” message on the CloudFormation console.
Let’s proceed with task b).
We’re going to subscribe to the “Daily Treasury Maturities | Federal Reserve Board” dataset.
- Go to Daily Treasury Maturities | Federal Reserve Board and select “Continue to subscribe”.
- Review Section 1 “Product offers” and Section 2 “Subscription terms”, along with “Data set” and “Support information”.
- In Section 3 “Offer auto-renewal”, choose whether you want to enable auto-renewal when the subscription expires. The dataset integration can be deployed and run with offer auto-renewal active or inactive.
- Select “Subscribe”. The subscription process may take a couple of minutes to complete.
- Go to AWS Data Exchange – Entitled data. Once the subscription process is completed successfully, you’ll see a “Daily Treasury Maturities | Federal Reserve Board” item under “Entitled data” – “Products”.
Figure 12: Selection of the subscribed product
- Select this item, then select “Actions” -> “Add auto-export job destination”.
Figure 13: Add auto export job button
- Select “Simple”, as in the following image, and fill “Select Amazon S3 bucket folder destination” with the bucket name you used for field “S3RearcDataFilesBucket”, during task a).
Figure 14: First part of the configuration of auto-export job destination
- You must update the target S3 bucket policy to grant AWS Data Exchange permissions to update files in your target S3 bucket. To update the bucket policy, select the “Bucket resource policy statement” shown in the box “Amazon S3 bucket resource policy dependency” and follow the instructions. A programmatic sketch of this policy is shown after this procedure.
Figure 15: S3 bucket policy for auto-export configuration
- It’s recommended that you encrypt the content on the S3 bucket either by using an encryption key fully managed by Amazon S3 (SSE-S3) or by using a key stored in AWS Key Management Service (AWS KMS) in your account (SSE-KMS). For the purpose of this post, we’ll use a fully-managed Amazon S3 key. Therefore, select the SSE-S3 option.
Figure 16: Amazon S3 encryption options
- Finally, select the “Add bucket destination” button to finish the auto-export configuration.
If you need more information about the AWS Data Exchange subscription, then check the documentation. Additional information is available in this post, which provides a step-by-step guide on how to configure an auto-export job.
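If you prefer to apply the bucket policy programmatically, the following is an illustrative sketch only; the bucket name and account ID are placeholders, and the exact “Bucket resource policy statement” generated by the AWS Data Exchange console is authoritative and may differ:

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "rearc-bucket-101"   # the value you used for S3RearcDataFilesBucket
ACCOUNT_ID = "123456789012"   # your 12-digit AWS account ID

# Illustrative policy only: prefer the statement generated by the console,
# which may contain different or additional statements and conditions.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "dataexchange.amazonaws.com"},
            "Action": ["s3:PutObject", "s3:PutObjectAcl"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {"StringEquals": {"aws:SourceAccount": ACCOUNT_ID}},
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```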
Once the auto-export job is completed, the Rearc dataset, which is composed of a single CSV file, will be persisted on your S3 bucket. This will trigger the generation of an Amazon S3 event. EventBridge will receive this Amazon S3 event and will use it to start the Step Functions workflow that will ingest the Rearc dataset on Amazon S3 into FinSpace.
Once the Step Functions workflow execution is completed, the dataset is imported into and available in FinSpace.
You can check the Step Functions workflow as follows:
- Go to the Step Functions console.
- Select the name of the state machine. The state machine name will be similar to “adx-finspace-integration”.
- Under the “Executions” panel, select the last execution.
You’ll see an image similar to the following, which will show you the execution status.
Figure 17: Graph view of Step Functions state machine execution
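If you prefer to check the latest execution status programmatically instead of in the console, here is a short sketch with the boto3 Step Functions client; the state machine ARN below is a placeholder:

```python
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

# Placeholder ARN: copy the real one from the Step Functions console.
state_machine_arn = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:adx-finspace-integration"
)

# Print the name and status of the most recent execution.
executions = sfn.list_executions(stateMachineArn=state_machine_arn, maxResults=1)
for execution in executions["executions"]:
    print(execution["name"], execution["status"])
```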
You can now access your dataset in FinSpace. To do this:
- Access the FinSpace web UI.
- On the FinSpace home page, select “Recent Data” at the top left of the screen.
- Select the Rearc “Daily Treasury Maturities | Federal Reserve Board” dataset.
You’ll see the home page of the dataset, like in the following image:
Figure 18: Rearc dataset is available on FinSpace
You can now start working with this dataset in FinSpace.
Deploy the “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” integration
In this chapter, we demonstrate how to deploy the integration for “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” from Alpha Vantage.
The deployment steps for this integration are similar to those in the previous chapter, “Deploy Daily Treasury Maturities | Federal Reserve Board Rearc integration”. In this chapter, we’ll explain the steps that differ between the two integrations and refer to the previous chapter for the steps that are identical.
There are two main tasks to be executed:
- Deploy the CloudFormation template for dataset integration, covering points 3-7 of Figure 2.
- Subscribe to the “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” dataset and export the dataset to Amazon S3. This covers points 1 and 2 of Figure 2.
Let’s start with task a).
- Select the following button to deploy the integration for the “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” dataset.
After selecting this button you’ll get to the CloudFormation “Quick create stack” web page. Follow the instructions described in task a) in the previous chapter to complete the CloudFormation deployment.
Once you’ve completed task a), you’ll see the green “CREATE_COMPLETE” message on the CloudFormation console.
Let’s proceed with task b).
We’ll subscribe to the “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” dataset.
- Go to 20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap and select “Continue to subscribe”.
- Follow the instructions described in task b) in the previous chapter to complete the subscription to this dataset and to configure the auto-export job that will export the dataset from AWS Data Exchange to the S3 bucket that you created in task a) of this chapter, when deploying the dataset CloudFormation template.
Once the auto-export job execution is completed, the Alpha Vantage dataset, which is composed of ten different CSV files, will be persisted on your S3 bucket.
In this case, EventBridge isn’t configured to automatically trigger the integration workflow when the Alpha Vantage files are persisted on Amazon S3. Instead, EventBridge is configured to start the integration workflow from a trigger event created by a schedule configured in EventBridge. We made this choice because the Alpha Vantage dataset is composed of ten files, and we don’t want to trigger one integration workflow, which would create a FinSpace Changeset, for each file persisted in Amazon S3. Instead, we want a single FinSpace dataset with one Changeset containing all of the data from the ten files. To do this, we configure a schedule that runs after all ten files have been persisted on Amazon S3.
If you want to check or change the schedule that triggers the event and thus the integration workflow, follow these instructions:
- Select here to go to the EventBridge rules.
- Select the rule with a name like “FinSpace-alphavantage-20y-avPublishEvent-YYYY”, where YYYY is a text string.
- You’ll see the Event schedule set to “29 15 11 05 ? 2022”. You can change it, following the cron syntax, according to your needs.
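If you prefer to update the schedule programmatically rather than in the console, here is a sketch with the boto3 EventBridge client. The rule name is illustrative (copy the full name, including the generated suffix, from the console), and note that changes made outside CloudFormation may be reverted by a later stack update:

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Illustrative rule name: copy the real one from the EventBridge console.
events.put_rule(
    Name="FinSpace-alphavantage-20y-avPublishEvent-YYYY",
    ScheduleExpression="cron(30 18 * * ? *)",  # every day at 18:30 UTC
)
```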
It’s also possible to generate an event manually. This triggers the integration workflow immediately, as shown in the following image.
Figure 19: Alpha Vantage integration is configured to run on a schedule, but it’s also possible to start it manually, as shown in this image
Once the Step Functions workflow execution is started, you can monitor the status as follows:
- Go to the Step Functions console.
- Select the name of the state machine. The state machine name will be “adx-integration-finspace-alphavantage-state-machine”.
- Under the “Executions” panel, select the last execution.
Selecting “Graph view”, you’ll see an image showing the execution status, similar to Figure 17.
Once the workflow execution is successfully completed, you can access your Alpha Vantage dataset in FinSpace by following these steps:
- Access the FinSpace web UI.
- On the FinSpace home page, select “Recent Data”, at the top-left of the screen.
- Select the “20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap” dataset.
You’ll see the homepage of the dataset, similar to the one in Figure 18.
How to extend this integration to other datasets
In this post, we’ve shown how to deploy the ready-to-use integration for Rearc and Alpha Vantage datasets. What is required to create an integration for another dataset?
In this chapter we provide a brief overview of the steps required to leverage the resources provided in this post to create the integration for another dataset.
1) FinSpace-Core integration: As you see in Figure 5, you can use the FinSpace-Core CloudFormation template as-is.
The FinSpace-Core stack will create the DynamoDB table that will contain the metadata of your dataset. Each dataset will have its own row in this DynamoDB table.
Select here to go to the DynamoDB tables and select the “ADX” table. You’ll see one item for each dataset. For example, the DynamoDB item created when you deploy the Rearc integration has the partition key “arn:aws:s3:::adx-integration-rearc-dataset”.
The FinSpace-Core stack will create the Lambda function used to call the FinSpace API to import the dataset and to check that the import completed successfully. Additionally, the FinSpace-Core stack will create other resources required for the integration to work, such as the AWS Identity and Access Management (IAM) role and security policy for the resources used.
2) Dataset integration: You can take one of the two datasets integration provided (Rearc or Alpha Vantage CloudFormation templates) and apply the changes that you need for your dataset.
If you want to create an event-driven integration, meaning that the integration workflow will start automatically as soon as a new file is persisted on the S3 dataset bucket, then you can start from Rearc integration.
This option is useful when you want to implement near real-time data processing and integration and/or event-driven architectures.
If you want to create a batch/scheduled integration, meaning that the integration workflow will start on a schedule you configure, then you can start from Alpha Vantage integration.
For example, this option is useful when you have different files that will be persisted to the S3 dataset bucket and you want to group all of these files in a single FinSpace Changeset. In this case, you must wait to have all of the files persisted on Amazon S3, and then start the integration.
All of the dataset integration changes described below must be reflected in the CloudFormation template that you’ll create for your dataset. For example, in point 3.3 we describe the DynamoDB item containing the dataset metadata; the dataset CloudFormation template that you create must deploy the DynamoDB item containing your dataset metadata. This applies to all of the dataset integration changes.
3) Dataset integration changes: The following bullet points list the key items that you must change to adapt one of the existing dataset integrations to a new dataset.
3.1) Preprocessing AWS Lambda function. Go to the Lambda console.
If you’ve deployed Alpha Vantage integration, then you’ll see a Lambda function named “FinSpace-alphavantage-20y-avPreprocLambda-YYYY”, where “YYYY” is a text string. If you select that function, then you’ll see the code used to concatenate all of the files of the Alpha Vantage dataset. This preprocessing is required to have a single dataset to be inserted into FinSpace.
If you’ve deployed Rearc integration, then you’ll see a Lambda function named “FinSpace-rearc-dtm-rearcPreprocLambda-YYYY”, where “YYYY” is a text string. If you select that function, then you’ll see the Rearc dataset preprocessing code. You’ll notice this code is simpler than the Alpha Vantage preprocessing code, because the Rearc dataset can be inserted into FinSpace as-is. Therefore, no data transformation, or preprocessing, is required.
Depending on the dataset that you want to insert, you may need to implement a different logic from the one implemented for the Alpha Vantage dataset. For example, you may have a zipped file containing the dataset that you want to insert into FinSpace. Therefore, you’ll have to implement the code to extract the dataset from the zipped file.
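As an illustration, here is a minimal sketch of such a preprocessing step, assuming a hypothetical zipped CSV file delivered to the raw bucket; the bucket and key names are placeholders:

```python
import io
import zipfile
import boto3

s3 = boto3.client("s3")

RAW_BUCKET = "my-raw-data-bucket"          # placeholder
CURATED_BUCKET = "my-curated-data-bucket"  # placeholder

def unzip_to_curated(zip_key):
    """Download a zip archive from the raw bucket and copy its CSV members to the curated bucket."""
    body = s3.get_object(Bucket=RAW_BUCKET, Key=zip_key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                s3.put_object(
                    Bucket=CURATED_BUCKET,
                    Key=member,
                    Body=archive.read(member),
                )

unzip_to_curated("daily_dataset.zip")
```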
What happens if you must implement a long-running (potentially hours-long) preprocessing job that’s incompatible with the Lambda function timeout of 15 minutes?
In this case, you can use another AWS service instead of Lambda, such as AWS Glue or Amazon EMR.
In this way, you can have long-running executions implemented in Python, PySpark, Scala, or Spark.
Replacing Lambda with another service is possible because the orchestration of the integration workflow is done by Step Functions, which can execute tasks (such as preprocessing, in our case) leveraging the AWS Lambda native integration, the AWS Glue native integration, the Amazon EMR native integration, and many other service integrations.
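For example, in the Amazon States Language definition of the workflow, the preprocessing task could point to an AWS Glue job instead of a Lambda function. The following sketch, expressed as a Python dictionary, shows what such a task state could look like; the job name and the “Next” state name are hypothetical:

```python
import json

# Sketch of an Amazon States Language task state that runs an AWS Glue job
# synchronously instead of invoking a preprocessing Lambda function.
preprocessing_state = {
    "Preprocessing": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "adx-finspace-preprocessing-job"},
        "Next": "ImportIntoFinSpace",
    }
}

print(json.dumps(preprocessing_state, indent=2))
```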
3.2) Post-processing Lambda function. Everything described in 3.1) for the preprocessing Lambda function also applies to the post-processing Lambda function. The difference is that the post-processing Lambda function implements the application logic that needs to run after the dataset has been ingested into FinSpace.
For the purposes of this post, we’ve implemented a post-processing Lambda function that removes the input files from the source S3 bucket.
Depending on your requirements, you may want to implement a different post-processing logic. For example, you may want to move the input files to a History S3 bucket, for later reference, instead of removing them.
As a reference, you can check the code implemented for Rearc and Alpha Vantage post-processing and change it to suit your requirements. To check the code, you can go to the Lambda console and:
If you’ve deployed Alpha Vantage integration, then you’ll see a Lambda function named “FinSpace-alphavantage-20y-avPostprocLambda-YYYY”, where “YYYY” is a text string. This is the post-processing Lambda Function for the Alpha Vantage dataset.
If you’ve deployed the Rearc integration, then you’ll see a Lambda function named “FinSpace-rearc-dtm-rearcPostprocLambda-YYYY”, where “YYYY” is a text string. This is the post-processing Lambda Function for the Rearc dataset.
3.3) Dataset metadata stored in DynamoDB. Each dataset has its own metadata, which is stored in a DynamoDB item that’s unique for each dataset. All of these items are stored in a DynamoDB table called “ADX”, which is deployed with the FinSpace Core CloudFormation template.
When creating the integration for a new dataset, you must change the CloudFormation template implementing the dataset integration, such as the Rearc or Alpha Vantage CloudFormation template, to deploy a new DynamoDB item with all of the dataset metadata relevant for your dataset.
You can check the DynamoDB item created for Rearc by following these steps. This also applies for the Alpha Vantage DynamoDB item, but the item key name is different.
- Go to the DynamoDB console. If you’ve successfully deployed Rearc or Alpha Vantage or both datasets, then you’ll see an “ADX” table.
- Select “ADX”, then select the “Explore table items” button. You’ll see the ADX item list, as shown in the following figure.
Figure 20: Select the item highlighted to view/edit Rearc Daily Treasury Maturities metadata
- Select the item highlighted in Figure 20 and you’ll see all of the fields and values contained in this item, as shown in Figure 21.
Figure 21. Fields and values of Rearc DynamoDB item
It’s important not to change the top-level field names, such as “finspaceDatasetKeyColumns”, “finspaceDatasetOwnerInfo”, or “finspaceDatasetPermissions”, because these fields are referenced by the Lambda functions orchestrated by Step Functions.
For a new dataset, you must change the dataset-specific metadata, such as the dataset column names in finspaceDatasetKeyColumns, the dataset title, and the dataset description.
These configurations must be made in the CloudFormation template that deploys the new dataset integration, along with the new DynamoDB item.
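As an illustration only, a new metadata item could be created along these lines with boto3. The attribute values below are placeholders and the partition key attribute name is hypothetical; the authoritative item structure is the one deployed by the Rearc and Alpha Vantage templates in the repository:

```python
import boto3

table = boto3.resource("dynamodb").Table("ADX")

# Placeholder values throughout. Keep the top-level field names used by the
# integration (finspaceDatasetKeyColumns, finspaceDatasetOwnerInfo,
# finspaceDatasetPermissions, ...) and copy the partition key attribute name
# and the full item structure from the Rearc or Alpha Vantage template.
table.put_item(
    Item={
        "datasetSourceArn": "arn:aws:s3:::my-new-dataset-bucket",  # hypothetical key attribute name
        "finspaceDatasetTitle": "My New Dataset",
        "finspaceDatasetDescription": "Description shown in the FinSpace data catalog",
        "finspaceDatasetKeyColumns": ["trade_date", "ticker"],
        "finspaceDatasetOwnerInfo": {
            "name": "adx_api_access",
            "email": "owner@example.com",
            "phoneNumber": "0123456789",
        },
        "finspaceDatasetPermissions": ["ViewDatasetDetails", "ReadDatasetData"],
    }
)
```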
3.4) Amazon EventBridge configuration. As described in Step 3 of Figure 4, EventBridge is required to start the integration workflow and you have two main options, described in the point “2) Dataset integration” above.
To check the two possible EventBridge configuration options:
- Select here to go to the EventBridge rules.
- If you’ve deployed the Alpha Vantage integration, then select the rule with the name such as “FinSpace-alphavantage-20y-avPublishEvent-YYYY”, where YYYY is a text string. Here you can review and edit the schedule configuration. Figure 19 shows this configuration.
- If you’ve deployed the Rearc integration, then select the rule with a name such as “FinSpace-rearc-dtm-RearcPublishEvent-YYYY”, where YYYY is a text string. Here you can review and edit the event pattern configuration. The following figure shows this configuration.
Figure 22. EventBridge configuration for Rearc dataset
If you check the “Event pattern” in Figure 22, then you see that this EventBridge rule is triggered when EventBridge receives an “Object Created” event from the S3 bucket “rearc-bucket-101”. The rule ignores events for objects whose key starts with “_”.
To generate this event, the S3 Rearc bucket must be configured to send notifications to EventBridge for events generated in the bucket. This configuration is deployed with Rearc integration, using the Rearc CloudFormation template.
You can check this configuration by going to the Amazon S3 console, selecting your S3 Rearc bucket, and then selecting “Properties” -> “Amazon EventBridge” configuration.
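Here is a sketch of what this configuration looks like when applied with the boto3 S3 client; the bucket name is the example from Figure 22:

```python
import boto3

s3 = boto3.client("s3")

# Turn on Amazon S3 -> EventBridge notifications for the Rearc bucket, so that
# every "Object Created" event in the bucket is delivered to the default event bus.
s3.put_bucket_notification_configuration(
    Bucket="rearc-bucket-101",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```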
Clean-up
If you don’t want to continue using FinSpace, and you want to remove all of the integrations and resources deployed with this post (data, integrations, and so on), then follow this procedure to clean up the resources:
1) Delete the FinSpace environment. Note that when you delete an environment, you’re deleting all of the assets (Jupyter notebooks, data, etc.) that you have in that environment. To delete the FinSpace environment:
- Go to the FinSpace landing page, and select Environments.
- Select the environment name, and then select the “Delete” button.
2) Delete the CloudFormation stacks deployed. First, delete the dataset stacks (Rearc and Alpha Vantage), and then delete the FinSpace-Core stack. For detailed instructions on how to delete a CloudFormation stack, check this guide.
3) Delete the Rearc and Alpha Vantage datasets that you put on the Amazon S3 buckets.
4) Unsubscribe from the Rearc and Alpha Vantage datasets in AWS Data Exchange by following the Unsubscribe from a product guide.
Conclusion
In this post, we’ve seen how to deploy an automated integration to ingest data into FinSpace from AWS Data Exchange or any dataset persisted on Amazon S3.
The integration can be event-driven or batch, depending on your requirements, and it can implement logic such as data validation/quality checks, data enrichment, and/or data transformation on the dataset before ingesting it into FinSpace.
We’ve seen how to use the one-click deploy links provided in this post to deploy the integration in your AWS Account.
We’ve also seen how to extend this integration framework to your datasets, to other AWS Data Exchange datasets, or other third-party datasets.
Integration code and configurations, described and used in this blog, are available on this GitHub repository.
Once data is available in FinSpace, you can begin analyzing it by using the notebooks provided in the FinSpace GitHub repo.
We welcome any feedback on additional features or extensions that you’d like to see in this integration.