AWS Machine Learning Blog
Accelerate the investment process with AWS Low Code-No Code services
The last few years have seen a tremendous paradigm shift in how institutional asset managers source and integrate multiple data sources into their investment process. With frequent shifts in risk correlations, unexpected sources of volatility, and increasing competition from passive strategies, asset managers are employing a broader set of third-party data sources to gain a competitive edge and improve risk-adjusted returns. However, the process of extracting benefits from multiple data sources can be extremely challenging. Asset managers’ data engineering teams are overloaded with data acquisition and preprocessing, while data science teams are mining data for investment insights.
Third-party or alternative data refers to data used in the investment process, sourced outside of the traditional market data providers. Institutional investors are frequently augmenting their traditional data sources with third-party or alternative data to gain an edge in their investment process. Typically cited examples include, but are not limited to, satellite imaging, credit card data, and social media sentiment. Fund managers invest nearly $3 billion annually in external datasets, with yearly spend growing by 20–30 percent.
With the exponential growth of available third-party and alternative datasets, the ability to quickly analyze whether a new dataset adds investment insight is a competitive differentiator in the investment management industry. AWS low-code no-code (LCNC) data and AI services enable nontechnical teams to perform the initial data screening, prioritize data onboarding, accelerate time-to-insights, and free valuable technical resources, creating an enduring competitive advantage.
In this blog post, we discuss how, as an institutional asset manager, you can leverage AWS LCNC data and AI services to scale the initial data analysis and prioritization process beyond technical teams and accelerate your decision-making. With AWS LCNC services, you can quickly subscribe to and evaluate diverse third-party datasets, preprocess the data, and check its predictive power using machine learning (ML) models without writing a single line of code.
Solution overview
Our use case is to analyze the stock price predictive power of an external dataset and identify its feature importance—which fields most impact the stock price performance. This serves as a first-pass test to identify which of the multiple fields in a dataset should be more closely evaluated using traditional quantitative methodologies to fit with your investment process. This type of first-pass test can be done quickly by analysts, saving time and letting you more quickly prioritize dataset onboarding. Also, while we are using stock price as our target example, other metrics such as profitability, valuation ratios, or trading volumes could also be used. All datasets used for this use case are published in AWS Data Exchange.
The following diagram explains the end-to-end architecture and the AWS LCNC services used to drive the decisions:
Our solution consists of the following steps and solutions:
- Data ingestion: AWS Data Exchange for subscribing to the published alternative datasets and downloading them to an Amazon Simple Storage Service (Amazon S3) bucket.
- Data engineering: AWS Glue DataBrew for data engineering and transformation of the data stored in Amazon S3.
- Machine learning: Amazon SageMaker Canvas for building a time series forecasting model and identifying the impact of each field on the forecast.
- Business intelligence: Amazon QuickSight or Amazon SageMaker Canvas to review feature importance to the forecast for decision-making.
Data ingestion
AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. You can browse the AWS Data Exchange catalog, find data products that are relevant to your business, and subscribe to them directly from the providers, with no additional processing and no ETL pipeline required. Note that many providers offer free initial subscriptions, which let you analyze their data without incurring upfront costs.
For this use case, search for and subscribe to the following datasets in AWS Data Exchange:
- 20 Years of End-of-Day Stock Data for Top 10 US Companies by Market Cap published by Alpha Vantage. This free dataset contains 20 years of historical data for the top 10 US stocks by market capitalization as of September 5, 2020. The dataset contains the following 10 symbols: AAPL (Apple Inc.), AMZN (Amazon.com, Inc.), BRK-A (Berkshire Hathaway Inc. Class A), FB (Facebook, Inc.), GOOG (Alphabet Inc.), JNJ (Johnson & Johnson), MA (Mastercard Incorporated), MSFT (Microsoft Corporation), V (Visa Inc.), and WMT (Walmart Inc.).
- Key data fields include
- Open: as-traded opening price for the day
- High: as-traded high price for the day
- Low: as-traded low price for the day
- Close: as-traded close price for the day
- Volume: trading volume for the day
- Adjusted Close: split and dividend-adjusted closing price of the day
- Split Ratio: ratio of new to old number of shares on the effective date
- Dividend: cash dividend payout amount
- S3 Short Interest and Securities Finance Data published by S3 Partners. This dataset contains the following fields:
| Field | Description |
| --- | --- |
| Business Date | Effective date for the rate |
| Security IDs | Security identifiers, including SEDOL, ISIN, FIGI, ticker, and Bloomberg ID |
| Name | Security name |
| Offer Rate | Market composite financing fee paid for existing short positions |
| Bid Rate | Market composite lending fee earned for existing shares on loan by long holders |
| Last Rate | Market composite lending fee earned for incremental shares loaned on that date (spot rate) |
| Crowding | Momentum indicator measuring daily shorting and covering events relative to the market float |
| Short Interest | Real-time short interest expressed in number of shares |
| ShortInterestNotional | Short interest multiplied by price (USD) |
| ShortInterestPct | Real-time short interest expressed as a percentage of equity float |
| S3Float | Number of tradable shares, including synthetic longs created by short selling |
| S3SIPctFloat | Real-time short interest projection divided by the S3 float |
| IndicativeAvailability | S3 projected available lendable quantity |
| Utilization | Real-time short interest divided by total lendable supply |
| DaystoCover10Day | Liquidity measure: short interest divided by the 10-day average daily trading volume (ADTV) |
| DaystoCover30Day | Liquidity measure: short interest divided by the 30-day ADTV |
| DaystoCover90Day | Liquidity measure: short interest divided by the 90-day ADTV |
| Original SI | Point-in-time short interest |
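To make the days-to-cover fields in the table above concrete, here is a minimal pandas sketch that recomputes a 10-day days-to-cover figure from short interest and trading volume. The column names and values are illustrative assumptions, not the exact field names in the S3 Partners file.

```python
import pandas as pd

# Illustrative columns and values; actual field names in the dataset may differ.
df = pd.DataFrame({
    "short_interest": [1_200_000, 1_250_000, 1_300_000],
    "volume": [400_000, 450_000, 500_000],
})

# Days to cover = short interest / N-day average daily trading volume (ADTV).
adtv_10d = df["volume"].rolling(window=10, min_periods=1).mean()
df["days_to_cover_10d"] = df["short_interest"] / adtv_10d
print(df)
```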
To get the data, first search for each dataset in AWS Data Exchange and subscribe to it:
Once the publisher approves your subscription requests, the datasets are available for you to download to your S3 bucket:
Select Add auto-export job destination, provide the details of the S3 bucket, and download the dataset:
Repeat the steps to get the Alpha Vantage dataset. Once completed, you will have both datasets in your S3 bucket.
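If you prefer to script the export rather than use the console, the following boto3 sketch shows one way to export a revision of an entitled dataset to your S3 bucket. The dataset ID, revision ID, bucket name, and key pattern are placeholders you would replace with your own values.

```python
import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Placeholder identifiers; look these up in the AWS Data Exchange console or via list_data_sets().
data_set_id = "YOUR_ENTITLED_DATA_SET_ID"
revision_id = "LATEST_REVISION_ID"

# Create and start an export job that writes the revision's assets to Amazon S3.
job = dx.create_job(
    Type="EXPORT_REVISIONS_TO_S3",
    Details={
        "ExportRevisionsToS3": {
            "DataSetId": data_set_id,
            "RevisionDestinations": [
                {
                    "Bucket": "your-alternative-data-bucket",
                    "KeyPattern": "raw/${Asset.Name}",
                    "RevisionId": revision_id,
                }
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])
```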
Data engineering
Once the datasets are in your S3 bucket, you can use AWS Glue DataBrew to transform the data. AWS Glue DataBrew offers more than 250 prebuilt transformations to automate data preparation tasks (such as filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks of hand-coded transformations.
To create a consolidated, curated dataset for forecasting in AWS Glue DataBrew, perform the following steps. For detailed information, refer to this blog.
- Create the DataBrew datasets.
- Load DataBrew datasets into DataBrew projects.
- Build the DataBrew recipes.
- Run the DataBrew jobs.
Create the DataBrew datasets: In AWS Glue DataBrew, a dataset represents data that is loaded from your S3 bucket. We will create two DataBrew datasets: one for the end-of-day stock price data and one for the S3 short interest data. When you create a dataset, you enter the S3 connection details only once; from that point on, DataBrew can access the underlying data for you.
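As a rough scripted equivalent of this console step, the following boto3 sketch registers the two S3 locations as DataBrew datasets. The dataset names, bucket, and keys are placeholder assumptions.

```python
import boto3

databrew = boto3.client("databrew")

# Placeholder S3 locations for the two downloaded datasets.
for name, key in [
    ("eod-stock-price", "raw/alpha-vantage/eod_stock_prices.csv"),
    ("s3-short-interest", "raw/s3-partners/short_interest.csv"),
]:
    databrew.create_dataset(
        Name=name,
        Format="CSV",
        Input={
            "S3InputDefinition": {
                "Bucket": "your-alternative-data-bucket",
                "Key": key,
            }
        },
    )
```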
Load the DataBrew datasets into DataBrew projects: In AWS Glue DataBrew, a project is the centerpiece of your data analysis and transformation efforts. A DataBrew project brings together the DataBrew datasets and enables you to develop a data transformation (DataBrew recipe). Here again, we will create two DataBrew projects, for end-of-day stock price and S3 short interest.
Build the DataBrew recipes: In DataBrew, a recipe is a set of data transformation steps that you apply to your dataset. For our use case, we build two recipes. The first changes the format of the end-of-day stock price timestamp column so that the dataset can be joined with the S3 short interest data:
The second recipe curates the data, and its last step joins the two datasets into a single curated dataset. For more details on building data transformation recipes, refer to this blog.
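Recipes built in the console can also be published programmatically. The boto3 sketch below shows the general shape of a recipe definition; the Operation name and parameter keys shown here are illustrative assumptions, so check the DataBrew recipe actions reference for the exact values your transformation needs.

```python
import boto3

databrew = boto3.client("databrew")

# Illustrative recipe step: reformat the end-of-day timestamp column so it can be
# joined with the short interest data. Operation and parameter names are examples only.
databrew.create_recipe(
    Name="eod-stock-price-recipe",
    Steps=[
        {
            "Action": {
                "Operation": "DATE_FORMAT",
                "Parameters": {
                    "sourceColumn": "timestamp",
                    "targetDateFormat": "yyyy-mm-dd",
                },
            }
        }
    ],
)
# The join to the S3 short interest dataset would be appended as an additional step;
# it is usually easiest to build it in the DataBrew console and export the recipe.
```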
Run the DataBrew jobs: After creating the DataBrew recipes, run the end-of-day stock price DataBrew job first, followed by the S3 short interest job. Refer to this blog to create a single consolidated dataset. Save the final curated dataset into an S3 bucket.
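For completeness, here is a minimal boto3 sketch of creating and starting one of the recipe jobs, assuming the dataset and recipe names used above and a DataBrew-enabled IAM role (the role ARN and bucket are placeholders).

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job that applies the recipe to the dataset and writes the
# curated output back to Amazon S3. Names, bucket, and role ARN are placeholders.
databrew.create_recipe_job(
    Name="eod-stock-price-job",
    DatasetName="eod-stock-price",
    RecipeReference={"Name": "eod-stock-price-recipe"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[
        {
            "Format": "CSV",
            "Location": {
                "Bucket": "your-alternative-data-bucket",
                "Key": "curated/",
            },
        }
    ],
)

# Start the job run; repeat for the S3 short interest job once this one completes.
databrew.start_job_run(Name="eod-stock-price-job")
```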
The end-to-end data engineering workflow will look like this:
Machine learning
With the curated dataset created post-data engineering, you can use Amazon SageMaker Canvas to build your forecasting model and analyze the impact of features on the forecast. Amazon SageMaker Canvas provides business users with a visual point-and-click interface that allows them to build models and generate accurate ML predictions on their own—without requiring any ML experience or having to write a single line of code.
To build a time series forecasting model in Amazon SageMaker Canvas, follow the steps below. For detailed information, refer to this blog:
- Select the curated dataset in SageMaker Canvas.
- Build the time series forecasting model.
- Analyze the results and feature importance.
Build the time series forecasting model: Once you have selected the dataset, select the target column to be predicted. In our case, this is the close price of the stock ticker. SageMaker Canvas automatically detects that this is a time series forecasting problem.
Configure the model for time series forecasting as follows: for the item ID, select the stock ticker column (remember, our dataset contains prices for the top 10 stocks); for the timestamp, select the timestamp column; and finally, enter the number of days you want to forecast into the future (the forecast horizon).
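Before uploading, it can help to sanity-check that the curated CSV contains the three columns this configuration relies on: an item identifier (the ticker), a timestamp, and the target (close price). A small pandas sketch, with assumed column and file names:

```python
import pandas as pd

# Assumed file and column names; adjust to match your DataBrew output.
df = pd.read_csv("curated_dataset.csv", parse_dates=["timestamp"])

required = ["ticker", "timestamp", "close"]
missing = [c for c in required if c not in df.columns]
assert not missing, f"Missing columns required for forecasting: {missing}"

# One time series per ticker, sorted by date, matching the item ID grouping.
df = df.sort_values(["ticker", "timestamp"])
print(df.groupby("ticker")["timestamp"].agg(["min", "max", "count"]))
```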
Now you are ready to build the model. SageMaker Canvas provides two options: Quick Build and Standard Build. In our case, we use Standard Build.
Standard Build takes approximately three hours and uses Amazon Forecast, an ML-based time series forecasting service, as the underlying forecasting engine. Forecast creates highly accurate forecasts by ensembling traditional and deep learning models, without requiring any ML experience.
Once the model is built, you can review the model performance (prediction accuracy) and feature importance. As can be seen in the figure below, the model identifies Crowding and DaysToCover10Day as the two top features driving forecast values. This is in line with our market intuition: crowding is a momentum indicator measuring daily shorting and covering events, and near-term short interest is a liquidity measure indicating how investors are positioned in a stock. Both momentum and liquidity can drive price volatility.
This result indicates that these two features (or fields) have a close relationship with stock price movements and can be prioritized higher for onboarding and further analysis.
Business intelligence
In the context of time series forecasting, the notion of backtesting refers to the process of assessing the accuracy of a forecasting method using existing historical data. The process is typically iterative and repeated over multiple dates present in the historical data.
As we already discussed, SageMaker Canvas uses Amazon Forecast as the engine for time series forecasting. Forecast creates a backtest as part of the model building process. You can view the predictor details by signing in to Amazon Forecast. For a deeper dive into model explainability, refer to this blog.
Amazon Forecast provides additional details on predictor metrics like weighted absolute percentage error (WAPE), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute scaled error (MASE). You can export predictor quality scores from Amazon Forecast.
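For reference, these error metrics are straightforward to reproduce from exported forecasts and actuals. The following is a minimal sketch with assumed example arrays (MASE is scaled here by a naive lag-1 forecast):

```python
import numpy as np

def wape(actual, forecast):
    return np.abs(actual - forecast).sum() / np.abs(actual).sum()

def rmse(actual, forecast):
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual))

def mase(actual, forecast, seasonality=1):
    # Scale by the in-sample error of a naive (lag-`seasonality`) forecast.
    naive_error = np.mean(np.abs(actual[seasonality:] - actual[:-seasonality]))
    return np.mean(np.abs(actual - forecast)) / naive_error

# Illustrative values only.
actual = np.array([100.0, 102.0, 101.0, 105.0])
forecast = np.array([99.0, 103.0, 100.5, 104.0])
print(wape(actual, forecast), rmse(actual, forecast),
      mape(actual, forecast), mase(actual, forecast))
```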
Amazon Forecast runs one backtest for the time series dataset provided. The backtest results are available for download using the Export backtest results button. Exported backtest results are downloaded to an S3 bucket.
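If you prefer the API over the console button, the export can also be started with boto3. The predictor ARN, S3 path, and role ARN below are placeholders.

```python
import boto3

forecast = boto3.client("forecast")

# Export the predictor's backtest results (forecasts vs. actuals) to Amazon S3.
# All ARNs and the S3 path are placeholders.
forecast.create_predictor_backtest_export_job(
    PredictorBacktestExportJobName="short-interest-backtest-export",
    PredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/your-canvas-predictor",
    Destination={
        "S3Config": {
            "Path": "s3://your-alternative-data-bucket/backtest-results/",
            "RoleArn": "arn:aws:iam::123456789012:role/AmazonForecastS3Role",
        }
    },
)
```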
We will now plot the backtest results in Amazon QuickSight. To do so, connect QuickSight to the exported dataset in Amazon S3 and create a visualization.
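For a quick look outside of QuickSight, the exported CSVs can also be plotted locally. The file and column names below (item_id, timestamp, target_value, p50) are assumptions based on a typical Forecast backtest export and should be verified against your exported files.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names from the backtest export; verify against your files.
df = pd.read_csv("forecasted_values.csv", parse_dates=["timestamp"])
aapl = df[df["item_id"] == "AAPL"]  # example ticker

plt.plot(aapl["timestamp"], aapl["target_value"], label="Actual close")
plt.plot(aapl["timestamp"], aapl["p50"], label="Forecast (p50)")
plt.legend()
plt.title("Backtest: actual vs. forecast close price")
plt.show()
```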
Clean up
The AWS services leveraged in this solution are managed and serverless in nature. However, SageMaker Canvas is designed to run long-running ML training workloads and remains active until you log out, so make sure you explicitly log out of SageMaker Canvas when you're done. Refer to the documentation for more details.
Conclusion
In this blog post, we discussed how, as an institutional asset manager, you can leverage AWS low-code no-code (LCNC) data and AI services to accelerate the evaluation of external datasets by offloading the initial dataset screening to nontechnical personnel. This first-pass analysis can be done quickly to help you decide which datasets should be prioritized for onboarding and further analysis.
We demonstrated step by step how a data analyst can acquire new third-party data through AWS Data Exchange and use AWS Glue DataBrew's no-code ETL capabilities to preprocess the data for analysis.
Once the data is analysis-ready, an analyst uses SageMaker Canvas to build a predictive model, evaluate its fit, and identify significant features. In our example, the model's MAPE (0.05) and WAPE (0.045) indicated a good fit and showed Crowding and DaysToCover10Day as the signals in the dataset with the largest impact on the forecast. This analysis quantified which data most influenced the model and could therefore be prioritized for further investigation and potential inclusion in your alpha signals or risk management process. Just as importantly, explainability scores indicate which data plays relatively little role in determining the forecast and can therefore be a lower priority for further investigation.
To more quickly evaluate the ability of third-party financial data to support your investment process, review the Financial Services data sources available on AWS Data Exchange, and give DataBrew and Canvas a try today.
About the Authors
Boris Litvin is Principal Solution Architect, responsible for Financial Services industry innovation. He is a former Quant and FinTech founder, passionate about systematic investing.
Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps high-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.
Camillo Anania is a Senior Startup Solutions Architect with AWS based in the UK. He is a passionate technologist helping startups of any size build and grow.
Dan Sinnreich is a Sr. Product Manager with AWS, focused on empowering companies to make better decisions with ML. He formerly built portfolio analytics platforms and multi-asset class risk models for large institutional investors.