AWS Machine Learning Blog

Prepare and clean your data for Amazon Forecast

You might use traditional methods to forecast future business outcomes, but these traditional methods are often not flexible enough to account for varying factors, such as weather or promotions, outside of the traditional time series data considered. With the advancement of machine learning (ML) and the elasticity that the AWS Cloud brings, you can now enjoy more accurate forecasts that influence business decisions. You will learn how to interpret and format your data according to what Amazon Forecast needs based on your business questions.

This post shows you how to prepare your data for optimal use with Amazon Forecast. Amazon Forecast is a fully managed service that allows you to forecast your time series data with high accuracy. It uses ML to analyze complex relationships in historical data and doesn’t require any prior ML experience. With its deep integration capabilities with the AWS Cloud, your forecasting process can be fully automated and highly flexible.

We will begin by understanding the different types of input data that Forecast accepts. With a retail use case, we will discuss how to structure your data to match the use case and forecasting granularity of the business metric that you are interested in forecasting. Then, we will discuss how to clean your data and handle challenging scenarios, such as missing values, to generate the most accurate forecasts.

Factors affecting forecast accuracy

Amazon Forecast uses your data to train a private, custom model tailored to your use case. ML models are only as good as the data put into them, and it’s important to understand what the model needs. Amazon Forecast can accept three types of datasets: target time series, related time series, and item metadata. Amongst those, target time series is the only mandatory dataset. This historical data provides the majority of the model’s accuracy.

Amazon Forecast provides predefined dataset domains that specify a schema of what data to include in which input datasets for common use cases, such as forecasting for retail, web traffic, and more. The domains are convenient column names only. The underlying models aren’t affected by these column names because they’re dropped prior to training. For the remainder of this post, we use the retail domain as an example.

Target time series data

Target time series data defines the historical demand for the resources you’re predicting. As mentioned earlier, the target time series dataset is mandatory. It contains three required fields:

  • item_id – Describes a unique identifier for the item or category you want to predict. This field may be named differently depending on your dataset domain (for example, in the workforce domain this is workforce_type, which helps distinguish different groups of your labor force).
  • timestamp – Describes the date and time at which the observation was recorded.
  • demand – Describes the amount of the item, specified by item_id, that was consumed at the timestamp specified. For example, this could be the number of pink shoes sold on a certain day.

You can also add additional fields in your input data. For example, in the retail dataset domain, you can optionally add an additional field titled location. This can help to add context about where the consumption occurred for that record and forecast the demand for items on a per-store basis where multiple stores are selling the same item. The best practice is to create a concatenated item_id identifier that includes product and location identifiers. One exception to this rule is if you know more than string names of locations, such as “Store 1”. If you know actual geolocations, such as postal codes or latitude/longitude points, then geolocation data such as weather can be pulled in automatically. This geolocation field needs to be separate from the item_id.
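As a minimal sketch of the concatenated item_id practice described above, the following pandas snippet combines a product identifier and a store name into a single item_id. The column names (product, location), the sample rows, the dates, and the underscore separator are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical raw sales records; names, dates, and values are illustrative only.
raw = pd.DataFrame({
    "timestamp": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "product": ["pink_shoes", "pink_shoes", "pink_shoes"],
    "location": ["Store 1", "Store 2", "Store 1"],
    "demand": [3, 4, 2],
})


def add_concatenated_item_id(df, sep="_"):
    """Build item_id from product and location so each store is forecast separately."""
    out = df.copy()
    store = out["location"].str.replace(" ", "", regex=False)
    out["item_id"] = out["product"] + sep + store
    return out[["item_id", "timestamp", "demand"]]


target = add_concatenated_item_id(raw)
print(target)
```

If you have real geolocations such as postal codes, keep them in their own column instead of folding them into item_id, as noted above.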

The frequency of observations in the historical data you provide is also important, because it determines the frequencies at which forecasts can be generated. You can provide target time series data at a granularity as fine as per-minute, where historical demand is recorded every minute, or as coarse as yearly. The data granularity must be finer than or equal to your desired forecast granularity. For example, if you want predictions on a monthly basis for each item, you should input data with monthly or finer granularity; yearly observations can’t support monthly forecasts.

High-quality datasets consist of dense data, with a data point for almost every item and timestamp combination. Sparse data doesn’t give Amazon Forecast enough information to determine the historical patterns it forecasts from. To achieve accurate forecasts, ensure that you can supply dense data or fill in missing data points with null filling, as described later in this post.
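One rough way to gauge how dense a target time series is, sketched here under simple assumptions, is to compare the number of observed item and timestamp combinations with the number you would expect if every item had a row at every time step. The density function and the sample rows are hypothetical, and the measure is only a heuristic, not something Amazon Forecast computes or requires.

```python
import pandas as pd


def density(df, freq):
    """Fraction of (item_id, timestamp) slots that have an observation."""
    ts = pd.to_datetime(df["timestamp"])
    # Every time step between the first and last observation, at the given frequency.
    expected_steps = len(pd.date_range(ts.min(), ts.max(), freq=freq))
    expected = expected_steps * df["item_id"].nunique()
    observed = len(df.drop_duplicates(["item_id", "timestamp"]))
    return observed / expected


# Hypothetical daily data: item A has all three days, item B only one.
sample = pd.DataFrame({
    "timestamp": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01"],
    "item_id": ["A", "A", "A", "B"],
    "demand": [7, 6, 1, 5],
})

print(round(density(sample, "D"), 2))
```

A value close to 1 suggests dense data; a low value suggests that you may need a coarser aggregation level or null filling.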

Related time series data

In addition to historical sales data, other data may be known per item at exactly the same time as every sale. This data is called related time series data. Related data can give more clues to what future predictions could look like. The best related data is also known in the future. Examples of related data include prices, promotions, economic indicators, holidays, and weather. Although related time series data is optional, including additional information can help increase accuracy by providing context of various conditions that may have affected demand.

The related time series dataset must include the same dimensions as the target time series, such as the timestamp and item_id. Additionally, you can include up to a maximum of 13 related features. For more information about useful features you may want to include for different use cases, see Predefined Dataset Domains and Dataset Types.
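The following sketch checks two of the constraints above before upload: that the related time series has a row for every item_id and timestamp in the target time series, and that it stays within the 13-feature limit. The function name, the price feature, and the sample rows are hypothetical.

```python
import pandas as pd

MAX_RELATED_FEATURES = 13  # limit stated in the text above
KEYS = ["item_id", "timestamp"]


def missing_related_rows(target, related):
    """Return the (item_id, timestamp) pairs in target that have no row in related."""
    merged = target[KEYS].drop_duplicates().merge(
        related[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(drop=True)


# Hypothetical data: the related series has no price for item B.
target = pd.DataFrame({
    "item_id": ["A", "A", "B"],
    "timestamp": ["2021-01-01", "2021-01-02", "2021-01-01"],
    "demand": [7, 6, 5],
})
related = pd.DataFrame({
    "item_id": ["A", "A"],
    "timestamp": ["2021-01-01", "2021-01-02"],
    "price": [9.99, 9.99],
})

missing = missing_related_rows(target, related)
feature_count = len(related.columns) - len(KEYS)
print(missing)
print(feature_count <= MAX_RELATED_FEATURES)
```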

Amazon Forecast trains a model using all input data. If the related time series doesn’t improve accuracy, it’s not used. When training with related data, it’s best to train using the CNN-QR algorithm, if possible, then check the model parameters to see if your related time series data was useful for improving accuracy.

Item metadata

Providing item metadata to Amazon Forecast is optional, but can help refine forecasts by adding contextual information about items that appear in your target time series data. Item metadata is static information that doesn’t change with time, describing features about items such as the color and size of a product being sold. Amazon Forecast uses this data to create predictions based on similarities between products.

To use item metadata, you upload a separate file to Amazon Forecast. Each row in the CSV file you upload must contain the item ID, followed by the metadata features for that item. Each row can have a maximum of 10 fields, including the field that contains the item ID.
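As an illustration, the following sketch writes item metadata rows to CSV text with the item ID first and enforces the 10-field limit mentioned above. The color and size features and the sample items are hypothetical. The sketch writes no header row and relies on column order, which is an assumption; check what your dataset import expects.

```python
import csv
import io

MAX_FIELDS = 10  # limit stated above, including the item ID field


def metadata_to_csv(rows, fieldnames):
    """Serialize item metadata with item_id first and at most MAX_FIELDS columns."""
    if not fieldnames or fieldnames[0] != "item_id":
        raise ValueError("item_id must be the first field")
    if len(fieldnames) > MAX_FIELDS:
        raise ValueError(f"at most {MAX_FIELDS} fields are allowed per row")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in fieldnames])
    return buf.getvalue()


# Hypothetical metadata for two products.
items = [
    {"item_id": "pink_shoes", "color": "pink", "size": "7"},
    {"item_id": "blue_shoes", "color": "blue", "size": "9"},
]
csv_text = metadata_to_csv(items, ["item_id", "color", "size"])
print(csv_text)
```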

Item metadata is required when forecasting demand for an item that has no historical demand, known as the cold start problem. This could be a new product that you want to launch, for example. Because item metadata is required, you can forecast demand for new products only if your data qualifies to train a deep learning algorithm. By learning the demand of items with similar features, Amazon Forecast predicts demand for your new product. For more information about forecasting for cold start scenarios, see the following best practices on GitHub.

Now that you understand the different types of input data and their formats, we explore how to manipulate your data to achieve your business objectives.

Structure your input data based on your business questions

When preparing your input data for Amazon Forecast, consider the business questions you want to ask. As mentioned earlier, Amazon Forecast requires three mandatory input columns (timestamp, item_id, and demand) as part of your time series data. You prepare your input data by aggregating it to the level your question needs while keeping the resulting structure in line with that input format. The following scenarios explain how you can manipulate and prepare your input data depending on your business questions.

Imagine we have the following dataset showing your daily sales per product. In this example, your company is selling two different products (Product A and Product B) in different stores (Store 1 and Store 2) across two different countries (Canada and the US).

Date    Product ID  Sales  Store ID  Country
01-Jan  Product A   3      Store-1   Canada
01-Jan  Product B   5      Store-1   Canada
01-Jan  Product A   4      Store-2   US
02-Jan  Product A   4      Store-2   US
02-Jan  Product B   3      Store-2   US
02-Jan  Product A   2      Store-1   Canada
03-Jan  Product B   1      Store-1   Canada

The provided sales data is recorded per day, per product, and per store, with each store belonging to one country. This initial assessment is useful when we prepare the input data.

Now imagine you need to ask the following forecasting question: “How many sales should I anticipate for Product A on January 4?”

The question is looking for an answer for a particular day, so you need to tell Amazon Forecast to predict at a daily frequency. Amazon Forecast can produce forecasts at the desired daily frequency because the raw data is reported at that same daily granularity; data at a finer granularity would also work.

The question also asks for a specific product, Product A. Because the raw data reports sales on a per-product granularity already, no further data preparation action is required for product aggregation.

The source data shows that sales are reported per store. Because the question doesn’t ask for a per-store forecast, you need to aggregate the sales of each product across all stores.

Taking these into account, your Amazon Forecast input structure looks like the following table.

timestamp  item_id    demand
01-Jan     Product A  7
01-Jan     Product B  5
02-Jan     Product A  6
02-Jan     Product B  3
03-Jan     Product B  1

Another business question you might ask could be: “How many sales should I anticipate from Canada on January 4?”

In this question, the granularity is still daily, so Amazon Forecast can produce daily forecasts. The question doesn’t ask for a specific product or store. However, it asks for a prediction on a country level. The source data is broken down on a per-store basis, and each store maps to exactly one country. That means you need to sum up all sales across all the different stores within the same country.

Your Amazon Forecast input structure looks like the following table.

timestamp  item_id  demand
01-Jan     Canada   8
01-Jan     US       4
02-Jan     Canada   2
02-Jan     US       7
03-Jan     Canada   1
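The per-product and per-country aggregations above can be sketched with a pandas groupby. The data is the sample sales table from this post; the helper name to_forecast_input is hypothetical. The timestamps stay as the short labels used in the tables, whereas a real import needs full dates.

```python
import pandas as pd

# The sample daily sales table from this post.
sales = pd.DataFrame(
    [
        ("01-Jan", "Product A", 3, "Store-1", "Canada"),
        ("01-Jan", "Product B", 5, "Store-1", "Canada"),
        ("01-Jan", "Product A", 4, "Store-2", "US"),
        ("02-Jan", "Product A", 4, "Store-2", "US"),
        ("02-Jan", "Product B", 3, "Store-2", "US"),
        ("02-Jan", "Product A", 2, "Store-1", "Canada"),
        ("03-Jan", "Product B", 1, "Store-1", "Canada"),
    ],
    columns=["Date", "Product ID", "Sales", "Store ID", "Country"],
)


def to_forecast_input(df, item_col):
    """Sum sales per day and per item_col, then rename to timestamp, item_id, demand."""
    out = df.groupby(["Date", item_col], as_index=False)["Sales"].sum()
    out = out.rename(columns={"Date": "timestamp", item_col: "item_id", "Sales": "demand"})
    return out[["timestamp", "item_id", "demand"]]


by_product = to_forecast_input(sales, "Product ID")  # "How many sales for Product A?"
by_country = to_forecast_input(sales, "Country")     # "How many sales from Canada?"
print(by_product)
print(by_country)
```

Choosing the grouping column is the only thing that changes between the two questions; the output columns stay the same.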

Lastly, we ask the following question: “How many overall sales should I anticipate for February?”

This question doesn’t mention any dimensions other than time. That means that all the sales data should be aggregated across all products, stores, and countries per month. Because Amazon Forecast requires a specific date to use as the timestamp, you can use the first of each month to indicate a month’s aggregated demand. Because item_id is still a required field, every row carries the same placeholder identifier. Your Amazon Forecast input structure looks like the following table.

timestamp  item_id  demand
01-Jan     daily    22
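A minimal sketch of this monthly roll-up follows. It uses the daily sales figures from the sample table and stamps each month with its first day. The year is an assumption, because the sample table gives none, and the all_sales identifier is a hypothetical placeholder for the single aggregated series.

```python
import pandas as pd

ASSUMED_YEAR = 2021  # hypothetical; the sample table does not state a year

# Dates and sales from the sample table (product, store, and country dropped).
daily = pd.DataFrame({
    "Date": ["01-Jan", "01-Jan", "01-Jan", "02-Jan", "02-Jan", "02-Jan", "03-Jan"],
    "Sales": [3, 5, 4, 4, 3, 2, 1],
})

# Parse the short labels into full dates, then snap each date to the first of its month.
dates = pd.to_datetime(daily["Date"] + f"-{ASSUMED_YEAR}", format="%d-%b-%Y")
month_start = dates.dt.to_period("M").dt.to_timestamp()

monthly = (
    daily.assign(timestamp=month_start)
    .groupby("timestamp", as_index=False)["Sales"]
    .sum()
    .rename(columns={"Sales": "demand"})
)
# item_id is still required, so use one constant placeholder for the whole series.
monthly.insert(1, "item_id", "all_sales")
print(monthly)
```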

This example data is just for demonstration purposes. Real-life datasets should be much larger, because a larger historical dataset yields more accurate predictions. For more information, see the data size best practices on GitHub. Remember that while you’re doing aggregations across dimensions, you’re reducing the total number of input data points. If there is little historical data, aggregation leads to fewer input data points, which may not be enough for Amazon Forecast to accurately train your predictor. You can experiment with different aggregation levels within your data and explore how they affect the accuracy of your predictions through iteration.

Data cleaning

Cleaning your data for Amazon Forecast is important because it can affect the accuracy of the forecasts that are created. To demonstrate some best practices, we use the Department store sales and stocks dataset provided by the Government of Canada. The data is already prepared for Amazon Forecast to predict on a monthly basis for each unique department using historical data from January 1991 to December 1997. The following table shows an excerpt of the cleaned data.

REF_DATE  Type of department            VALUE
1991-01   Bedding and household linens  37150
1991-02   Bedding and household linens  31470
1991-03   Bedding and household linens  34903
1991-04   Bedding and household linens  36218
1991-05   Bedding and household linens  40453
1991-06   Bedding and household linens  42204
1991-07   Bedding and household linens  48364
1991-08   Bedding and household linens  47920
1991-09   Bedding and household linens  44887
1991-10   Bedding and household linens  45551
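As a sketch of how these columns could be mapped onto the timestamp, item_id, and demand fields, the following snippet renames them and expands each year-month value to the first day of that month. It uses the first three rows of the excerpt; the function name is hypothetical.

```python
import pandas as pd

# First three rows of the excerpt above.
excerpt = pd.DataFrame({
    "REF_DATE": ["1991-01", "1991-02", "1991-03"],
    "Type of department": ["Bedding and household linens"] * 3,
    "VALUE": [37150, 31470, 34903],
})


def to_target_time_series(df):
    """Map the source columns onto timestamp, item_id, and demand."""
    # Parsing "YYYY-MM" yields the first day of the month.
    timestamps = pd.to_datetime(df["REF_DATE"], format="%Y-%m")
    return pd.DataFrame({
        "timestamp": timestamps.dt.strftime("%Y-%m-%d"),
        "item_id": df["Type of department"],
        "demand": df["VALUE"],
    })


target = to_target_time_series(excerpt)
print(target)
```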

In the following sections, we describe some of the steps that were taken to understand and cleanse our data.

Visualize the data

Previously, we discussed how granularity of data dictates forecast frequency and how you can manipulate data granularity to suit your business questions. With visualization, you can see at what levels of time and product granularity your data exhibits smoother patterns, which give the ML model better inputs for learning. If your data appears to be intermittent or sparse, try aggregating it to a coarser granularity (for example, combining all sales for a given day into a single data point) with equally spaced time intervals. If your data has too few observations to determine a trend over time, it has been aggregated at too coarse a level, and you should move to a finer granularity. For sample Python code, see our Data Prep notebook.

In the following chart of yearly demand for bedding and household linens, we visualize the data from earlier at the yearly aggregation level. The chart shows a one-time bump in the year 1994 that isn’t repeated. This is a bad aggregation level to use because there is no repeatable pattern to the historical sales. In addition, yearly aggregation results in too little historical data, which isn’t enough for Amazon Forecast to use.

Next, we can visualize our sample dataset at a monthly granularity level to identify patterns in our data. In the following figure, we plotted data for the bedding and household linens department and added a trendline. We can observe a seasonal trend that is predictable, which Amazon Forecast can learn and predict with.
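To make the data-volume effect of the aggregation level concrete, the following sketch counts how many points a single series has at the monthly and yearly levels over the January 1991 to December 1997 range of the sample dataset. The values are flat placeholders, not the real sales figures, so only the counts are meaningful.

```python
import pandas as pd

# One timestamp per month across the sample dataset's date range.
months = pd.date_range("1991-01-01", "1997-12-01", freq="MS")

# Placeholder values only: a constant series, not the real department sales.
monthly = pd.Series(1.0, index=months, name="demand")

# Aggregate to the yearly level by summing the months of each calendar year.
yearly = monthly.groupby(monthly.index.year).sum()

print(len(monthly), "monthly points per item")
print(len(yearly), "yearly points per item")
```

The yearly series keeps only a handful of points per item, which matches the observation above that yearly aggregation leaves too little history.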

Handle missing and zero values

You must also be careful of gaps and zero values within your target time series data. If the target field value (such as demand) is zero for a timestamp and item ID combination, this could mean that data was simply missing, the item wasn’t in stock, and so on. Having zeroes in your data that aren’t actual zeroes, such as values representing new or end-of-life products, can bias a model toward zero. When preparing your data, one best practice is to convert all zero values to null and let Amazon Forecast do the heavy lifting by automatically detecting new products and end-of-life products. In addition, adding an out-of-stock related variable per item_id and timestamp can improve accuracy. When you replace zeroes with null values, Amazon Forecast fills them according to the filling logic you specify, which you can change based on your null filling strategy.
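The zero-to-null conversion can be sketched in pandas as follows. The sample rows are hypothetical. The featurization_config dictionary shows the shape of the filling settings as we understand the FeaturizationConfig parameter of the CreatePredictor API; treat the keys and values as assumptions and verify them against the current API reference before use.

```python
import numpy as np
import pandas as pd


def zeros_to_null(df, value_col="demand"):
    """Replace zero demand with NaN so it is written as an empty (null) field."""
    out = df.copy()
    out[value_col] = out[value_col].astype("float64").replace(0, np.nan)
    return out


# Hypothetical target time series containing zeroes that are not true zero demand.
target = pd.DataFrame({
    "timestamp": ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
    "item_id": ["A", "A", "A", "A"],
    "demand": [5, 0, 3, 0],
})
cleaned = zeros_to_null(target)
print(cleaned.to_csv(index=False))  # NaN values appear as empty fields

# Assumed shape of the null filling settings; verify against the API reference.
featurization_config = {
    "ForecastFrequency": "M",
    "Featurizations": [{
        "AttributeName": "demand",
        "FeaturizationPipeline": [{
            "FeaturizationMethodName": "filling",
            "FeaturizationMethodParameters": {
                "frontfill": "none",
                "middlefill": "nan",
                "backfill": "nan",
            },
        }],
    }],
}
```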

In our sample dataset, the data for the plumbing, heating, and building materials department is blank or contains a 0 after June 1993.

timestamp   item_id                                   demand
1991-01-01  Plumbing, heating and building materials  5993
1991-02-01  Plumbing, heating and building materials  4661
1991-03-01  Plumbing, heating and building materials  5826
1993-05-01  Plumbing, heating and building materials  5821
1993-06-01  Plumbing, heating and building materials  6107
1993-07-01  Plumbing, heating and building materials
1993-08-01  Plumbing, heating and building materials
1993-09-01  Plumbing, heating and building materials
….          ….                                        ….
1995-11-01  Plumbing, heating and building materials
1995-12-01  Plumbing, heating and building materials  0
1996-01-01  Plumbing, heating and building materials  0
1996-02-01  Plumbing, heating and building materials  0

Upon further inspection, only blank and zero values are observed until the end of the dataset. This is known as the end-of-life problem. We have two options: simply remove these items from training data because we know that their forecast should be zero, or replace all zeroes with nulls and let Amazon Forecast automatically detect end-of-life products with null filling logic.
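The first option, removing end-of-life items from the training data, can be sketched as follows. The rows are taken from the tables in this post. The helper name and the threshold of three trailing blank or zero observations are hypothetical choices, and Amazon Forecast may apply different logic when it detects such items itself.

```python
import pandas as pd

PLUMBING = "Plumbing, heating and building materials"
BEDDING = "Bedding and household linens"
NAN = float("nan")

# Rows taken from the tables in this post; NaN marks a blank demand value.
data = pd.DataFrame({
    "timestamp": ["1993-05-01", "1993-06-01", "1993-07-01", "1995-12-01",
                  "1996-01-01", "1996-02-01",
                  "1991-01-01", "1991-02-01", "1991-03-01"],
    "item_id": [PLUMBING] * 6 + [BEDDING] * 3,
    "demand": [5821, 6107, NAN, 0, 0, 0, 37150, 31470, 34903],
})


def end_of_life_items(df, min_trailing=3):
    """Items whose last min_trailing observations are all blank or zero."""
    ended = []
    for item, group in df.sort_values("timestamp").groupby("item_id"):
        tail = group["demand"].tail(min_trailing)
        if len(tail) == min_trailing and tail.fillna(0).eq(0).all():
            ended.append(item)
    return ended


ended = end_of_life_items(data)
training = data[~data["item_id"].isin(ended)]
print(ended)
print(training)
```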

Conclusion

This post outlined how to prepare your data to predict according to your business outcomes with Amazon Forecast. When you follow these best practices, Amazon Forecast can create highly accurate probabilistic forecasts. To learn more about data preparation for Amazon Forecast and best practices, refer to the Amazon Forecast Cheat Sheet and the sample data preparation Jupyter notebook. You can also take a self-learning workshop and browse our other sample Jupyter notebooks that show how to productionize with Amazon Forecast.


About the Authors

Murat Balkan is an AWS Solutions Architect based in Toronto. He helps customers across Canada to transform their businesses and build industry leading solutions on AWS.

Christy Bergman is working as an AI/ML Specialist Solutions Architect at AWS. Her work involves helping AWS customers be successful using AI/ML services to solve real-world business problems. Prior to joining AWS, Christy worked as a data scientist in banking and software industries. In her spare time, she enjoys hiking and bird watching.

Brandon How is an AWS Solutions Architect who works with enterprise customers to help design scalable, well-architected solutions on the AWS Cloud. He is passionate about solving complex business problems with the ever-growing capabilities of technology.