AWS for Industries
How Sigmoid Uses DataWig From Amazon Science for Missing Value Imputation to Make CPG Datasets Ready for Machine Learning
When training a machine learning (ML) model, the quality of the model is directly proportional to the quality of the data. However, consumer packaged goods (CPG) datasets often contain many missing values, which degrade the quality of training and prediction in the long run.
If your models are already operationalized on Amazon SageMaker, which is used to build, train, and deploy ML models for virtually any use case, you can use Amazon SageMaker Data Wrangler, which simplifies data preparation and feature engineering and lets you complete each step of the data preparation workflow from a single visual interface. But if you maintain an on-premises ML environment, run your ML training and models on Amazon Elastic Compute Cloud (Amazon EC2), which provides resizable compute capacity for virtually any workload, or are not yet ready to migrate to Amazon Web Services (AWS), you will need another way to impute the missing values in a scientific manner.
There are several methods that can be used to fill the missing values, but in this blog, together with Sigmoid, an AWS Partner, we will show you how to use DataWig for missing data imputation and why it is efficient for ML data preprocessing.
DataWig is an ML-based tool developed by the Amazon Science team, used primarily for missing value imputation. The underlying models are based on deep learning, trained with Apache MXNet, and packaged as a library. DataWig runs as a backend when you train your imputation models and helps you generate the predicted missing values.
In this blog, we will look into some of the important components of the library and how it can be used for imputing missing values in a dataset.
Important components of the DataWig library
To understand how the DataWig library works, let’s first go through some of the important components and understand what they do.
- ColumnEncoders
- The ColumnEncoders convert the raw data of a column into an encoded numerical representation.
- There are four ColumnEncoders provided in the DataWig library:
- SequentialEncoder: provides encoding for text data (characters or words)
- BowEncoder: provides bag-of-words encoding for text data (hashing vectorizer or term frequency–inverse document frequency based on the algorithm used)
- CategoricalEncoder: provides one-hot encoding for categorical columns
- NumericalEncoder: provides encoding for numerical columns
- Column featurizers
- Column featurizers are used to feed encoded data from ColumnEncoders into the imputer model’s computational graph for training and prediction.
- There are four column featurizers present in the DataWig library (a short pairing sketch follows this list):
- LSTMFeaturizer: is used with SequentialEncoder and maps the sequence of input into vectors using long short-term memory (LSTM)
- BowFeaturizer: is used with bag-of-words-encoded columns
- EmbeddingFeaturizer: maps encoded categorical columns into vector representation (word embeddings)
- NumericalFeaturizer: is used with numerical-encoded columns and extracts features using fully connected layers
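- To make these pairings concrete, the sketch below instantiates each encoder with its matching featurizer; the column names (review_text, brand, pack_size) are illustrative only, not from the project described later.

```python
from datawig.column_encoders import (
    SequentialEncoder, BowEncoder, CategoricalEncoder, NumericalEncoder,
)
from datawig.mxnet_input_symbols import (
    LSTMFeaturizer, BowFeaturizer, EmbeddingFeaturizer, NumericalFeaturizer,
)

# Typical encoder-featurizer pairings, one per column type:
encoders_and_featurizers = [
    (SequentialEncoder('review_text'), LSTMFeaturizer('review_text')),  # text as character/word sequences
    (BowEncoder('review_text'), BowFeaturizer('review_text')),          # text as bag-of-words
    (CategoricalEncoder('brand'), EmbeddingFeaturizer('brand')),        # categorical column
    (NumericalEncoder('pack_size'), NumericalFeaturizer('pack_size')),  # numerical column
]
```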
- SimpleImputer
- Using SimpleImputer is one of the simplest ways that you can train a missing value imputation model. It only takes three parameters:
- Input_Column: represents the list of feature columns
- Output_Column: takes the name of the target column that one is training
- Output_Path: is the path where the trained model will be stored
- For example, we have a dataset with three different columns: a, b, and c. Based on a and b, we want to fill the missing values of column c. In this case, SimpleImputer will work as follows:
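- For illustration, here is a minimal sketch of this example using the open-source DataWig API; the toy DataFrames and the column names a, b, and c are placeholders, not data from the project described later.

```python
import pandas as pd
from datawig import SimpleImputer

# Toy data: "a" and "b" are feature columns, "c" is the column to impute.
df_train = pd.DataFrame({
    'a': ['red', 'green', 'blue', 'red', 'green'],
    'b': [1.0, 2.0, 3.0, 1.5, 2.5],
    'c': ['apple', 'kiwi', 'plum', 'apple', 'kiwi'],
})
df_missing = pd.DataFrame({
    'a': ['green', 'blue'],
    'b': [2.1, 2.9],
    'c': [None, None],
})

imputer = SimpleImputer(
    input_columns=['a', 'b'],     # feature columns
    output_column='c',            # target column to impute
    output_path='imputer_model',  # directory where the trained model is stored
)

imputer.fit(train_df=df_train)        # train the model
# imputer.fit_hpo(train_df=df_train)  # alternative: train with hyperparameter tuning

# Predict the missing values of "c"; the result contains an imputed column for "c".
predictions = imputer.predict(df_missing)
```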
- While using SimpleImputer, you don’t need to worry about encoding and featurizing different input columns because the library automatically detects the data type for each column and uses the encoders and featurizers accordingly.
- This gives you less control over the training process, but in general, it yields good results.
- After passing the above parameters, you have two options:
- Imputer.fit: is used to train the model
- Imputer.fit_hpo: is used to train and tune the model (it has a built-in dictionary to choose the values from, and you can pass hyperparameters in the form of a custom dictionary to tune the model based on project requirements)
- Imputer
- Imputer gives you more control over the training process, which is one of the primary reasons for using Imputer over SimpleImputer.
- Imputer takes four parameters as inputs:
- Data_Featurizers: a list of featurizers associated with different feature columns
- Label_Encoders: a list of encoders for the target (label) columns
- Data_Encoders: a list of encoders associated with different feature columns
- Output_Path: the path where the trained model will be stored
- For example, we have a dataset with three different columns: a, b, and c. Based on a and b, we want to fill the missing values of column c. In this case, Imputer will work as follows:
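- Here is a minimal sketch of the same example using the Imputer directly, reusing the toy df_train and df_missing DataFrames from the SimpleImputer sketch above; the choice of encoders and featurizers (text and numerical features, categorical target) is illustrative.

```python
from datawig import Imputer
from datawig.column_encoders import BowEncoder, CategoricalEncoder, NumericalEncoder
from datawig.mxnet_input_symbols import BowFeaturizer, NumericalFeaturizer

# Explicitly choose how each feature column is encoded and featurized,
# and how the target (label) column is encoded.
data_encoders = [BowEncoder('a'), NumericalEncoder('b')]
data_featurizers = [BowFeaturizer('a'), NumericalFeaturizer('b')]
label_encoders = [CategoricalEncoder('c')]

imputer = Imputer(
    data_featurizers=data_featurizers,
    label_encoders=label_encoders,
    data_encoders=data_encoders,
    output_path='imputer_model',
)

imputer.fit(train_df=df_train)             # begin training
predictions = imputer.predict(df_missing)  # impute the missing values of "c"
```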
- After defining the Imputer with the above parameters, we can simply call the “fit” function to begin the training.
- Imputer has several advantages over SimpleImputer:
- More customization of the training process is possible.
- You can tune the parameters while encoding the feature and target columns to get a balance between the training time and the accuracy of the model.
How DataWig helped in Sigmoid’s project with a customer
- Overview of the project:
- We had a dataset with 50 columns, and we had to impute the missing values in 25 of those columns.
- Of those 25 columns, 13 were numerical and 12 were categorical.
- Our approach:
- For each of the target columns (the 25 columns we were imputing), we performed feature selection and then ran DataWig using the Imputer. We chose the Imputer because we could run it over all the target columns in a single loop, which kept the process straightforward.
- After the base model result was available, we continued to tune the model.
- Below are the final results on the target columns.
- Numerical Columns:
- The metric defined by the project owner for the numerical columns was root-mean-square error (RMSE) divided by standard deviation: the ideal score was ≤0.5, and the acceptable score was ≤0.8.
- We computed this metric for each numerical column against its respective training, validation, and testing datasets.
- We were able to achieve an RMSE / standard deviation of <0.5 for the numerical columns (a short sketch of this metric computation follows).
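- Here is a minimal sketch of how this normalized error can be computed; the helper name rmse_over_std is ours and is not part of DataWig.

```python
import numpy as np

def rmse_over_std(y_true, y_pred):
    """Root-mean-square error of the imputed values divided by the standard
    deviation of the true values (lower is better; <=0.5 was the ideal target)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)
```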
- Categorical Columns:
- For the categorical columns, we used accuracy as the key metric. The ideal target was an accuracy of ≥95%, and the acceptable target was an accuracy of ≥85%.
- We could achieve >90% accuracy in predicting categorical data.
Conclusion
It is great if you already have your ML operations (MLOps) pipelines on AWS using Amazon SageMaker. But if not, DataWig from the Amazon Science team is a great choice as an imputation tool, whether you want to solve simple imputation problems or complex, scalable ones, and it delivers results comparable to or even better than other standard practices. If you would like more information about how Sigmoid and AWS help customers in the CPG industry, leave a comment on this blog. To request a demo or to ask any other questions, visit Sigmoid or contact your AWS account team today.
AWS Partner spotlight
Sigmoid delivers actionable intelligence for CPG enterprises. Sigmoid’s CPG analytics solution portfolio is specifically designed to equip CPG decision-makers with targeted consumer insights to drive growth. Sigmoid’s expertise in CPG analytics helps companies build robust data infrastructures that simplify every step of managing big data in the CPG industry. By solving complex analytics use cases, brands can engage effectively with consumers, forecast demand accurately, optimize inventory levels, and take actions based on near-real-time sales data across the ecommerce and retail partners community.