AWS Open Source Blog
Getting started with Feast, an open source feature store running on AWS Managed Services
This post was written by Willem Pienaar, Principal Engineer at Tecton and creator of Feast.
Feast is an open source feature store and a fast, convenient way to serve machine learning (ML) features for training and online inference. Feast lets you build point-in-time correct training datasets from feature data, allows you to deploy a production-grade feature serving stack to Amazon Web Services (AWS) in seconds, and simplifies tracking which features your models use.
Why Feast?
Most ML teams today are well versed in shipping machine learning models into production, but deployment is only a small part of the MLOps lifecycle. Most teams don’t have a declarative way to ship data into production for consumption by machine learning models. That’s where Feast helps.
- Tracking and sharing features: Feast allows teams to define and track feature metadata (such as data sources, entities, and features) through declarative definitions that are version controlled in Git. This allows teams to maintain a versioned history of operationalized features, helping teams understand how features are performing in production, and enabling reuse and sharing of features across teams.
- Managed serving infrastructure: Feast takes the work out of setting up data infrastructure: it configures your stores for feature serving, makes populating them with feature values easy, and provides an SDK for reading feature values from these stores at low latency.
- A consistent view of data: Machine learning models need to see the same view of features in training that they will see in production. Feast ensures this consistency through time-travel-based training dataset generation and a unified serving interface, so your models see a consistent view of features during both training and inference.
Feast on AWS
With the latest release of Feast, you can take advantage of AWS storage services to run an open source feature store:
- Amazon Redshift and Amazon Simple Storage Service (Amazon S3) can be used as an offline store, which supports feature serving for training and batch inference of large amounts of feature data.
- Amazon DynamoDB, a NoSQL key-value database, can be used as an online store. Amazon DynamoDB supports feature serving at low latency for real-time prediction.
Use case: Real-time credit scoring
When individuals apply for loans from banks and other credit providers, the decision to approve a loan application is made through a statistical model. Often, this model uses information about a customer to determine the likelihood that they will repay or default on a loan. This process is called credit scoring.
For this use case, we will demonstrate how a real-time credit scoring system can be built using Feast and scikit-learn.
This real-time system is required to accept a loan request from a customer and respond within 100 ms with a decision on whether their loan has been approved or rejected.
A fully working demo repository for this use case is available on GitHub.
Data model
We have three datasets at our disposal to build this credit scoring system.
The first is a loan dataset. This dataset has features based on historic loans for current customers. Importantly, this dataset contains the target column, `loan_status`, which denotes whether a customer has defaulted on a loan.
| Column | Description | Sample |
| --- | --- | --- |
| `loan_id` | Unique id for the loan | 12208 |
| `dob_ssn` | Date of birth joined to SSN | 19790429_9552 |
| `zipcode` | Zip code of the customer | 30721 |
| `person_age` | Age of customer | 24 |
| `person_income` | Yearly income of the customer | 30000 |
| `person_home_ownership` | Home ownership class for customer | RENT |
| `person_emp_length` | How long the customer has been employed (months) | 2.0 |
| `loan_intent` | Reason for taking out loan | EDUCATION |
| `loan_amnt` | Loan amount | 3000 |
| `loan_int_rate` | Loan interest rate | 5.2 |
| `loan_status` | Status of loan | 0 |
| `event_timestamp` | When the loan was issued or updated | 2021-07-28 17:09:19 |
| `created_timestamp` | When this record was written to storage | 2021-07-28 17:09:19 |
The second dataset we will use is a zip code dataset. This dataset is used to enrich the loan dataset with supplementary features about a specific geographic location.
| Column | Description | Sample |
| --- | --- | --- |
| `zipcode` | Zip code to which features relate | 94546 |
| `city` | City to which features relate | CASTRO VALLEY |
| `state` | State to which features relate | CA |
| `tax_returns_filed` | Number of tax returns filed in this zip code | 20616 |
| `population` | Total population of this zip code | 35351 |
| `wages` | Combined yearly earnings for all individuals in this zip code | 987939047 |
| `event_timestamp` | When the zipcode features were collected | 2017-01-01 12:00:00 |
| `created_timestamp` | When this record was written to storage | 2017-01-01 12:00:00 |
The third and final dataset is a credit history dataset, which contains per-person credit history and is updated frequently by the credit institution. Every time a credit check is done on an individual, this dataset is updated.
| Column | Description | Sample |
| --- | --- | --- |
| `dob_ssn` | Date of birth joined to SSN | 19530219_5179 |
| `credit_card_due` | How much this person owes on their credit cards | 0 |
| `mortgage_due` | How much this person owes on their mortgages | 91803 |
| `student_loan_due` | How much this person owes on their student loans | 0 |
| `vehicle_loan_due` | How much this person owes on their vehicle loans | 0 |
| `hard_pulls` | How many hard credit checks this person has had | 1 |
| `missed_payments_2y` | How many missed payments this person has had in the last 2 years | 1 |
| `missed_payments_1y` | How many missed payments this person has had in the last year | 0 |
| `missed_payments_6m` | How many missed payments this person has had in the last 6 months | 0 |
| `bankruptcies` | How many bankruptcies this person has had | 0 |
| `event_timestamp` | When the credit check was executed | 2017-01-01 12:00:00 |
| `created_timestamp` | When this record was written to storage | 2017-01-01 12:00:00 |
The preceding loan, zip code, and credit history features will be combined into a single training dataset when building a credit-scoring model. However, historic loan data is not useful for making predictions for new customers. Therefore, we will register and serve only the zip code and credit history features with Feast, and we will assume that the incoming request contains the loan application features.
An example of the loan application payload is as follows:
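The values below are illustrative (reusing the sample rows from the tables above); the payload carries the loan application features along with the `zipcode` and `dob_ssn` join keys:

```python
# Illustrative loan application payload (sample values only)
loan_request = {
    "zipcode": 30721,
    "dob_ssn": "19790429_9552",
    "person_age": 24,
    "person_income": 30000,
    "person_home_ownership": "RENT",
    "person_emp_length": 2.0,
    "loan_intent": "EDUCATION",
    "loan_amnt": 3000,
    "loan_int_rate": 5.2,
}
```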
Amazon S3 and Redshift as a data source and offline store
A Redshift data source allows you to fetch historical feature values from Redshift for building training datasets and materializing features into an online store.
Install Feast using pip:
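The `aws` extra pulls in the Redshift and DynamoDB dependencies:

```bash
pip install 'feast[aws]'
```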
Initialize a blank feature repository:
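For example (using the `credit_scoring` project name assumed throughout this post):

```bash
feast init credit_scoring
cd credit_scoring
```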
This command will create a feature repository for your project. Let’s edit our feature store configuration using the provided `feature_store.yaml`:
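A configuration along these lines selects Redshift as the offline store and DynamoDB as the online store; the cluster, database, bucket, and IAM role values are placeholders for your own resources:

```yaml
project: credit_scoring
registry: registry.db
provider: aws
online_store:
  type: dynamodb
  region: us-west-2
offline_store:
  type: redshift
  region: us-west-2
  cluster_id: my-feast-cluster                               # placeholder
  database: dev                                              # placeholder
  user: admin                                                # placeholder
  s3_staging_location: s3://my-feast-bucket/staging          # placeholder
  iam_role: arn:aws:iam::123456789012:role/redshift-s3-role  # placeholder
```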
A data source is defined as part of the Feast Declarative API in the feature repo directory’s Python files. Now that we’ve configured our infrastructure, let’s register the zip code and credit history features we will use during training and serving.
Create a file called `features.py` within the `credit_scoring/` directory. Then add the following feature definitions to `features.py`:
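A sketch of these definitions, with the feature names taken from the tables above (the Redshift table names in the queries are assumptions for this demo):

```python
from datetime import timedelta

from feast import Entity, Feature, FeatureView, RedshiftSource, ValueType

# Join keys used to look up feature values
zipcode = Entity(name="zipcode", value_type=ValueType.INT64)
dob_ssn = Entity(name="dob_ssn", value_type=ValueType.STRING)

zipcode_features = FeatureView(
    name="zipcode_features",
    entities=["zipcode"],
    ttl=timedelta(days=3650),
    features=[
        Feature(name="city", dtype=ValueType.STRING),
        Feature(name="state", dtype=ValueType.STRING),
        Feature(name="tax_returns_filed", dtype=ValueType.INT64),
        Feature(name="population", dtype=ValueType.INT64),
        Feature(name="wages", dtype=ValueType.INT64),
    ],
    batch_source=RedshiftSource(
        query="SELECT * FROM spectrum.zipcode_features",  # assumed table name
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created_timestamp",
    ),
)

credit_history = FeatureView(
    name="credit_history",
    entities=["dob_ssn"],
    ttl=timedelta(days=90),
    features=[
        Feature(name="credit_card_due", dtype=ValueType.INT64),
        Feature(name="mortgage_due", dtype=ValueType.INT64),
        Feature(name="student_loan_due", dtype=ValueType.INT64),
        Feature(name="vehicle_loan_due", dtype=ValueType.INT64),
        Feature(name="hard_pulls", dtype=ValueType.INT64),
        Feature(name="missed_payments_2y", dtype=ValueType.INT64),
        Feature(name="missed_payments_1y", dtype=ValueType.INT64),
        Feature(name="missed_payments_6m", dtype=ValueType.INT64),
        Feature(name="bankruptcies", dtype=ValueType.INT64),
    ],
    batch_source=RedshiftSource(
        query="SELECT * FROM spectrum.credit_history",  # assumed table name
        event_timestamp_column="event_timestamp",
        created_timestamp_column="created_timestamp",
    ),
)
```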
Feature views allow users to register data sources in their organizations into Feast, and then use those data sources for both training and online inference. The preceding feature view definition tells Feast where to find zip code and credit history features.
Now that we have defined our first feature view, we can apply the changes to create our feature registry and configure our infrastructure:
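From inside the feature repository directory:

```bash
feast apply
```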
The preceding `feast apply` command will:
- Store all entity and feature view definitions in a local file called `registry.db`.
- Create an empty DynamoDB table for serving zip code and credit history features.
- Ensure that your data sources on Redshift are available.
Building a training dataset
Our loan dataset contains our target variable, so we will load that first:
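A minimal sketch, assuming the loan data is available locally as a Parquet file (the path is an assumption):

```python
import pandas as pd

# Historical loans, including the loan_status target, the zipcode and
# dob_ssn join keys, and an event_timestamp column.
loans = pd.read_parquet("data/loan_table.parquet")
```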
But this dataset does not contain all the features we need in order to make an accurate scoring prediction. We also must join our zip code and credit history features, and we need to do so in a point-in-time correct way.
First, we create a feature store object from our feature repository:
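```python
from feast import FeatureStore

# Point at the feature repository created earlier (path is an assumption).
fs = FeatureStore(repo_path="credit_scoring/")
```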
Then we identify the features we want to query from Feast:
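These feature references follow Feast’s `<feature_view>:<feature>` naming and match the views defined earlier:

```python
feast_features = [
    "zipcode_features:city",
    "zipcode_features:state",
    "zipcode_features:tax_returns_filed",
    "zipcode_features:population",
    "zipcode_features:wages",
    "credit_history:credit_card_due",
    "credit_history:mortgage_due",
    "credit_history:student_loan_due",
    "credit_history:vehicle_loan_due",
    "credit_history:hard_pulls",
    "credit_history:missed_payments_2y",
    "credit_history:missed_payments_1y",
    "credit_history:missed_payments_6m",
    "credit_history:bankruptcies",
]
```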
Then we make a query from Feast to enrich our loan dataset. Feast will automatically detect the `zipcode` and `dob_ssn` join columns and join the feature data in a point-in-time correct way: it only joins feature values that were available at the time the loan was active.
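A sketch of the retrieval call (on older Feast releases the argument is named `feature_refs` rather than `features`):

```python
# Point-in-time join of zip code and credit history features onto the loans.
training_df = fs.get_historical_features(
    entity_df=loans,
    features=feast_features,
).to_df()
```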
Once we have retrieved the complete training dataset, we can:
- Drop the `timestamp` columns and the `loan_id` column.
- Encode categorical features.
- Split the training dataframe into a train, validation, and test set.
Finally, we can train our classifier:
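A minimal sketch with scikit-learn, assuming `train_X` and `train_y` were produced by the encoding and splitting steps above:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(train_X, train_y)
```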
The full model training code is on GitHub.
DynamoDB as an online store
Before we can make online loan predictions with our credit scoring model, we must populate our online store with feature values. To load features into the online store, we use the `feast materialize-incremental` command:
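```bash
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```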
This command loads features from our zip code and credit history data sources up to `$CURRENT_TIME`. It can be run repeatedly as more data becomes available in order to keep the online store fresh.
Fetching a feature vector at low latency
Now we have everything we need to make a loan prediction.
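A sketch of the serving path: fetch the applicant’s precomputed features from DynamoDB through Feast, merge them with the incoming request, and score (the encoding step and the `loan_status == 1` default semantics are assumptions consistent with the training code above):

```python
import pandas as pd

# Fetch zip code and credit history features for this applicant from the
# online store (DynamoDB) at low latency.
online_features = fs.get_online_features(
    entity_rows=[{
        "zipcode": loan_request["zipcode"],
        "dob_ssn": loan_request["dob_ssn"],
    }],
    features=feast_features,  # `feature_refs` on older Feast releases
).to_dict()

# Merge the online feature values with the loan application payload.
features = loan_request.copy()
features.update({name: values[0] for name, values in online_features.items()})

# Encode and order the columns exactly as at training time, then score.
df = pd.get_dummies(pd.DataFrame([features]))
df = df.reindex(columns=train_X.columns, fill_value=0)
prediction = clf.predict(df)[0]
print("rejected" if prediction == 1 else "approved")
```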
Conclusion
That’s it! We have a functional real-time credit scoring system.
Check out the Feast GitHub repository for the latest features, such as on-demand transformations, Feast server deployment to AWS Lambda, and support for streaming sources.
The complete end-to-end real-time credit scoring system is available on GitHub. Feel free to deploy it and try it out.
If you want to participate in the Feast community, join us on Slack, or read the Feast documentation to get a better understanding of how to use Feast.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.