AWS Machine Learning Blog
Detect financial transaction fraud using a Graph Neural Network with Amazon SageMaker
Fraud plagues many online businesses and costs them billions of dollars each year. Financial fraud, counterfeit reviews, bot attacks, account takeovers, and spam are all examples of online fraud and malicious behaviors.
Although many businesses take measures to combat online fraud, these existing approaches can have severe limitations. First, many existing methods aren’t sophisticated or flexible enough to detect the whole spectrum of fraudulent or suspicious online behaviors. Second, fraudsters can evolve and adapt to deceive simple rule-based or feature-based methods. For instance, fraudsters can create multiple coordinated accounts to avoid triggering limits on individual accounts.
However, if we construct a full interaction graph encompassing not only single transaction data but also account information, historical activities, and more, it’s more difficult for fraudsters to conceal their behavior. For example, accounts that are often connected to other fraud-related nodes may indicate guilt by association. We can also combine weak signals from individual nodes to derive stronger signals about that node’s activity. Fraud detection with graphs is effective because we can detect patterns such as node aggregation, which may occur when a particular user starts to connect with many other users or entities, and activity aggregation, which may occur when a large number of suspicious accounts begin to act in tandem.
In this post, we show you how to quickly deploy a financial transaction fraud detection solution with Graph Neural Networks (GNNs) using Amazon SageMaker JumpStart.
Alternatively, if you are looking for a fully managed service to build customized fraud detection models without writing code, we recommend checking out Amazon Fraud Detector. Amazon Fraud Detector enables customers with no machine learning experience to automate building fraud detection models customized for their data, leveraging more than 20 years of fraud detection expertise from Amazon Web Services (AWS) and Amazon.com.
Benefits of Graph Neural Networks
To illustrate why a Graph Neural Network is a great fit for online transaction fraud detection, let’s look at the following example heterogeneous graph constructed from a sample dataset of typical financial transaction data.
A heterogeneous graph contains different types of nodes and edges, which in turn tend to have different types of attributes that are designed to capture characteristics of each node and edge type.
The sample dataset contains not only features of each transaction, such as the purchased product type and transaction amount, but also multiple identity columns that can be used to identify relations between transactions. That information can be used to construct the graph on which a Relational Graph Convolutional Network (R-GCN) is trained. In the preceding example graph, the node types correspond to categorical columns in the sample dataset such as card number, card type, and email domain.
GNNs utilize all the constructed information to learn a hidden representation (embedding) for each transaction. The hidden representation is then used as input to a linear classification layer that determines whether the transaction is fraudulent.
The solution shown in this post uses Amazon SageMaker and the Deep Graph Library (DGL) to construct a heterogeneous graph from tabular data and train an R-GCN model to identify fraudulent transactions.
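To make this concrete, the following is a minimal sketch (not the solution's actual code) of how a heterogeneous graph can be built from transaction edge lists with the DGL. It uses the PyTorch backend for brevity, whereas the solution runs DGL on MXNet; the node types, edge lists, and feature sizes here are illustrative placeholders.

```python
import dgl
import torch

# Hypothetical edge lists: transaction i connects to card j / device j.
graph_data = {
    ("transaction", "uses_card", "card"): (torch.tensor([0, 1, 2]), torch.tensor([0, 0, 1])),
    ("transaction", "uses_device", "device"): (torch.tensor([0, 2]), torch.tensor([1, 1])),
}
g = dgl.heterograph(graph_data)

# Transaction features (e.g., encoded product code, log-transformed amount) become node attributes.
g.nodes["transaction"].data["feat"] = torch.randn(g.num_nodes("transaction"), 2)
print(g)  # summary of node/edge types and counts
```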
Solution overview
At a high level, the solution trains a graph neural network that accepts an interaction graph, as well as features about the users, in order to classify those users as potentially fraudulent or not. This approach ensures that we detect signals present in the user attributes or features, as well as in the connectivity structure and interaction behavior of the users.
This solution employs the following algorithms:
- R-GCN, a state-of-the-art GNN model for heterogeneous graph input
- SageMaker XGBoost, which we use as the baseline model to compare performance
By default, this solution uses synthetic datasets created to mimic typical financial transaction datasets. We demonstrate how to use your own labeled dataset later in this post.
The outputs of the solution are as follows:
- An R-GCN model trained on the input datasets.
- An XGBoost model trained on the input datasets.
- Predictions of the probability that each transaction is fraudulent. If the estimated probability of a transaction exceeds a threshold, it’s classified as fraudulent.
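As a purely illustrative sketch, the thresholding step could look like the following; the 0.5 cutoff and the probability values are placeholders.

```python
import numpy as np

# Illustrative only: convert predicted fraud probabilities into binary labels with a chosen threshold.
pred_probs = np.array([0.03, 0.92, 0.40])
threshold = 0.5
is_fraud = (pred_probs > threshold).astype(int)  # -> array([0, 1, 0])
```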
In this solution, we focus on the SageMaker components, which include two main parts:
- The first part uses SageMaker Processing to do feature engineering and extract edge lists from a table of transactions or interactions
- The second part uses SageMaker training infrastructure to train the models for fraud detection, including an XGBoost baseline model, and a GNN model with the DGL
The following diagram illustrates the solution architecture.
Prerequisites
To try out the solution in your own account, make sure that you have the following in place:
- You need an AWS account to use this solution. If you don’t have an account, you can sign up for one.
- The solution outlined in this post is part of JumpStart. To run this JumpStart 1P Solution and have the infrastructure deployed to your AWS account, you need to create an active Amazon SageMaker Studio instance (see Onboard to Amazon SageMaker Domain).
When the Studio instance is ready, you can launch Studio and access JumpStart. JumpStart solutions are not available in SageMaker notebook instances, and you can’t access them through SageMaker APIs or the AWS Command Line Interface (AWS CLI).
Launch the solution
To launch the solution, complete the following steps:
- Open JumpStart by using the JumpStart launcher in the Get Started section or by choosing the JumpStart icon in the left sidebar.
- Under Solutions, choose Fraud Detection in Financial Transactions to open the solution in another Studio tab.
- In the solution tab, choose Launch to launch the solution.
The solution resources are provisioned and another tab opens showing the deployment progress. When the deployment is finished, an Open Notebook button appears.
- Choose Open Notebook to open the solution notebook in Studio.
Explore the default dataset
The default dataset used in this solution is a synthetic dataset created to mimic the financial transaction datasets that many companies have. The dataset consists of two tables:
- Transactions – Records transactions and metadata about transactions between two users. Examples of columns include the product code for the transaction, features of the card used for the transaction, and a column indicating whether the transaction is fraudulent.
- Identity – Contains identity information about the users performing transactions. Examples of columns include the device type and device IDs used.
The two tables can be joined together using the unique identifier column TransactionID. The following screenshot shows the first five observations of the Transactions dataset.
The following screenshot shows the first five observations of the Identity dataset.
The following screenshot shows the joined dataset.
Besides the unique identifier column (TransactionID) used to identify each transaction, there are two types of predicting columns and one target column:
- Identity columns – These contain identity information related to a transaction, including card_no, card_type, email_domain, IpAddress, PhoneNo, and DeviceID
- Categorical or numerical columns – These describe the features of each transaction, including ProductCD and TransactionAmt
- Target column – The isFraud column indicates whether the corresponding transaction is fraudulent or not
The goal is to fully utilize the information in the predicting columns to classify each transaction (each row in the table) as either fraudulent or legitimate.
Upload the raw data to Amazon S3
The solution notebook contains code that downloads the default synthetic datasets and uploads them to the input Amazon Simple Storage Service (Amazon S3) bucket provisioned by the solution.
To use your own labeled datasets, before running the code cell in the Upload raw data to S3 notebook section, edit the value of the variable raw_data_location so that it points to the location of your own input data.
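For example, uploading your own files and repointing the variable might look like the following sketch; the bucket, prefix, and file names are placeholders rather than the solution's defaults.

```python
import boto3

# Hypothetical example: upload your own labeled CSV files to an S3 location,
# then point raw_data_location at that prefix in the notebook.
s3 = boto3.client("s3")
bucket = "your-input-bucket"   # placeholder bucket name
prefix = "raw-data"
for fname in ["transaction.csv", "identity.csv"]:   # placeholder file names
    s3.upload_file(fname, bucket, f"{prefix}/{fname}")

raw_data_location = f"s3://{bucket}/{prefix}/"
```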
In the Data Visualization notebook section, you can run the code cells to visualize and explore the input data as tables.
If you’re using your own datasets with different data file names or table columns, remember to also update the data exploration code accordingly.
Data preprocessing and feature engineering
The solution provides a data preprocessing and feature engineering Python script, data-preprocessing/graph_data_preprocessor.py. This script serves as a general processing framework to convert a relational table to heterogeneous graph edge lists based on the column types of the relational table. Some of the data transformation and feature engineering techniques include:
- Performing numerical encoding for categorical variables and logarithmic transformation for transaction amount
- Constructing graph edge lists between transactions and other entities for the various relation types
All the columns in the relational table are classified into one of the following three types for data transformation:
- Identity columns – Columns that contain identity information related to a user or a transaction, such as IP address, phone number, and device identifiers. These column types become node types in the heterogeneous graph, and the entries in these columns become the nodes.
- Categorical columns – Columns that correspond to categorical features such as a user’s age group, or whether or not a provided address matches an address on file. The entries in these columns undergo numerical feature transformation and are used as node attributes in the heterogeneous graph.
- Numerical columns – Columns that correspond to numerical features such as how many times a user has tried a transaction. The entries here are also used as node attributes in the heterogeneous graph. The script assumes that all columns in the tables that aren’t identity columns or categorical columns are numerical columns.
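The following is a simplified sketch of these transformations, not the solution's exact script; the column names follow the default synthetic dataset, and the output file name is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("transaction.csv")

# Logarithmic transformation of the transaction amount (numerical column)
df["TransactionAmt"] = np.log1p(df["TransactionAmt"])

# Numerical encoding for a categorical column
df["ProductCD"] = df["ProductCD"].astype("category").cat.codes

# Bipartite edge list between transactions and one identity column (card_no becomes a node type)
card_edges = df[["TransactionID", "card_no"]].dropna()
card_edges.to_csv("relation_card_no_edgelist.csv", index=False)  # illustrative output name
```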
The names of the identity columns and categorical columns need to be provided as command line arguments when running the Python script (--id-cols for identity column names and --cat-cols for category column names).
If you’re using your own data and your data is in the same format as the default synthetic dataset but with different column names, you simply need to adapt the Python arguments in the notebook code cell according to your dataset’s column names. However, if your data is in a different format, you need to modify the corresponding data processing section in the data-preprocessing/graph_data_preprocessor.py Python script.
We divide the dataset into training (70% of the entire data), validation (20%), and test (10%) datasets. The validation dataset is used for hyperparameter optimization (HPO) to select the optimal set of hyperparameters. The test dataset is used for the final evaluation to compare various models. If you need to adjust these ratios, you can use the command line arguments --train-data-ratio and --valid-data-ratio when running the preprocessing Python script.
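A hedged sketch of launching the preprocessing script as a SageMaker Processing job follows; the S3 paths, instance type, and exact argument format are assumptions, so check the solution notebook for the values it actually uses.

```python
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = sagemaker.get_execution_role()
processor = SKLearnProcessor(framework_version="0.23-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)

processor.run(
    code="data-preprocessing/graph_data_preprocessor.py",
    inputs=[ProcessingInput(source="s3://your-bucket/raw-data/",              # placeholder
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://your-bucket/dgl-fraud-detection/preprocessed-data/")],
    arguments=["--id-cols", "card_no,card_type,email_domain,IpAddress,PhoneNo,DeviceID",
               "--cat-cols", "ProductCD",
               "--train-data-ratio", "0.7",
               "--valid-data-ratio", "0.2"],
)
```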
When the preprocessing job is complete, we have a set of bipartite edge lists between transactions and different device ID types (assuming we’re using the default dataset), as well as the features, labels, and a set of transactions to validate our graph model performance. You can find the transformed data in the S3 bucket created by the solution, under the dgl-fraud-detection/preprocessed-data folder.
Train an XGBoost baseline model with HPO
Before diving into training a graph neural network with the DGL, we first train an XGBoost model with HPO as the baseline on the transaction table data.
- Read the data from features_xgboost.csv and upload the data to Amazon S3 for training the baseline model. This CSV file was generated in the data preprocessing and feature engineering job in the previous step. Only the categorical columns ProductCD and card_type and the numerical column TransactionAmt are included.
- Create an XGBoost estimator with the SageMaker XGBoost algorithm container (a sketch of the estimator and HPO setup follows this list).
- Create and fit an XGBoost estimator with HPO:
  - Specify the dynamic hyperparameters we want to tune and their search ranges.
  - Define the optimization objective in terms of metric and objective type.
  - Create hyperparameter tuning jobs to train the model.
- Deploy the endpoint of the best tuning job and make predictions with the baseline model.
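The following is a hedged sketch of what the estimator and tuning setup can look like with the SageMaker Python SDK; the S3 paths, hyperparameter ranges, instance types, and job counts are placeholders rather than the solution's exact values.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in SageMaker XGBoost algorithm container
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",                   # placeholder instance type
    output_path="s3://your-bucket/xgb-output/",     # placeholder output location
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", eval_metric="auc", num_round=100)

# HPO: tune learning rate and tree depth, maximizing validation AUC
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.1, 0.5),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({
    "train": TrainingInput("s3://your-bucket/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://your-bucket/validation.csv", content_type="text/csv"),
})

# Deploy an endpoint from the best tuning job and make predictions with the baseline model
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```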
Train the Graph Neural Network using the DGL with HPO
Graph Neural Networks work by learning representations for the nodes or edges of a graph that are well suited for some downstream task. We can model the fraud detection problem as a node classification task, and the goal of the GNN is to learn how to use information from the topology of the sub-graph around each transaction node to transform the node’s features into a representation space where the node can be easily classified as fraudulent or not.
Specifically, we use a relational graph convolutional network (R-GCN) model on a heterogeneous graph because we have nodes and edges of different types.
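As an illustration, a minimal R-GCN-style encoder over a DGL heterogeneous graph could look like the following sketch. It uses DGL's HeteroGraphConv with the PyTorch backend for brevity (the solution's training script uses MXNet), and the relation names, feature sizes, and aggregation choice are assumptions rather than the solution's actual architecture.

```python
import torch
import torch.nn as nn
import dgl.nn as dglnn

class RGCN(nn.Module):
    """Minimal R-GCN-style encoder plus a linear classifier for transaction nodes."""

    def __init__(self, in_feats, hid_feats, n_classes, rel_names):
        super().__init__()
        # One GraphConv per relation type; messages are summed across relations.
        self.conv1 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(in_feats, hid_feats) for rel in rel_names}, aggregate="sum")
        self.conv2 = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(hid_feats, hid_feats) for rel in rel_names}, aggregate="sum")
        self.classify = nn.Linear(hid_feats, n_classes)  # linear layer on transaction embeddings

    def forward(self, g, inputs):
        # `inputs` maps each node type to a (num_nodes, in_feats) feature matrix; the graph is
        # assumed to contain reverse edges (e.g., card -> transaction) so transactions receive messages.
        h = {ntype: torch.relu(feat) for ntype, feat in self.conv1(g, inputs).items()}
        h = self.conv2(g, h)
        return self.classify(h["transaction"])
```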
- Define hyperparameters to determine properties such as the class of GNN models, the network architecture, the optimizer, and optimization parameters.
- Create and train the R-GCN model.
For this post, we use the DGL, with MXNet as the backend deep learning framework. We create a SageMaker MXNet estimator and pass in our model training script (sagemaker_graph_fraud_detection/dgl_fraud_detection/train_dgl_mxnet_entry_point.py), the hyperparameters, as well as the number and type of training instances we want to use (a sketch of this setup follows this list). When the training is complete, the trained model and prediction results on the test data are uploaded to Amazon S3.
- Optionally, you can inspect the prediction results and compare the model metrics with the baseline XGBoost model.
- Create and fit a SageMaker estimator using the DGL with HPO:
  - Specify the dynamic hyperparameters we want to tune and their search ranges.
  - Define the optimization objective in terms of metric and objective type.
  - Create hyperparameter tuning jobs to train the model.
- Read the prediction output for the test dataset from the best tuning job.
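The following sketch shows what creating and fitting the estimator might look like; the execution role, instance type, framework version, hyperparameter names and values, and the S3 location of the preprocessed data are placeholders, so refer to the solution notebook for the exact configuration.

```python
import sagemaker
from sagemaker.mxnet import MXNet

role = sagemaker.get_execution_role()

# Hedged sketch: an MXNet estimator wrapping the DGL training script.
estimator = MXNet(
    entry_point="train_dgl_mxnet_entry_point.py",
    source_dir="sagemaker_graph_fraud_detection/dgl_fraud_detection",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # placeholder GPU instance
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters={"n-epochs": 100, "n-hidden": 16, "lr": 1e-2},  # illustrative names and values
)

# The training channel points at the preprocessed edge lists, features, and labels in S3.
estimator.fit({"train": "s3://your-bucket/dgl-fraud-detection/preprocessed-data/"})
```

Wrapping this estimator in a HyperparameterTuner follows the same pattern shown earlier for the XGBoost baseline.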
Clean up
When you’re finished with this solution, make sure that you delete all unwanted AWS resources to avoid incurring unintended charges. In the Delete solution section on your solution tab, choose Delete all resources to delete resources automatically created when launching this solution.
Alternatively, you can use AWS CloudFormation to delete all standard resources automatically created by the solution and notebook. To use this approach, on the AWS CloudFormation console, find the CloudFormation stack whose description contains sagemaker-graph-fraud-detection, and delete it. This is a parent stack; deleting this stack automatically deletes the nested stacks.
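If you prefer to script the cleanup, a hedged sketch using boto3 follows; the description filter is an assumption, so verify the actual stack on the CloudFormation console before deleting anything.

```python
import boto3

cfn = boto3.client("cloudformation")

# Find the parent stack whose description mentions the solution, then delete it.
for stack in cfn.describe_stacks()["Stacks"]:
    is_nested = "ParentId" in stack
    if not is_nested and "sagemaker-graph-fraud-detection" in (stack.get("Description") or ""):
        cfn.delete_stack(StackName=stack["StackName"])  # nested stacks are deleted automatically
```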
With either approach, you still need to manually delete any extra resources that you may have created in this notebook. Some examples include extra S3 buckets (in addition to the solution’s default bucket), extra SageMaker endpoints (using a custom name), and extra Amazon Elastic Container Registry (Amazon ECR) repositories.
Conclusion
In this post, we discussed the business problem caused by online transaction fraud, the limitations of traditional fraud detection approaches, and why a GNN is a good fit for solving this business problem. We showed you how to build an end-to-end solution for detecting fraud in financial transactions using a GNN with SageMaker and a JumpStart solution. We also explained other capabilities of the solution, such as using your own dataset, using SageMaker algorithm containers, and using HPO to automate hyperparameter tuning and find the best tuning job to make predictions.
To learn more about this JumpStart solution, check out the solution’s GitHub repository.
About the Authors
Xiaoli Shen is a Solutions Architect and Machine Learning Technical Field Community (TFC) member at Amazon Web Services. She’s focused on helping customers architect on the cloud and leverage AWS services to derive business value. Prior to joining AWS, she was a senior full-stack engineer building large-scale data-intensive distributed systems on the cloud. Outside of work she’s passionate about volunteering in technical communities, traveling the world, and making music.
Dr. Xin Huang is an applied scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering.
Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using science to lead a meaningful life, and exploring delicious vegetarian cuisine from around the world.