AWS Architecture Blog
Training a call center fraud detection model for IVR calls with Amazon SageMaker Canvas
Fraud detection is a critical challenge for call centers: they need to provide a seamless customer experience while protecting the organization from fraudulent activity. Traditionally, call centers have relied on agents to manually screen calls, which can be time-consuming and expensive. Alternatively, companies might force customers to authenticate themselves every time they call, leading to a poor user experience. Machine learning (ML) offers a powerful solution that can help organizations strike a balance between these approaches, enabling efficient and accurate fraud detection without compromising the customer experience.
This blog post will show you how to use the power of ML to build a fraud-detection model using Amazon SageMaker Canvas, a no-code/low-code ML service that business analysts and domain experts can use to build, train, and deploy ML models without requiring extensive ML expertise.
Background
In this solution, you will use a contact trace record (CTR) dataset from Amazon Connect. The solution also works with data from other inbound telephony services, provided it contains call-specific metadata. Importantly, each call has already been labeled based on the call center's past fraud detection efforts.
To start, you will enrich the raw CTR data using the phone number validation service from Amazon Pinpoint. Then, you will prepare the data using Amazon SageMaker notebooks and train a fraud detection model using SageMaker Canvas. Finally, to understand how to provide a scalable and cost-effective solution, you will explore how to deploy the model to a SageMaker endpoint and right-size it through autoscaling or a managed, serverless environment.
The following figure shows the architecture with the complete process, including data enrichment, merging, cleanup, and model training and deployment. Throughout this blog post we will reference the key parts as we implement them.
Data enrichment and preparation
Each CTR contains information about an incoming call such as agent, connection attempts, and channel. Before proceeding, you must transform the data into a tabular format (CSV, Parquet, or database tables), the formats supported by SageMaker Canvas.
The raw CTR dataset doesn’t contain enough information to effectively train an ML model. You will enrich it using the Amazon Pinpoint validate API, which provides additional fields such as carrier, location data, and phone type. After you have the enriched dataset, you can use a SageMaker notebook to clean and prepare the data for training.
Enriching the data with Amazon Pinpoint validate API
Amazon Pinpoint includes a phone number validation service that you can use to determine whether a phone number is valid and to obtain additional contact information. For a valid mobile number, the API response includes details such as the carrier, location, and phone type, while the response for an invalid number flags the phone type as invalid.
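The following sketch shows representative response shapes as Python dictionaries. The field names follow the Pinpoint `NumberValidateResponse` structure, but the values are illustrative placeholders, not taken from a real call:

```python
# Representative Amazon Pinpoint NumberValidateResponse payloads.
# All values below are illustrative placeholders.
valid_response = {
    "NumberValidateResponse": {
        "Carrier": "ExampleCarrier",              # illustrative carrier name
        "City": "Seattle",
        "CleansedPhoneNumberE164": "+12065550100",
        "Country": "United States",
        "CountryCodeIso2": "US",
        "PhoneType": "MOBILE",                    # e.g. MOBILE, LANDLINE, VOIP
        "ZipCode": "98101",
    }
}

invalid_response = {
    "NumberValidateResponse": {
        "CleansedPhoneNumberE164": "+15550100",
        "PhoneType": "INVALID",                   # flags an unusable number
    }
}
```

Fields such as `PhoneType` and `Carrier` are the ones that add predictive value for the fraud model later in this post.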
There are multiple ways to enrich your dataset with this API. The goal is to extract the caller's phone number from each row in the dataset (the CustomerEndpoint field in Amazon Connect) and run it through the Amazon Pinpoint validate API. Each response is a JSON document; you will need to flatten its fields into the appropriate format and merge them with the original CTR dataset. The result is an enriched dataset that includes all the fields from the CTR alongside their corresponding validation results.
The following figure shows an example architecture that uses AWS Lambda functions to enrich the data with Amazon Pinpoint, cache responses in Amazon DynamoDB, and merge the datasets. The DynamoDB cache stores Pinpoint responses keyed by the input phone number so that you don't have to re-validate numbers you have already seen.
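A minimal sketch of this caching pattern is shown below. The Pinpoint client and DynamoDB table are passed in as dependencies, and the function name and `PhoneNumber` cache key are assumptions for illustration, not prescribed by the architecture:

```python
def validate_with_cache(phone_number, pinpoint, cache_table, iso_country="US"):
    """Return the Pinpoint validation result for phone_number, consulting a
    DynamoDB cache table first so each number is only validated once."""
    # Check the cache: the table is keyed by the raw input phone number.
    cached = cache_table.get_item(Key={"PhoneNumber": phone_number}).get("Item")
    if cached:
        return cached["Response"]

    # Cache miss: call the Pinpoint phone number validation API.
    result = pinpoint.phone_number_validate(
        NumberValidateRequest={
            "IsoCountryCode": iso_country,
            "PhoneNumber": phone_number,
        }
    )["NumberValidateResponse"]

    # Store the response so later calls for the same number skip the API.
    cache_table.put_item(Item={"PhoneNumber": phone_number, "Response": result})
    return result
```

Inside the Lambda function you would construct the dependencies with boto3, for example `pinpoint = boto3.client("pinpoint")` and `cache_table = boto3.resource("dynamodb").Table("phone-cache")` (the table name here is hypothetical).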
Data engineering and data preparation
The first step is to clean up the data by discarding fields that don't add predictive value. This makes training faster and the model more accurate. For example, in the CTR data, the AWSAccountId field is the same for all records and the ContactId field is unique to each record, so you can discard them both. The goal is to simplify the dataset as much as possible, then use SageMaker Canvas to run multiple experiments and learn which fields have the biggest impact on the prediction.
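This pruning step can be sketched in a SageMaker notebook with pandas. The helper name is an assumption; AWSAccountId and ContactId come from the CTR schema:

```python
import pandas as pd

def drop_uninformative_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns with no predictive value: constant columns (a single
    unique value, like AWSAccountId) and identifier columns that are unique
    per row (like ContactId)."""
    to_drop = [
        col for col in df.columns
        if df[col].nunique(dropna=False) <= 1          # constant for all records
        or df[col].nunique(dropna=False) == len(df)    # unique per record
    ]
    return df.drop(columns=to_drop)
```

On a real dataset, review the dropped list before committing the change: high-cardinality numeric features can legitimately be unique per row and should be kept.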
After you have reduced the dataset, you can select a random, smaller subset of data (for example, 1,000 records) and build a model in SageMaker Canvas. SageMaker Canvas provides a visual interface that you can use to rapidly build, train, and deploy ML models without the need for extensive coding.
To begin the training process, you will first import the prepared, enriched data file containing the CTR data into SageMaker Canvas. SageMaker Canvas will automatically detect the data format and structure, allowing you to preview the dataset and make any necessary adjustments before proceeding.
After the dataset is ready, you can select the fraud label as your target variable and configure the model training parameters. SageMaker Canvas will handle the underlying ML algorithms and hyperparameter tuning, streamlining the model development process.
After the model is trained, SageMaker Canvas will show an analysis of the fields and their weights. In addition to the build step—where you can drop and transform your imported dataset—SageMaker Canvas also includes Amazon SageMaker Data Wrangler, which you can use to prepare, featurize, and analyze your data. This allows you to do additional transformations as you experiment and iterate. Continue running these small experiments, adjusting the training dataset each time, with the goal of maximizing model performance while reducing training time and inference latency.
Through this experimentation, you will identify the most important features and the transformations your data needs. When you're confident in your model's performance, you can decide how to deploy the trained model.
Final model deployment and right-sizing
Once your model is trained, you need to deploy an endpoint so that you can invoke the model and consume its results through an API. For example, the following diagram shows a Lambda function calling the deployed endpoint. The solution continues to use the DynamoDB table as a cache to avoid reprocessing phone numbers.
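A minimal sketch of the Lambda-side call is shown below, assuming the endpoint expects a CSV-serialized feature row (the function name, endpoint name, and feature values used later are illustrative assumptions):

```python
def score_call(features_csv, endpoint_name, runtime=None):
    """Send one CSV-encoded feature row to a deployed SageMaker endpoint
    and return the raw prediction payload as a string."""
    if runtime is None:
        # Created lazily so tests can inject a stub client instead.
        import boto3
        runtime = boto3.client("sagemaker-runtime")

    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=features_csv,
    )
    # The Body is a streaming object; read and decode it.
    return response["Body"].read().decode("utf-8")
```

Before invoking the endpoint, the Lambda function would check the DynamoDB cache as in the enrichment step, and only score numbers it hasn't seen before.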
SageMaker Canvas offers a default deployment option in which you choose the instance type and the number of instances that will host the trained model. To help ensure these instances aren't over- or under-provisioned, SageMaker provides several features for optimizing the deployment to your specific needs.
- Autoscaling: Autoscaling dynamically adjusts the number of instances provisioned for a model in response to workload changes. The steps to make this change are described in detail in Configure model autoscaling. SageMaker recently introduced Scale Down to Zero for AI inference, which allows endpoints to scale to zero instances during periods of inactivity, helping you save costs.
- Inference recommender: Amazon SageMaker Inference Recommender reduces the time required to get ML models into production by automating load testing and model tuning across SageMaker ML instances.
- Serverless deployment: Amazon SageMaker Serverless Inference is a purpose-built inference option that you can use to deploy and scale ML models without configuring or managing any of the underlying infrastructure. Compute resources scale automatically depending on traffic, eliminating the need to choose instance types or manage scaling policies. To learn more, see Serverless endpoint operations.
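As a sketch of the serverless option, the following helper builds a `CreateEndpointConfig` request with a `ServerlessConfig` block. The helper name, variant name, and the memory and concurrency values are assumptions sized for a modest call volume, not recommendations:

```python
def serverless_endpoint_config(config_name, model_name,
                               memory_mb=2048, max_concurrency=5):
    """Build a CreateEndpointConfig request for a serverless SageMaker
    endpoint. Pass the result to boto3's sagemaker client as
    client.create_endpoint_config(**request)."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_mb,      # 1024-6144, in 1 GB steps
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }
```

With no instances to manage, the endpoint scales with traffic automatically, which fits the bursty, unpredictable arrival pattern of call center traffic.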
To use a serverless deployment and obtain inference recommendations, a single-container model is needed. This is currently not possible when using SageMaker Data Wrangler, because it generates two containers after data transformation: one for data preprocessing and one for model prediction.
After preparing the final dataset, you can bring it into SageMaker Canvas to train the detection model. Follow the same steps as you did with the smaller tests, making sure that you don’t use any of the data transformation options within SageMaker Canvas.
Depending on the size and complexity of the final dataset, the training process can take several hours to complete. After the training is complete, SageMaker Canvas will generate a trained model that you can immediately deploy to an endpoint.
Conclusion
By using the power of Amazon SageMaker Canvas, you can build, train, and deploy a robust fraud detection model, empowering your call center to deliver exceptional customer experiences while safeguarding your business and allowing your human agents to focus on legitimate customers.
You can begin testing SageMaker Canvas using the AWS Management Console today or learn more about SageMaker Canvas basics at the SageMaker Canvas Immersion Day workshop.