TextRay from Systems Limited is a Solution on AWS for Extracting Information from Scanned Documents

By Faizan Siddiqui, AWS AI/ML Sr. Consultant – Systems Limited
By Hamza Awan, AWS AI/ML Consultant – Systems Limited
By Abdullah Jamshed, AWS Cloud Lead – Systems Limited
By Cjin Pheow Lee, Sr. Partner Solutions Architect – AWS


The field of information extraction (IE) from digital documents is undergoing intensive research, with a broad range of industrial applications. As organizations strive to improve data integrity, encryption, and information security, many are transitioning to digitized documents.

Documents serve as a cornerstone for recordkeeping, communication, collaboration, and transactions in industries such as finance, healthcare, law, and real estate. The millions of mortgage applications and hundreds of millions of W-2 tax forms processed annually are just two examples of documents that contain a wealth of information.

However, extracting insights from these unstructured documents is often challenging, requiring time-intensive and complicated processes for search and discovery, automating business processes, and ensuring compliance.

TextRay from Systems Limited sets itself apart from other information extraction solutions by offering a unique capability: the ability to train the model using custom data. This feature empowers users to extract even the most intricate values from various types of documents.

The solution uses Amazon Web Services (AWS) to automatically extract information from scanned documents, reducing processing time while improving accuracy and productivity.

In this post, we will illustrate how you can make use of TextRay to extract text and data from scanned documents automatically and with great efficiency. Specifically, we’ll guide you on how to set up the solution with AWS services for optimal extraction results.

Additionally, with AWS handling the deployment of TextRay in a highly scalable and highly available environment, you can easily access the model with straightforward API actions. The sample code for deploying and testing the TextRay solution is available on Bitbucket.

Systems Limited is an AWS Specialization Partner and leading global systems integrator that aims to enhance the productivity and growth of global organizations with comprehensive cloud-first adoption strategies.

Customer Challenges

As organizations strive to extract meaningful insights and knowledge from large volumes of data, information extraction has become a critical aspect of their operations. The following issues are frequently faced by customers:

Manual information extraction: The process of manually extracting information from documents can be time-consuming and labor-intensive, especially for large volumes of documents.
Costly current solutions: Existing solutions for information extraction can be too expensive for organizations to run cost-effectively.
Poor accuracy: Accuracy of existing solutions for information extraction can be lacking, leading to incorrect or incomplete information.
Slow document processing: The process of extracting information from documents can be slow, particularly when dealing with large volumes of documents.
Incomplete extraction of critical fields: Existing solutions may not extract all of the important fields or extract fields which are not required, resulting in irrelevant, missing, or incomplete information.
Complex data sources: Sources of data, such as documents, can be complex and challenging to extract information from, requiring specialized skills and techniques.
Limited data accessibility: Access to an organization's data may be restricted, making it difficult to ingest that data securely into an information extraction solution.

Solution Overview

Recognizing these challenges, Systems Limited developed an automated on-demand solution that extracts information from documents with precision, allowing customers to increase efficiency, reduce errors, and achieve their goals.

The TextRay solution is accurate, cost-effective, scalable, and highly available. Here’s how it works:

TextRay begins with the submission of PDFs containing scanned documents.
The PDF to Images module extracts the images from the PDFs and forwards them to a deep learning model for information detection.
The model applies object detection to the scanned images to locate the key information, such as name, address, and expense.
The detected information cells are cropped out of the original images and passed on to the post-processing module.
The post-processing module transforms the extracted information into a dictionary format that is convenient for downstream use.
Finally, a CSV file is generated that contains the information as key-value pairs.

Figure 1 – Abstract flow to understand the working of the system.
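
The crop and post-processing steps of this flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TextRay's implementation: the `Detection` class, the `crop` and `extract_fields` helpers, and the `read_text` callback (a stand-in for a real OCR engine) are all hypothetical names, and the "image" is a toy grid of characters rather than real pixels.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str  # field name predicted by the detector, e.g. "name"
    box: tuple  # (x_min, y_min, x_max, y_max) in pixel coordinates


def crop(image, box):
    """Return the part of a row-major pixel grid covered by the box."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def extract_fields(image, detections, read_text):
    """Crop each detected cell, read its text, and collect a dictionary."""
    fields = {}
    for det in detections:
        raw_text = read_text(crop(image, det.box))
        # Collapse runs of whitespace that text recognition often leaves behind.
        fields[det.label] = " ".join(raw_text.split())
    return fields


# Toy example: each "pixel" is a character, so reading text is a simple join.
image = [list("Jane Doe  "), list("42 Main St")]
detections = [
    Detection("name", (0, 0, 8, 1)),
    Detection("address", (0, 1, 10, 2)),
]
fields = extract_fields(
    image, detections, lambda cell: "".join("".join(row) for row in cell)
)
print(fields)
```

In a real pipeline the grid would be an image array and `read_text` would call an OCR library; the dictionary produced here is the structure that is later written out as key-value pairs.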

Data Preparation

Data labelling is a labor-intensive task, often demanding significant manual effort. Image annotation involves the act of labelling images within a dataset to facilitate the training of a detection model.

To build the dataset, TextRay employs labelling software named LabelImg, which allows users to manually create annotations for each image. This involves drawing bounding boxes around the objects of interest within the image and associating them with class labels.

Finally, these annotations are exported in the desired format, such as COCO, JSON, YOLO, and others, to support further model development.
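
To make the export step concrete, the sketch below converts one pixel-space bounding box into a YOLO-style label line: a class index followed by the box center and size, each normalized by the image dimensions. The function name, the class list, and the sample coordinates are assumptions chosen for illustration and do not come from the TextRay dataset.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    if not (0 <= x_min < x_max <= image_width and 0 <= y_min < y_max <= image_height):
        raise ValueError("box must lie inside the image and have positive area")
    # YOLO stores the box center and size as fractions of the image size.
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical class list; the index of a label becomes its class id.
CLASSES = ["name", "address", "expense"]
line = to_yolo_line(CLASSES.index("address"), (100, 200, 500, 300), 1000, 2000)
print(line)  # 1 0.300000 0.125000 0.400000 0.050000
```

Annotation tools typically write one such line per bounding box into a text file that shares its base name with the image.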

Technical Architecture

TextRay leverages multiple AWS services to store and process data, train and roll out the model, and manage incoming inference requests to the solution.

Figure 2 – Technical architecture diagram of the solution.

Information extraction is the cornerstone of the TextRay solution, and it’s performed on an Amazon Elastic Compute Cloud (Amazon EC2) GPU instance and through an Amazon SageMaker endpoint. The effectiveness and efficiency of the information extraction process hinge on the quality of the data stored in the Amazon Simple Storage Service (Amazon S3) bucket.

Below are the steps involved in the TextRay solution and the AWS services leveraged:

Customer data arrives in the designated Amazon S3 bucket.
The data is retrieved by an NVIDIA GPU-based EC2 instance, where it is preprocessed and the deep learning model is trained.
The trained model artefacts are saved in the S3 bucket for deployment in the next step.
The saved model is deployed on a SageMaker endpoint to make it accessible to the customer for inference.
SageMaker real-time inference serves the model and provides scalability: the number and size of instances are configurable, and automatic scaling can adjust capacity based on the input size.
An AWS Lambda function is set up to call the SageMaker endpoint programmatically, using input data from the S3 bucket.
The Lambda function reads the test data from the S3 bucket, requests inference from the SageMaker endpoint, and saves the extracted information back to the S3 bucket in CSV format.
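
The last two steps can be sketched as a Lambda handler along the following lines. This is a hedged illustration rather than the function shipped with TextRay: it assumes the function is triggered by an S3 event notification, that the endpoint accepts raw PDF bytes and returns a JSON object mapping field names to values, and that the endpoint name is supplied in a hypothetical `ENDPOINT_NAME` environment variable. The `results/` output prefix is also an assumption.

```python
import csv
import io
import json
import os
from urllib.parse import unquote_plus


def process_document(s3, runtime, bucket, key, endpoint_name):
    """Fetch a PDF from S3, run inference, and store the result as CSV."""
    pdf_bytes = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Assumption: the endpoint takes PDF bytes and answers with a JSON
    # object of the form {"field name": "extracted value", ...}.
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/pdf",
        Body=pdf_bytes,
    )
    fields = json.loads(response["Body"].read())

    # Write the extracted fields as key-value rows.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["key", "value"])
    for name, value in fields.items():
        writer.writerow([name, value])

    output_key = f"results/{os.path.basename(key)}.csv"  # hypothetical layout
    s3.put_object(Bucket=bucket, Key=output_key, Body=buffer.getvalue().encode("utf-8"))
    return output_key


def lambda_handler(event, context):
    import boto3  # imported here so process_document can be exercised offline

    record = event["Records"][0]["s3"]  # assumes an S3 event notification
    output_key = process_document(
        boto3.client("s3"),
        boto3.client("sagemaker-runtime"),
        record["bucket"]["name"],
        unquote_plus(record["object"]["key"]),  # keys arrive URL-encoded
        os.environ["ENDPOINT_NAME"],  # hypothetical environment variable
    )
    return {"statusCode": 200, "body": output_key}
```

Passing the clients into `process_document` keeps the AWS calls in one place and lets the logic be tested with simple stand-in objects before deployment.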

Customer Benefits

TextRay harnesses the power of AWS services to create an efficient infrastructure for the solution. TextRay is cost-effective as most of the AWS services utilized in the solution follow a pay-as-you-go pricing model. Note that an estimated monthly cost example is provided in the cost section below.

This solution employs a deep learning model to detect and extract information from scanned documents, keeping the extraction precise by identifying only the necessary information. The combination of this model with the scalability and versatility of AWS services allows fast processing.

By providing a centralized access point to the solution, TextRay eliminates the need for customers to train staff to use it, optimizing the utilization of human resources. It also leverages Amazon S3 for data storage, helping to secure both customer and organizational data.

Deployment

TextRay provides an AWS CloudFormation template that deploys the resources needed to test the solution. This template uses the base model’s artefacts to create a SageMaker model and deploy it as a real-time inference endpoint.

The template also creates the Lambda function that invokes the endpoint. The function is built from an Amazon Elastic Container Registry (Amazon ECR) image, which you can build and push to an ECR repository by following the steps below.

Download the Lambda function code provided and save it in a directory.
Move into the directory containing the Lambda code and build a Docker image using the following command:
```
docker build . -t sagemaker-textray-inference:latest
```

Authenticate Docker to your ECR registry (replace region and account_id with your own values):
```
aws ecr get-login-password --region region | docker login --username AWS --password-stdin account_id.dkr.ecr.region.amazonaws.com
```

Create an ECR repository:
```
aws ecr create-repository --repository-name sagemaker-textray-inference
```

Tag the image built previously with the ECR repository:
```
docker tag sagemaker-textray-inference:latest account_id.dkr.ecr.region.amazonaws.com/sagemaker-textray-inference:latest
```

Push the image to the ECR repository:
```
docker push account_id.dkr.ecr.region.amazonaws.com/sagemaker-textray-inference:latest
```

After the image is pushed to the ECR repository, supply the image URI as a parameter to the CloudFormation template during stack deployment.

Cost

TextRay’s cost advantage allows organizations to make the most of their resources while enjoying the benefits of advanced information extraction technology. The estimated monthly costs for running the TextRay application on AWS in the US East (N. Virginia) region are as follows:

SageMaker real-time inference – $547.756
AWS Lambda for TextRay invoke endpoint – $0
Amazon S3 for TextRay data storage – $0.03

The one-time training cost for the EC2 instance (g4dn.xlarge) is $8.416. These figures provide an overview of the expected expenses involved in deploying and maintaining TextRay on AWS. It’s essential to be mindful of resource usage to optimize costs and ensure efficient operation of the application.

Please note this estimate is for processing 6,000 documents per month. With the increase in the number of documents, the cost of Lambda and S3 will increase accordingly.
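
As a quick check on these figures, the short calculation below totals the monthly line items and derives an approximate cost per document at the stated volume of 6,000 documents. The dollar amounts are taken from the estimate above; the variable names are illustrative, and the per-document figure is a derived approximation rather than a published price.

```python
from decimal import Decimal

# Monthly line items from the estimate above (USD).
MONTHLY_COSTS_USD = {
    "SageMaker real-time inference": Decimal("547.756"),
    "AWS Lambda": Decimal("0"),
    "Amazon S3": Decimal("0.03"),
}
ONE_TIME_TRAINING_USD = Decimal("8.416")
DOCUMENTS_PER_MONTH = 6000

monthly_total = sum(MONTHLY_COSTS_USD.values(), Decimal("0"))
cost_per_document = monthly_total / DOCUMENTS_PER_MONTH
first_month_total = monthly_total + ONE_TIME_TRAINING_USD

print(f"Monthly total: ${monthly_total}")                       # $547.786
print(f"Per document: ${cost_per_document:.4f}")                # $0.0913
print(f"First month with training: ${first_month_total}")       # $556.202
```

At this volume the inference endpoint accounts for nearly the entire bill, which is why endpoint sizing is the main lever for controlling cost.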

Results and Discussion

Systems Limited is offering the foundational TextRay solution model to showcase its accuracy and efficient document processing on AWS services. This model was trained on scanned documents that contain forms and tables.

TextRay’s model recognizes the structure of tables and forms and then identifies the key information in them, as shown in the figure below.

Figure 3 – Information detection results.

The post-processing module receives the detected cells, applies optical character recognition (OCR) to extract the actual text, and then stores this text in a CSV file as key-value pairs, as shown below.

Figure 4 – Information extraction results in CSV format.

Conclusion

TextRay from Systems Limited strikes a balance between advanced technology and practicality, offering a cost-effective way to automatically extract text and valuable information from scanned documents.

TextRay delivers tangible benefits to customers by significantly reducing turnaround time and minimizing the need for manual intervention, while maintaining a high level of accuracy in most cases.

The sample code for deploying and testing the TextRay solution is available on Bitbucket. The pre-trained base model, which can be used with the provided AWS CloudFormation deployment template, is hosted on an Amazon S3 bucket and can be downloaded from there.


AWS Partner Network (APN) Blog