June 3, 2019

Amazon GAs Textract, Its New Machine Learning-Based OCR Service

Alex Woodie

Companies that rely on optical character recognition (OCR) to digitize the content of printed forms may be interested in Textract, a new machine learning-based OCR service that just became available from Amazon Web Services. The OCR service can digitize simple text as well as more complex data contained in forms and tables.

OCR is rudimentary form of AI, and it isn’t new. In fact, organizations have been using the technology to bridge gap between the world of printed forms and digitized data since 1914, when the earliest OCR technology was developed.

Despite the decades of productive OCR use, there are drawbacks with standard implementations of the technology. One of the biggest disadvantages is the need to continually adapt the OCR rules to account for forms that change over time, not to mention configuring OCR to work with new forms.

This requires an investment in human programmers to maintain the extract rules that guide the OCR. Vendors in the space have made progress in minimizing the amount of OCR maintenance work, but it’s never been zero.

That’s the core area where AWS says it’s made improvements to OCR with Textract. The cloud giant says its technology can automatically detect the layout of a document, including where its key elements reside on the page, without any coding or configuration of templates. AWS also claims that Textract can “understand the data relationships” contained in embedded forms or tables, and maintain the context of the data as it converts it from a document into a digitized form.

The secret to all this automated understanding and maintaining of context, of course, is machine learning. AWS says that it has trained Textract’s machine learning algorithm on millions of documents across multiple industries, ranging from contracts, tax documents, and sales orders to enrollment forms, benefit applications, and insurance claims.

All that training helps to minimize the human maintenance required. The key question, of course, is how much human maintenance will be required to maintain a high level of OCR accuracy with Textract.

AWS claims there won’t be much human intervention required. “You no longer need to maintain code for every document or form you might receive or worry about how page layouts change over time,” the company says.

However, the company isn’t guaranteeing 100% accuracy. The service comes with an “adjustable confidence thresholds” that allows users to configure the system to automatically flag content that Textract is not sure about.

“For instance, if you are extracting information from tax documents and want to ensure high accuracy, then you can create business logic to flag any extracted information with a confidence score lower than 95% to be reviewed by a human,” the company says.

Textract works with a range of document inputs, including scans from multi-function devices, PDFs, and photos. It’s default is to output the OCR extract in JSON format. It will automatically load the JSON into one of several AWS repositories, including Amazon DynamoDB, Amazon S3, Amazon Comprehend, or Amazon Elasticsearch Service.

AWS exposes two APIs with Textract, including the Detect Document Text API and the Analyze Document API. The first API works with textual data, while the second one is designed to work with data contained in simple forms and more complex tables.

The Analyze Document Text API can create key-value pairs from data contained in simple forms. The software automatically maintains the relationships between the individual pieces of source data, which retains the inherent context contained in the document so that it can be referenced later.

Similarly, when using the Analyze Document Text API to extract information from tables contained in more structured documents, such as financial or medical records, Textract gives users the option of loading the data into a pre-defined schema, which will be a useful for loading the data into existing applications or business intelligence reports.

Textract debuted last fall, when AWS CEO Andy Jassy announced the beginning of beta testing for the service during the cloud giant’s Re:Invent conference. AWS announced last week that the testing is now done, and customers can begin using the service on meaningful workloads. However, it’s only available in four AWS regions, including Northern Virginia, Ohio, Oregon, and Ireland.

For more information, see the AWS Textract website.

This Week in Graph and Entity Analytics

Applications: Data Mining

Technologies: Cloud

Sectors: Financial Services, Healthcare

Vendors: AWS

Tags: amazon, Amazon Elasticsearch, Amazon Web Services, document database, json, key-value pairs, machine learning, OCR, Textract

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

Amazon GAs Textract, Its New Machine Learning-Based OCR Service

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 22, 2024

April 19, 2024

April 18, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Building an Operational Data Warehouse for Real-time Analytics

Can You Use Kafka as a Database?

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

Call & Contact Center Expo

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

Amazon GAs Textract, Its New Machine Learning-Based OCR Service

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 22, 2024

April 19, 2024

April 18, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link