Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.
Many companies today extract data from documents and forms through manual data entry that’s slow and expensive or through simple optical character recognition (OCR) software that requires manual customization or configuration. Rules and workflows for each document and form often need to be hard-coded and updated with each change to the form or when dealing with multiple forms. If the form deviates from the rules, the output is often scrambled and unusable.
Amazon Textract overcomes these challenges by using machine learning to instantly “read” virtually any type of document to accurately extract text and data without the need for any manual effort or custom code. With Textract you can quickly automate document workflows, enabling you to process millions of document pages in hours. Once the information is captured, you can take action on it within your business applications to initiate next steps for a loan application or medical claims processing. Additionally, you can create smart search indexes, build automated approval workflows, and better maintain compliance with document archival rules by flagging data that may require redaction.
Benefits of Amazon Textract:
- Optical Character Recognition (OCR)
- Machine Learning Backend
- No Machine Learning expertise required
- Extract data quickly & accurately
- No code or templates to maintain
- Lower document processing costs
- Identify Key Value Pairs Automatically
- Identify Table Values Automatically
- Image Scanner
- PDF Scanner
- Detect Latin-script characters (English alphabet) and ASCII symbols
- Support for PDF, JPG, PNG Document formats
- JPG and PNG File Documents up to 10MB in size
- PDF Documents up to 300MB in size
- PDF Documents of up to 3000 Pages
- “Pay as you go” payment model
- Easy to customize
- Simple setup: create an IAM user with the Amazon Textract and Amazon S3 policies attached.
- For the PDF and Image Textract options, add your AWS IAM user's Access Key ID and Secret Access Key and your Amazon S3 bucket name to the configuration, and you are all set!
- The Maximum Textract option requires setting up the AWS Lambda, Amazon SNS, Amazon SQS, and Amazon SES services. Instructions are provided.
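The format and size limits in the list above can be checked before a file is uploaded. The following is a minimal sketch in Python, used here only for illustration (the app itself uses the AWS PHP SDK). The function name `check_document` is hypothetical, the limits are copied from the list above and may differ from current AWS limits, and accepting the `.jpeg` extension alongside `.jpg` is an assumption.

```python
from pathlib import PurePath

MB = 1024 * 1024

# Limits as stated in the feature list; verify against current AWS limits.
MAX_IMAGE_BYTES = 10 * MB
MAX_PDF_BYTES = 300 * MB
MAX_PDF_PAGES = 3000

# ".jpeg" is an assumption; the list only names JPG and PNG.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def check_document(filename, size_bytes, page_count=1):
    """Return a list of problems; an empty list means the file passes."""
    ext = PurePath(filename).suffix.lower()
    problems = []
    if ext in IMAGE_EXTENSIONS:
        if size_bytes > MAX_IMAGE_BYTES:
            problems.append("image larger than 10 MB")
    elif ext == ".pdf":
        if size_bytes > MAX_PDF_BYTES:
            problems.append("PDF larger than 300 MB")
        if page_count > MAX_PDF_PAGES:
            problems.append("PDF has more than 3000 pages")
    else:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an upload form report every issue with a file at once.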
Cost of Running Amazon Textract:
- Hosting for the application itself (you can use any hosting platform you prefer)
- AWS account (free to open; you will be on the Free Tier for the first year)
- Amazon S3 Storage Cost (For Data Storage and Data Traffic Out)
With Amazon Textract you pay only for what you use. There are no minimum fees and no upfront commitments. Amazon Textract charges you for each page you process, and the rate depends on whether you extract only text or text together with tables and/or form data. A single page may contain between 0 and 3,000 words.
Detect Document Text API: The Detect Document Text API uses optical character recognition (OCR) technology to extract text from a provided document.
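As an illustration of what the Detect Document Text API returns, the sketch below pulls the recognized lines out of a response. It is written in Python for illustration only (the app itself uses the AWS PHP SDK). It assumes the response is the parsed JSON of a `DetectDocumentText` call, which contains a `Blocks` list whose entries carry a `BlockType` such as `PAGE`, `LINE`, or `WORD`. The helper name `extract_lines` and the sample response are hypothetical.

```python
def extract_lines(response):
    """Collect the text of every LINE block, in the order returned."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample in the shape of a DetectDocumentText response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Loan Application"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Loan"},
        {"BlockType": "WORD", "Id": "w2", "Text": "Application"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Applicant: Jane Doe"},
    ]
}

lines = extract_lines(sample_response)
```

`WORD` blocks are skipped here because each `LINE` block already contains the joined text of its words.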
Analyze Document API: The Analyze Document API extracts data from tables and key-value pairs from forms, for example the form label “First Name” and its associated value. The OCR step of the Detect Document Text API is included at no extra charge when you use the Analyze Document API.
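The key-value pairing described above can be read out of an Analyze Document response roughly as follows. This is a sketch in Python for illustration only, not the app's own PHP code. It assumes the response is the parsed JSON of an `AnalyzeDocument` call made with the `FORMS` feature, where `KEY_VALUE_SET` blocks are linked to their value through a `VALUE` relationship and to their words through `CHILD` relationships. The helper names and the sample response are hypothetical.

```python
def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each form label (KEY block) to the text of its VALUE block."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[_text_of(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample in the shape of an AnalyzeDocument (FORMS) response.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "First"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    ]
}

fields = extract_key_values(sample_response)
```

Fields with an empty value map to an empty string, so a blank form field is still reported rather than silently dropped.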
You can get started for free with the AWS Free Tier. New AWS customers can analyze up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API, for the first three months.
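To make the Free Tier arithmetic concrete, the sketch below works out how many pages in a month would be billable. It is written in Python for illustration only. The allowances and the three-month window are taken from the paragraph above. The function names, the `months_since_signup` convention (0 for the first month), and the per-page price argument are assumptions, and no actual AWS price is encoded in the code.

```python
# Monthly Free Tier allowances as stated above (pages per month).
FREE_PAGES_PER_MONTH = {
    "detect_document_text": 1000,
    "analyze_document": 100,
}
FREE_TIER_MONTHS = 3  # allowance applies to the first three months only


def billable_pages(api, pages_this_month, months_since_signup):
    """Pages charged this month after subtracting any Free Tier allowance.

    months_since_signup is 0 during the first month (an assumed convention).
    """
    in_free_tier = months_since_signup < FREE_TIER_MONTHS
    allowance = FREE_PAGES_PER_MONTH[api] if in_free_tier else 0
    return max(0, pages_this_month - allowance)


def estimate_cost(api, pages_this_month, months_since_signup, price_per_page):
    """Estimated charge; price_per_page must be supplied by the caller."""
    return billable_pages(api, pages_this_month, months_since_signup) * price_per_page
```

For example, 1,500 pages sent to Detect Document Text in the first month leave 500 billable pages, while the same volume in the fourth month is billed in full.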
- For up-to-date prices, see the Amazon Textract pricing page.
Requirements:
- AWS PHP SDK v3 (already included with the app)
- AWS IAM user with Amazon Textract and Amazon S3 access policies attached
- Amazon S3 bucket with public access
- All requirements are also listed and explained in the documentation
- For setting up Maximum Textract, see the documentation
Release Notes – Change Logs:
- 20.06.2020 - 1.0.0 - Update: Documentation - Fix: Lambda function minor fix
- 08.05.2020 - 1.0.0 - Initial Release