
AWS Textract form design best practices

Original post: 2022-03-22 15:54:22 8 2 amazon-web-services/amazon-textract

I'm currently redesigning documents and forms to make extraction with AWS Textract easier.

Do you have any experience or best practices to share?

Regards

2 answers

AWS Textract uses machine learning algorithms to extract data from forms and tables. Overall, AWS does not provide design best practices to follow. The idea is that it can extract data no matter what the format.

What I'd suggest is to do some manual testing. See what the most common problems are with the forms or documents you're using now. Check where the data is missing, inconsistent, or simply wrongly detected, and try to address those places. Then repeat the same process on the new forms to see if there's an improvement.
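To make that manual pass repeatable, here is a minimal sketch in Python. It assumes you have already saved the JSON response of an AnalyzeDocument call with the FORMS feature as a dict. The function names and the 90.0 threshold are my own placeholders, not something Textract defines, so tune them to your documents.

```python
from typing import Dict, List, Tuple


def block_text(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of the WORD children of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                words.append(child.get("Text", ""))
    return " ".join(words)


def low_confidence_fields(
    response: dict, threshold: float = 90.0  # assumed cut-off, not an AWS default
) -> List[Tuple[str, str, float]]:
    """Return (key text, value text, confidence) for fields worth reviewing.

    A field is flagged when its key or value confidence is below the
    threshold, or when no value text was detected at all.
    """
    by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    flagged = []
    for block in by_id.values():
        is_key = (
            block.get("BlockType") == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        # A KEY block points at its VALUE block(s) through a VALUE relationship.
        value_blocks = [
            by_id[i]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for i in rel["Ids"]
            if i in by_id
        ]
        value_text = " ".join(block_text(v, by_id) for v in value_blocks).strip()
        confidence = min(
            [block.get("Confidence", 0.0)]
            + [v.get("Confidence", 0.0) for v in value_blocks]
        )
        if confidence < threshold or not value_text:
            flagged.append((block_text(block, by_id), value_text, confidence))
    return flagged
```

Running this over the responses for the old and the redesigned form and comparing the flagged fields gives a rough, per-field picture of where the redesign helped.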

Is improving Textract accuracy your only goal? If so, you are probably already aware of the existing issues. Use that knowledge.

In this case it would be extremely helpful to know which places were improved.

It would also help in giving a better answer to know what types of documents we are talking about, and which frameworks/generators you're using.

Here are some recommended best practices from the Amazon Textract Developer Guide on how to provide an optimal input document:

The following is a list of a few ways that you can optimize your input documents for better results.

Ensure that your document text is in a language that Amazon Textract supports. Currently, Amazon Textract supports English, Spanish, German, Italian, French, and Portuguese.

Provide a high quality image, ideally at least 150 DPI.

If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPEG, and PNG), don't convert or downsample the document before uploading it to Amazon Textract.

For the best results when extracting text from tables in documents, ensure that:

Tables in your document are visually separated from surrounding elements on the page. For example, the table isn't overlaid onto an image or complex pattern.

Text within the table is upright. For example, the text isn't rotated relative to other text on the page.

When extracting text from tables, you might see inconsistent results when:

Merged table cells that span multiple columns.

Tables with cells, rows, or columns that are different from other parts of the same table.
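If you want to check whether a redesigned form still trips over the table caveats above, one rough option is to look for spanning cells in a saved AnalyzeDocument response (TABLES feature). The sketch below assumes the response is a plain dict parsed from JSON. The function name is hypothetical, and it checks both CELL and MERGED_CELL blocks because either may carry a span greater than 1.

```python
from typing import List, Tuple


def find_spanning_cells(response: dict) -> List[Tuple[int, int, int, int]]:
    """Return (row, column, row_span, column_span) for every table cell
    that spans more than one row or column."""
    spanning = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") not in ("CELL", "MERGED_CELL"):
            continue
        row_span = block.get("RowSpan", 1)
        column_span = block.get("ColumnSpan", 1)
        if row_span > 1 or column_span > 1:
            spanning.append(
                (
                    block.get("RowIndex", 0),
                    block.get("ColumnIndex", 0),
                    row_span,
                    column_span,
                )
            )
    return spanning
```

An empty result suggests the table is a plain grid. Any entries point at the cells you may want to split or restructure in the form design.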

I highly suggest you take a look at the Developer Guide.
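The file format and resolution guidelines quoted in this answer can be turned into a small pre-upload check. This is only a sketch: the function names are made up for illustration, and the physical page size in inches is something you have to supply yourself (for example 8.5 x 11 for US Letter), since it is used to estimate the effective DPI of a scan from its pixel dimensions.

```python
from pathlib import Path
from typing import List

# File formats and minimum resolution taken from the guidelines quoted above.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}
MIN_DPI = 150


def effective_dpi(
    width_px: int, height_px: int, width_in: float, height_in: float
) -> float:
    """Estimate DPI from pixel size and physical page size (the lower axis wins)."""
    return min(width_px / width_in, height_px / height_in)


def preflight(
    filename: str, width_px: int, height_px: int, width_in: float, height_in: float
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file format: {suffix or '(none)'}")
    dpi = effective_dpi(width_px, height_px, width_in, height_in)
    if dpi < MIN_DPI:
        problems.append(f"resolution too low: {dpi:.0f} DPI (want >= {MIN_DPI})")
    return problems


# A US Letter page scanned at 1275 x 1650 pixels is exactly 150 DPI.
print(preflight("form.png", 1275, 1650, 8.5, 11.0))  # → []
```

The DPI check only applies to scanned or rasterized pages. For a PDF generated directly from a form tool, only the format check is relevant.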




solution1
0 2022-04-23 10:31:02

solution2
0 ACCEPTED 2022-04-29 09:14:43
