
AWS Textract form design best practices

I'm currently redesigning documents and forms to make extraction with AWS Textract easier.

Do you have any experience or best practices to share?

Regards

AWS Textract uses machine learning algorithms to extract data from forms and tables. Overall, AWS does not provide any design practices to follow. The idea is that Textract can extract data no matter what the format is.

What I'd suggest is to do some manual testing. See what the most common problems are with the forms or documents you currently use. Check where the data is missing, inconsistent, or simply wrongly detected, and try to address those places. Then repeat the same process with the new forms to see if there's an improvement.
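The testing loop above can be partly automated. The following is a minimal sketch, not an official method: it assumes the response shape returned by the `analyze_document` call in boto3 with `FeatureTypes=["FORMS"]`, and the names `analyze_form`, `low_confidence_fields`, and the 90.0 threshold are made up for illustration. It lists form fields whose key or value has low confidence, or whose value came back empty.

```python
def analyze_form(path):
    """Send one local file to Textract (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers below work without it

    with open(path, "rb") as f:
        data = f.read()
    client = boto3.client("textract")
    return client.analyze_document(Document={"Bytes": data}, FeatureTypes=["FORMS"])


def block_text(block, blocks_by_id):
    """Join the text of a block's WORD / SELECTION_ELEMENT children."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def low_confidence_fields(response, threshold=90.0):
    """Return (key, value, confidence) for fields that look unreliable.

    The threshold is an arbitrary assumption; tune it on your own documents.
    """
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    flagged = []
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # A KEY block points at its VALUE block(s) through a VALUE relationship.
        value_blocks = [blocks_by_id[i]
                        for rel in block.get("Relationships", [])
                        if rel["Type"] == "VALUE"
                        for i in rel["Ids"]]
        value_text = " ".join(block_text(v, blocks_by_id) for v in value_blocks)
        confidence = min([block["Confidence"]] + [v["Confidence"] for v in value_blocks])
        if confidence < threshold or not value_text:
            flagged.append((block_text(block, blocks_by_id), value_text, confidence))
    return flagged
```

Running `low_confidence_fields(analyze_form("old_form.png"))` and then the same call on the redesigned form (both file names are placeholders) gives two lists that can be compared field by field.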

Is improving Textract accuracy your only goal? If so, you are probably already aware of the existing issues. Use that knowledge.

In that case it would be extremely helpful to know which places were improved.

To give a better answer, it would also help to know what types of documents we are talking about, and which frameworks/generators you're using.

Here are some recommended best practices from the Amazon Textract Developer Guide for providing an optimal input document:

The following is a list of a few ways that you can optimize your input documents for better results.

  • Ensure that your document text is in a language that Amazon Textract supports. Currently, Amazon Textract supports English, Spanish, German, Italian, French, and Portuguese.
  • Provide a high quality image, ideally at least 150 DPI.
  • If your document is already in one of the file formats that Amazon Textract supports (PDF, TIFF, JPEG, and PNG), don't convert or downsample the document before uploading it to Amazon Textract.
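To make the 150 DPI figure concrete, here is a small sketch of the arithmetic: pixels = inches × DPI. The helper names `min_pixels` and `meets_dpi` are made up for this example, and the physical page size is something you have to supply yourself.

```python
import math


def min_pixels(width_in, height_in, dpi=150):
    """Smallest pixel dimensions that reach `dpi` for a page of the given size in inches."""
    return math.ceil(width_in * dpi), math.ceil(height_in * dpi)


def meets_dpi(px_width, px_height, width_in, height_in, dpi=150):
    """True if an image of px_width x px_height covers the page at `dpi` or better."""
    need_w, need_h = min_pixels(width_in, height_in, dpi)
    return px_width >= need_w and px_height >= need_h


# A US Letter page (8.5 x 11 in) needs at least 1275 x 1650 pixels at 150 DPI.
print(min_pixels(8.5, 11))
```

A scan of the same page at 850 x 1100 pixels is only 100 DPI, so `meets_dpi(850, 1100, 8.5, 11)` returns `False`.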

For the best results when extracting text from tables in documents, ensure that:

  • Tables in your document are visually separated from surrounding elements on the page. For example, the table isn't overlaid onto an image or complex pattern.
  • Text within the table is upright. For example, the text isn't rotated relative to other text on the page.

When extracting text from tables, you might see inconsistent results when:

  • Merged table cells span multiple columns.
  • Tables have cells, rows, or columns that are different from other parts of the same table.
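One way to find tables that fall into the problem cases above is to look for cells that span more than one row or column in a Textract response. This is only a sketch: it assumes the response comes from `analyze_document` with `FeatureTypes=["TABLES"]`, that spanning cells appear as `CELL` or `MERGED_CELL` blocks carrying `RowSpan` and `ColumnSpan`, and the function name `spanning_cells` is invented for the example.

```python
def spanning_cells(response):
    """Return (row, column, row_span, column_span) for every cell that spans
    more than one row or column in a Textract response dict."""
    found = []
    for block in response["Blocks"]:
        if block["BlockType"] not in ("CELL", "MERGED_CELL"):
            continue
        # Treat a missing span as 1, i.e. an ordinary single cell.
        row_span = block.get("RowSpan", 1)
        col_span = block.get("ColumnSpan", 1)
        if row_span > 1 or col_span > 1:
            found.append((block["RowIndex"], block["ColumnIndex"], row_span, col_span))
    return found
```

If the list is non-empty for a form you control, replacing the merged cells with a regular grid is a cheap change to test.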

I highly recommend taking a look at the Developer Guide.
