
Navigate through a pdf file to find specific pages and extract tabular data from images with python

Original 2021-12-15 09:55:34 4 1 python/ python-3.x/ dataframe/ tabular/ image-extraction

I've come across an assignment that requires me to extract tabular data from images in a pdf file into neatly formatted dataframes using python code. There are several files to process, and the relevant pages may have different page numbers in each file, so the sequence of steps for this problem (my assumption) is:

Navigate to the relevant section of the pdf
Extract images of the tabular data
Extract data from the images, format it and convert it to dataframes.
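
As a rough illustration of the first step, here is a minimal sketch that picks out pages by a heading rather than by page number. It assumes the text of each page has already been obtained as a string by some pdf text extractor (not shown), and that the relevant pages carry a recognizable heading; `find_pages` and the sample page texts are hypothetical names and data made up for this example.

```python
import re
from typing import Iterable, List


def find_pages(page_texts: Iterable[str], keywords: Iterable[str]) -> List[int]:
    """Return 1-based page numbers whose text contains every keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for number, text in enumerate(page_texts, start=1):
        # Collapse whitespace so headings split across lines still match.
        normalized = re.sub(r"\s+", " ", text).lower()
        if all(k in normalized for k in wanted):
            hits.append(number)
    return hits


# Hypothetical page texts, as a pdf text extractor might return them.
pages = [
    "Annual report\nContents",
    "Section 4\nSummary of\nquarterly results",
    "Notes to the accounts",
]
print(find_pages(pages, ["summary of quarterly results"]))  # [2]
```

Matching on a heading keeps the lookup independent of where the section falls in each file. If the headings are themselves part of an image, this would only work after an OCR pass.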

Some Google searches turned up libraries for pdf text extraction, table extraction and more - modular solutions only.

I would appreciate some help in this regard. What packages should I use? Is my approach correct? Can I get references to any helpful code snippets for similar problems?

page structure of the required tables

1 answer

This started as a comment. I believe the answer is valid as it is in no way an endorsement of the service. I don't even use it. I know Azure uses SO as well.

This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):

https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer

Here are some python examples of how to use it:

https://docs.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python

The AWS equivalent is Textract: https://aws.amazon.com/textract

The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser
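
Whichever service you pick, the recognized table usually has to be reshaped before it becomes a dataframe. The sketch below assumes the service reports each cell with a row index, a column index and its text, which is my understanding of how layout-style table results are typically structured; check the actual response format of the service you use. `Cell` and `cells_to_rows` are hypothetical stand-ins written for this example, not part of any SDK, and the sample cells are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Cell:
    """Hypothetical stand-in for one recognized table cell."""
    row_index: int
    column_index: int
    content: str


def cells_to_rows(cells: Iterable[Cell], row_count: int, column_count: int) -> List[List[str]]:
    """Place each cell's text at its (row, column) position in a grid."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.content.strip()
    return grid


# Made-up cells for a 2x2 table, deliberately out of order.
cells = [
    Cell(1, 1, "42"),
    Cell(0, 0, "Item"),
    Cell(0, 1, "Qty"),
    Cell(1, 0, "Bolts"),
]
rows = cells_to_rows(cells, row_count=2, column_count=2)
print(rows)
# With pandas installed, the first row can serve as the header:
#   pandas.DataFrame(rows[1:], columns=rows[0])
```

Filling a pre-sized grid leaves cells the service did not detect as empty strings, so the rows keep a consistent width.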