
How to extract images from PDF or Word, together with the text around images?

Original post 2019-04-09 09:15:48 9 3 python/ shell/ pdf/ ms-word/ image-extraction

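I found some libraries that can extract images from PDF or Word files, such as docx2txt and pdfimages. But how can I get the content around an image (for example, there may be a caption below the image)? Or get the page number of each image?

Some other tools, such as PyPDF2 and minecart, can extract images page by page. However, I could not get that code to run successfully.

Is there a good way to get some information about the images? (Either for the images obtained from docx2txt or pdfimages, or through another method that extracts images together with that information.)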

3 Answers

I found the code of doc2txt; it simply parses the XML of the docx file. So this is actually a fairly simple task.

Reference: doc2txt
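To illustrate what "parsing the XML of the docx file" can look like, here is a minimal sketch using only the standard library. It is not the code of the tool referenced above. It assumes the usual .docx layout, where `word/_rels/document.xml.rels` lists the relationships of the main document, and it assumes image targets are stored inside the archive. The function name `list_docx_images` is made up for this example.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the package relationship parts (*.rels) in a .docx archive.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def list_docx_images(docx_file):
    """Map relationship ids (e.g. 'rId5') to image paths inside the archive.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        rels_xml = archive.read("word/_rels/document.xml.rels")

    images = {}
    for rel in ET.fromstring(rels_xml).iter(RELS_NS + "Relationship"):
        # Image relationships have a Type URI ending in '/image'.
        if not rel.get("Type", "").endswith("/image"):
            continue
        # Assumption: skip images that are linked rather than embedded.
        if rel.get("TargetMode") == "External":
            continue
        # Targets are normally relative to the 'word/' folder.
        path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
        images[rel.get("Id")] = path.lstrip("/")
    return images
```

The returned paths can be passed to `zipfile.ZipFile.read` to get the raw image bytes, and the relationship ids are what the document body uses to refer to each picture.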

docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. That may get you close to where you want to go.

A few months ago I reworked docx2python to reproduce a structured (multi-level) XML-format file from a docx file, and it has worked well on many files.

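As far as I know, a paragraph contains multiple runs, and each run contains a single piece of text and sometimes an image. You can read this documentation for details: https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1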
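The paragraph/run structure described above can be walked directly with the standard library. The following is only a sketch under some assumptions: pictures are DrawingML images (an `a:blip` element with an `r:embed` attribute), the document has no text boxes that nest paragraphs inside runs, and "surrounding text" simply means the previous and next paragraph. The function names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"


def paragraphs_with_images(docx_file):
    """Return [(text, [image relationship ids]), ...] in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    result = []
    for para in root.iter(W + "p"):
        text_parts, image_ids = [], []
        for run in para.iter(W + "r"):
            # w:t holds the run's text; a:blip marks an embedded picture.
            text_parts.extend(t.text or "" for t in run.iter(W + "t"))
            for blip in run.iter(A + "blip"):
                embed_id = blip.get(R + "embed")
                if embed_id:
                    image_ids.append(embed_id)
        result.append(("".join(text_parts), image_ids))
    return result


def image_context(paragraphs):
    """Yield (image id, previous paragraph, own paragraph, next paragraph)."""
    for i, (text, image_ids) in enumerate(paragraphs):
        before = paragraphs[i - 1][0] if i > 0 else ""
        after = paragraphs[i + 1][0] if i + 1 < len(paragraphs) else ""
        for image_id in image_ids:
            yield image_id, before, text, after
```

A caption placed directly under a picture would then show up as the "next paragraph" value for that image id. Mapping the id back to a file name requires the relationships part of the archive, which this sketch does not read.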

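docx2python supports extracting images together with the text around them. When you read paragraphs with docx2python, ----media/imagen---- shows up in your text; this is an image placeholder. If you set extract_image=True, you can get at the image itself. You will then have your images both in the paragraph text and in the list of image files, and you can match them up however you like.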
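One way to do that matching is sketched below. It works on plain strings and does not call docx2python itself. It assumes the placeholders look like the ones quoted in the answers (a run of four or more hyphens around an optional `media/` prefix and a file name) and that file names contain no hyphens. The name `text_after_images` and the "first non-empty text that follows" rule are choices made for this example, not part of the library.

```python
import re

# Assumed placeholder shape: 4+ hyphens, optional 'media/' prefix,
# a file name with an extension (no hyphens), then 4+ hyphens.
PLACEHOLDER = re.compile(r"-{4,}(?:media/)?([\w.]+?\.\w+)-{4,}")


def text_after_images(paragraphs):
    """Map each image file name to the first non-empty text following it."""
    found = {}
    for i, para in enumerate(paragraphs):
        for match in PLACEHOLDER.finditer(para):
            # Prefer text in the same paragraph, after the placeholder.
            rest = PLACEHOLDER.sub("", para[match.end():]).strip()
            if not rest:
                # Otherwise fall back to the next paragraph that still has
                # text once any placeholders are removed.
                following = (
                    PLACEHOLDER.sub("", p).strip() for p in paragraphs[i + 1:]
                )
                rest = next((p for p in following if p), "")
            found[match.group(1)] = rest
    return found


if __name__ == "__main__":
    sample = [
        "Intro",
        "----media/image1.png----",
        "",
        "Figure 1: overview",
        "-----image2.jpeg---- inline caption",
    ]
    print(text_after_images(sample))
```

The list of paragraphs would come from whatever text docx2python returns for your file. How that text is split into paragraphs depends on the library version, so check its documentation before relying on a particular separator.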



