Convert a PDF file to Excel, including images in tables, using Python

Question

The PDF files I need to convert contain images inside tables. I want to convert both the text and the images in those tables to Excel. Please suggest suitable libraries for this.

Answer 1

You need to write a script that reads the page and detects both the characters and the table cells. You may be able to do that with OpenCV and the built-in character recognition tool, but I don't know how easy it would be. Another way I can think of is to train an ML model that recognizes spreadsheet-style tables in an image. That is really hard, though, and requires a lot of experience.
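As a rough illustration of the OpenCV route, the sketch below finds the cells of a ruled table in a page image and sorts them into rows. It assumes the table has visible grid lines and that the page has already been rendered to an image file; the function names, the `min_line_len` and `tol` parameters, and the thresholds are hypothetical choices, not part of any library.

```python
def detect_cell_boxes(image_path, min_line_len=40, min_size=10):
    """Return (x, y, w, h) boxes for regions enclosed by table grid lines."""
    import cv2  # third-party (opencv-python); imported here so the helper below works without it

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)

    # Invert and binarize so dark ruling lines become white on black.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Morphological opening with long thin kernels keeps only long straight lines.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (min_line_len, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, min_line_len))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    grid = cv2.bitwise_or(horizontal, vertical)

    # Each cell is a connected region of the inverted grid image.
    _, _, stats, _ = cv2.connectedComponentsWithStats(cv2.bitwise_not(grid), connectivity=4)
    height, width = gray.shape
    boxes = []
    for x, y, w, h, _area in stats[1:]:  # row 0 describes the background (the lines)
        touches_border = x == 0 or y == 0 or x + w == width or y + h == height
        if not touches_border and w >= min_size and h >= min_size:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes


def group_into_rows(boxes, tol=10):
    """Group boxes whose top edges are within `tol` pixels, left to right."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]
```

Each box can then be cropped from the page image and passed to an OCR tool, or saved as a picture if the cell holds an image. Tables without ruling lines would need a different approach.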

Answer 2

You can use pikepdf to extract the images from the PDF:

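from pikepdf import Pdf, PdfImage

filename = "sample.pdf"
example = Pdf.open(filename)

# Walk every page and every image object referenced on that page.
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        # extract_to() picks a file extension that matches the image data
        # and returns the path of the file it wrote.
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
        print(out)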

After extracting the images, you can then run OCR on them to convert each image into a table.
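A minimal sketch of that OCR step, assuming pytesseract (with the Tesseract binary installed) for recognition and openpyxl (with Pillow) for writing the workbook. Neither library is named in the answer, so treat both as one possible choice. The helper names and the `min_conf` parameter are hypothetical, and putting each recognized word in its own cell is a simplification; real tables usually need proper column detection.

```python
from collections import defaultdict


def lines_from_ocr_data(data, min_conf=0.0):
    """Group word-level OCR results into lines of words, ordered left to right.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Skip empty entries (page/block/line levels) and low-confidence words.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], word))
    return [[w for _, w in sorted(words)] for _, words in sorted(lines.items())]


def ocr_image_to_lines(image_path):
    """Run Tesseract on one extracted image and return its text lines."""
    import pytesseract  # third-party; needs the Tesseract executable
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return lines_from_ocr_data(data)


def write_to_excel(lines, image_path, out_path="output.xlsx"):
    """Write one word per cell and embed the source image beside the text."""
    from openpyxl import Workbook
    from openpyxl.drawing.image import Image as XlImage  # requires Pillow

    wb = Workbook()
    ws = wb.active
    for words in lines:
        ws.append(words)
    # Anchor the picture two columns to the right of the widest text row.
    widest = max((len(words) for words in lines), default=0)
    anchor = ws.cell(row=1, column=widest + 2).coordinate
    ws.add_image(XlImage(image_path), anchor)
    wb.save(out_path)
```

For an extracted file such as `sample.pdf-page000-img000.png`, calling `write_to_excel(ocr_image_to_lines(path), path)` would produce a workbook that holds the recognized text and the original picture on the same sheet.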

Answer 1
0 2021-04-10 10:54:12

Answer 2
0 2021-04-10 10:54:30
