Converting a PDF file to Excel with images in the table using Python
The PDF files that I need to convert have images inside tables. I want to convert both the text and the images in those tables to Excel. Please suggest suitable libraries for this.
You need to write a script that reads the page and detects both the characters and the table cells. You may be able to do that with OpenCV and the built-in character recognition tool, but I don't know how easy it would be. Another way I can think of is to build an ML model that recognises spreadsheet-style tables from an image. This is really hard though, and requires a lot of experience.
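As a rough illustration of the cell-detection step, the sketch below finds ruling lines in an already binarized page by counting dark pixels per row and per column (a projection profile), then takes the gaps between lines as cells. It is plain Python on a list-of-lists grid so that it runs without dependencies; a real script would more likely do this with OpenCV on an actual image. The names `find_rulings` and `detect_cells` and the 0.9 fill threshold are assumptions made for this example, and it only handles tables whose grid lines are fully drawn.

```python
from typing import List, Tuple

Grid = List[List[int]]            # 1 = dark pixel, 0 = light pixel
Run = Tuple[int, int]             # (start, end) index of one ruling line
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), inclusive


def find_rulings(profile: List[int], length: int, min_fill: float = 0.9) -> List[Run]:
    """Return index runs where at least min_fill of the pixels are dark."""
    runs: List[Run] = []
    for index, count in enumerate(profile):
        if count < min_fill * length:
            continue
        if runs and runs[-1][1] == index - 1:
            runs[-1] = (runs[-1][0], index)   # extend a thick line
        else:
            runs.append((index, index))
    return runs


def detect_cells(grid: Grid) -> List[Box]:
    """Return the cell interiors between ruling lines, row by row."""
    height, width = len(grid), len(grid[0])
    row_profile = [sum(row) for row in grid]
    col_profile = [sum(grid[y][x] for y in range(height)) for x in range(width)]
    h_lines = find_rulings(row_profile, width)
    v_lines = find_rulings(col_profile, height)

    cells: List[Box] = []
    for (_, top_end), (bottom_start, _) in zip(h_lines, h_lines[1:]):
        for (_, left_end), (right_start, _) in zip(v_lines, v_lines[1:]):
            cells.append((top_end + 1, left_end + 1, bottom_start - 1, right_start - 1))
    return cells


# Hypothetical 7x9 "page": lines on rows 0, 3, 6 and columns 0, 4, 8.
demo_grid = [
    [1 if y in (0, 3, 6) or x in (0, 4, 8) else 0 for x in range(9)]
    for y in range(7)
]
cells = detect_cells(demo_grid)
print(cells)
```

Each box can then be cropped out of the page image and handed to the character recognition step on its own, which also tells you which cells hold pictures rather than text.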
You can use PikePDF to extract the images from the PDF:
from pikepdf import Pdf, PdfImage

filename = "sample.pdf"
example = Pdf.open(filename)

# Walk every page and save each embedded image to its own file.
for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        # extract_to adds a file extension that matches the image type.
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")
After extracting the images, you can then use OCR to convert each image to a table.
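A minimal sketch of that OCR-to-table step might look like the following. It assumes pytesseract (with Pillow and the Tesseract binary installed) as the OCR engine, which is only one possible choice and not one named above. `ocr_words` and `group_into_rows` are hypothetical helper names, and the row-grouping rule (words whose vertical centres lie within half a word height of each other share a row) is a simple heuristic, not a robust table parser.

```python
from typing import Any, Dict, List

Word = Dict[str, Any]  # assumed shape: {"text", "left", "top", "height"}


def ocr_words(image_path: str) -> List[Word]:
    """Run Tesseract on one image and return a dict per recognised word.

    Assumes pytesseract, Pillow and the Tesseract binary are installed;
    they are imported here so the rest of the sketch runs without them.
    """
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    words: List[Word] = []
    for text, left, top, height in zip(
        data["text"], data["left"], data["top"], data["height"]
    ):
        if text.strip():  # skip the empty entries Tesseract emits for layout blocks
            words.append({"text": text, "left": left, "top": top, "height": height})
    return words


def group_into_rows(words: List[Word], tolerance: float = 0.5) -> List[List[str]]:
    """Group words into table rows by vertical position, left to right."""
    rows: List[Dict[str, Any]] = []
    for word in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        centre = word["top"] + word["height"] / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance * word["height"]:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["left"])]
        for row in rows
    ]


# Hand-written word boxes standing in for real OCR output.
sample_words = [
    {"text": "Name", "left": 10, "top": 12, "height": 20},
    {"text": "Qty", "left": 200, "top": 10, "height": 20},
    {"text": "Bolt", "left": 10, "top": 60, "height": 20},
    {"text": "4", "left": 200, "top": 62, "height": 20},
]
table = group_into_rows(sample_words)
print(table)
```

With real files you would call `group_into_rows(ocr_words(path))` on each extracted image. Note that this treats every word as its own column, so cells containing several words would need to be merged using the horizontal gaps or known cell boundaries before the rows are written to a spreadsheet.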