How to convert PDF into image readable by opencv-python?

I am using the following code to draw a rectangle around text in an image that matches a date pattern, and it works fine.

import re
import cv2
import pytesseract
from PIL import Image
from pytesseract import Output

img = cv2.imread('invoice-sample.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())

date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'  # raw string so \d is not treated as a string escape

n_boxes = len(d['text'])
for i in range(n_boxes):
    if float(d['conf'][i]) > 60:  # float() also handles non-integer confidence values such as '96.5'
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
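# img is a NumPy array in BGR order and has no save(); convert it to an RGB PIL image first
Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).save("sample.pdf")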

Now, at the end, I get a PDF with rectangles drawn around the matched date patterns.
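
The matching step in the code above can be pulled out into a small function, which makes it easy to reuse once the pages come from a PDF instead of a JPEG. This is only a sketch: find_date_boxes and the sample dictionary are hypothetical names and values, and the dictionary is assumed to have the same keys as the one returned by pytesseract.image_to_data(..., output_type=Output.DICT).

```python
import re

# Same DD/MM/YYYY pattern as in the question, written as a raw string.
DATE_PATTERN = re.compile(r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$')


def find_date_boxes(data, min_conf=60):
    """Return (x, y, w, h) boxes for words that look like dates.

    `data` is assumed to be shaped like the dict produced by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    boxes = []
    for i, text in enumerate(data['text']):
        # float() accepts both '96' and '96.5'; the exact type of 'conf' is an assumption
        if float(data['conf'][i]) > min_conf and DATE_PATTERN.match(text):
            boxes.append((data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i]))
    return boxes


# Hand-made sample standing in for real OCR output (illustrative values only).
sample = {
    'text': ['Invoice', '12/05/2021', '31/13/2021', '01/01/1999'],
    'conf': ['95', '91.5', '88', '40'],
    'left': [10, 120, 240, 360],
    'top': [5, 5, 5, 5],
    'width': [80, 90, 90, 90],
    'height': [20, 20, 20, 20],
}
print(find_date_boxes(sample))
```

Only the second word passes both checks: the third has an invalid month, and the fourth falls below the confidence threshold. Each returned box can be passed to cv2.rectangle exactly as in the loop above.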

I want to give this program a scanned PDF as input instead of the image above. It should first convert the PDF into an image format readable by opencv so that the same processing can be applied. Please help. (Any workaround is fine. I need a solution that converts the PDF to images and uses them directly, without saving them to disk and reading them back, because I have a lot of PDFs to process.)

There is a library named pdf2image. You can install it with pip install pdf2image. Then you can use the following to convert the pages of the PDF to images of the required format:

from pdf2image import convert_from_path

pages = convert_from_path("pdf_file_to_convert")
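for i, page in enumerate(pages):
    # Pillow's format name is "JPEG", not "jpg"; number the files so pages do not overwrite each other
    page.save(f"page_image_{i}.jpg", "JPEG")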

Now you can use the saved image with opencv functions.

You can use BytesIO to do your work without saving the file:

from io import BytesIO
from PIL import Image

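with BytesIO() as f:
    page.save(f, format="JPEG")  # Pillow expects "JPEG" here, not "jpg"
    f.seek(0)
    img_page = Image.open(f)
    img_page.load()  # read the pixel data before the buffer is closed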

You can use the pdf2image library. Install it with this command: pip install pdf2image. You can then convert the file into one or more images readable by cv2. The next code sample converts each PIL Image into something readable by cv2:

Note: The following code requires numpy (pip install numpy).

from pdf2image import convert_from_path
import numpy as np

images_of_pdf = convert_from_path('source2.pdf')  # Convert PDF to List of PIL Images
readable_images_of_pdf = []  # Create a list for the for loop to put the images into
for PIL_Image in images_of_pdf:
    readable_images_of_pdf.append(np.array(PIL_Image))  # Add items to list
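
One detail worth checking with the arrays produced above: a PIL image converted with np.array is normally in RGB channel order, while OpenCV functions such as imshow and imwrite assume BGR, so colours can appear swapped. A minimal sketch of the conversion using NumPy only follows; to_bgr is a hypothetical helper name, and it assumes 3-channel RGB pages. cv2.cvtColor(arr, cv2.COLOR_RGB2BGR) should give the same result.

```python
import numpy as np


def to_bgr(page):
    """Convert an RGB page (PIL image or H x W x 3 array) to a BGR array."""
    rgb = np.asarray(page)
    # Reverse the channel axis (RGB -> BGR) and copy so the result is contiguous
    return np.ascontiguousarray(rgb[:, :, ::-1])


# Stand-in for one converted page: a 2 x 2 pure-red RGB image
red_rgb = np.zeros((2, 2, 3), dtype=np.uint8)
red_rgb[:, :, 0] = 255
red_bgr = to_bgr(red_rgb)
print(red_bgr[0, 0])  # the 255 is now in the last channel, which is R in BGR order
```

With real pages, the list could be built as [to_bgr(p) for p in images_of_pdf].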

The next bit of code converts the PDF into one big image readable by cv2:

import cv2
import numpy as np
from pdf2image import convert_from_path

image_of_pdf = np.concatenate(tuple(convert_from_path('/path/to/pdf/source.pdf')), axis=0)

The pdf2image library's convert_from_path() function returns a list containing each PDF page as a PIL image. We convert the list into a tuple for numpy's concatenate function, which stacks the images on top of each other. If you want them side by side, change the axis argument to 1, which concatenates the images along the horizontal axis (the columns). This next bit of code will show the image on the screen:

cv2.imshow("Image of PDF", image_of_pdf)
cv2.waitKey(0)
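
Note that np.concatenate needs every page to have the same width when stacking vertically (or the same height when placing pages side by side), so it raises an error for PDFs that mix page sizes. One possible workaround, sketched under the assumption that each page is an H x W x 3 array, is to pad the smaller pages with white first. stack_pages is a hypothetical helper, not part of pdf2image or cv2.

```python
import numpy as np


def stack_pages(pages, axis=0, fill=255):
    """Stack H x W x C page arrays along axis 0 (vertical) or 1 (horizontal),
    padding with `fill` where the other dimension differs."""
    arrays = [np.asarray(p) for p in pages]
    other = 1 - axis  # the dimension that must agree before concatenating
    target = max(a.shape[other] for a in arrays)
    padded = []
    for a in arrays:
        pad = [(0, 0)] * a.ndim
        pad[other] = (0, target - a.shape[other])  # pad only at the far edge
        padded.append(np.pad(a, pad, mode='constant', constant_values=fill))
    return np.concatenate(padded, axis=axis)


# Tiny stand-ins for a portrait page and a landscape page
portrait = np.zeros((4, 3, 3), dtype=np.uint8)
landscape = np.zeros((3, 4, 3), dtype=np.uint8)

print(stack_pages([portrait, landscape], axis=0).shape)  # pages on top of each other
print(stack_pages([portrait, landscape], axis=1).shape)  # pages side by side
```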

This will probably create a window that is too big for the screen. To resize the image to fit, use the following code, which relies on cv2's built-in resize function:

import cv2
from pdf2image import convert_from_path
import numpy as np
image_of_pdf = np.concatenate(tuple(convert_from_path('source2.pdf')), axis=0)
size = 0.15 # 0.15 is equal to 15% of the original size.
resized = cv2.resize(image_of_pdf, (int(image_of_pdf.shape[:2][1] * size), int(image_of_pdf.shape[:2][0] * size)))
cv2.imshow("Image of PDF", resized)
cv2.waitKey(0)
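
Instead of hard-coding the factor, it can be derived from the image and screen dimensions. The sketch below is one way to do that; fit_scale is a hypothetical helper, and the page size in the example (three pages of 1654 x 2339 pixels, roughly A4 at 200 dpi) is an assumption for illustration.

```python
def fit_scale(img_w, img_h, max_w=1920, max_h=1080):
    """Largest scale factor (capped at 1.0) at which the image fits the screen."""
    return min(max_w / img_w, max_h / img_h, 1.0)


# Three stacked pages, each assumed to be 1654 x 2339 pixels
stacked_w, stacked_h = 1654, 3 * 2339
scale = fit_scale(stacked_w, stacked_h)
print(round(scale, 3))
```

For these assumed dimensions the result comes out close to the 0.15 used above. The factor can then be passed to cv2.resize(image_of_pdf, None, fx=scale, fy=scale).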

On a 1920x1080 monitor, a size of 0.15 comfortably displays a 3-page document. The downside is that the quality is reduced dramatically. If you want to keep the pages separate, you can use the list returned by convert_from_path() directly. The following code shows each page individually; press any key to go to the next page:

import cv2
from pdf2image import convert_from_path
import numpy

images_of_pdf = convert_from_path('source2.pdf')  # Convert PDF to List of PIL Images
count = 0  # Start counting which page we're on
while True:
    cv2.imshow(f"Image of PDF Page {count + 1}", numpy.array(images_of_pdf[count]))  # Display the page with its number
    cv2.waitKey(0)  # Wait until a key is pressed
    cv2.destroyWindow(f"Image of PDF Page {count + 1}")  # Destroy the current window
    count += 1  # Add to the counter by 1
    if count == len(images_of_pdf):
        break  # Break out of the while loop before you get an "IndexError: list index out of range"

From PDF to an opencv-ready array in two lines of code. I have also added code to resize and view the image in opencv. Nothing is saved to disk.

# imports
from pdf2image import convert_from_path
import cv2
import numpy as np

# convert PDF to image then to array ready for opencv
pages = convert_from_path('sample.pdf')
img = np.array(pages[0])

# opencv code to view image
img = cv2.resize(img, None, fx=0.5, fy=0.5)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Remember that if poppler is not in your Windows PATH variable, you can pass its path to convert_from_path:

poppler_path = r'C:\path_to_poppler'
pages = convert_from_path('sample.pdf', poppler_path=poppler_path)
