
How to convert a PDF to a JPG/PNG in Python with the highest possible quality?

I am trying to convert a PDF to an image so I can OCR it, but the quality is degraded during the conversion.

There seem to be two main methods for converting a PDF to an image (JPG/PNG) with Python: pdf2image and ImageMagick (via Wand).

# pdf2image (altering dpi to 300/600 etc. does not seem to make a difference):
from pdf2image import convert_from_path

pages = convert_from_path("page.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save(f"page_{i}.jpg", 'JPEG')

# ImageMagick (via the Wand library)
from wand.image import Image

with Image(filename="page.pdf", resolution=300) as img:
    img.compression_quality = 100
    img.save(filename="page.jpg")

But if I simply take a screenshot of the PDF on a Mac, the quality is higher than using either Python conversion method.

A good way to see this is to run Tesseract OCR on the resulting images - both Python methods give average results, whereas the screenshot gives perfect results. (I've tried both PNG and JPG.)
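For reference, here is a minimal sketch of how that comparison could be scripted with pytesseract (the pytesseract call and the file names are illustrative, not part of my original setup; it assumes Tesseract is installed and both image files already exist):

# Compare OCR output from two renderings of the same page.
from PIL import Image
import pytesseract

for name in ("page_converted.jpg", "page_screenshot.png"):
    text = pytesseract.image_to_string(Image.open(name))
    print(f"--- {name} ---")
    print(text)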

Assume I have infinite time, computing power and storage space. I am only interested in image quality and OCR output. It's frustrating to have the perfect image just within reach, but not be able to generate it in code.

What is going on here? Is there a better way to convert a PDF? Is there a way I can get more direct control? Why does a screenshot do such a better job than an actual conversion?

You can use PyMuPDF and set the dpi you want:

import fitz  # PyMuPDF

doc = fitz.open('some/pdf/path')
page = doc.load_page(0)
pixmap = page.get_pixmap(dpi=300)  # render the page at 300 dpi; raise for sharper output
img = pixmap.tobytes()             # PNG-encoded bytes by default
# Continue with whatever logic...
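If the goal is OCR, one possible continuation (a sketch under the assumption that Tesseract and pytesseract are installed, not part of the answer above) is to write each page's pixmap straight to a lossless PNG and pass it to Tesseract:

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open('some/pdf/path')
for i, page in enumerate(doc):
    pixmap = page.get_pixmap(dpi=600)   # higher dpi gives a larger, sharper render
    pixmap.save(f"page_{i}.png")        # PNG avoids JPEG compression artifacts
    text = pytesseract.image_to_string(Image.open(f"page_{i}.png"))
    print(text)

Rendering at a high dpi and keeping the intermediate image lossless is what a screenshot of a zoomed-in PDF effectively does, so this tends to close the quality gap you are seeing.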
