Extract scanned pages from PDF using Python

Question

I have a lot of PDF files that are basically scanned documents, so every page is a single scanned image. I want to perform OCR and extract the text from those files. I have tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them in a directory, and then run pytesseract on those images. Is there any way in Python to extract scanned images from PDF files? Or is there any way to perform OCR directly on PDF files?

Answer 1

This question has been addressed in previous Stack Overflow posts:

Converting PDF to images automatically
Converting a PDF to a series of images with Python

Here is a script that may be helpful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Another method: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick
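For the "convert each page to an image" route, here is a minimal sketch. It does not use PythonMagick as in the linked thread. Instead it assumes the third-party pdf2image package (which needs Poppler installed) is available. The function names, the output naming scheme, and the 300 dpi default are illustrative choices, not something taken from the linked posts.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build an output name such as 'report_page003.png' (the scheme is arbitrary)."""
    return f"{Path(pdf_path).stem}_page{page_number:03d}.png"


def pdf_to_images(pdf_path, out_dir, dpi=300):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Assumption: pdf2image and Poppler are installed. The import lives here
    # so the naming helper above can be used without them.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # convert_from_path returns one PIL image per page.
    for number, page in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "PNG")
        paths.append(target)
    return paths
```

Because this renders whole pages, it should work regardless of how the scan is stored inside the PDF, at the cost of re-encoding the image. The resulting PNG files can then be passed to pytesseract.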

Please check previous posts before asking a question.

EDIT:

Including a working script for future reference. The program works with Python 3.6 on Windows:

# coding=utf-8
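# Extract JPEGs from a PDF. Quick and dirty: it only finds images that are
# stored as raw JPEG streams, and writes them next to the script's working directory.
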
with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
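The same scan can be wrapped in functions so that the extracted images go straight into pytesseract, which is the workflow the question describes. This is a sketch under a few assumptions: the function names and the `page0000.jpg` naming are made up for illustration, the OCR step assumes pytesseract, Pillow, and the Tesseract binary are installed, and the end of each JPEG is taken as the last end marker before `endstream` rather than searched for in a 20-byte window as in the script above.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the raw bytes of every JPEG that begins right after a 'stream' keyword."""
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # A JPEG stream starts within a few bytes of the 'stream' keyword.
        start = pdf_bytes.find(JPEG_START, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + 20
            continue
        stream_end = pdf_bytes.find(b"endstream", start)
        if stream_end < 0:
            raise ValueError("stream has no matching endstream")
        # Take the last end-of-image marker that precedes 'endstream'.
        end = pdf_bytes.rfind(JPEG_END, start, stream_end)
        if end < 0:
            raise ValueError("JPEG end marker not found")
        end += len(JPEG_END)
        images.append(pdf_bytes[start:end])
        pos = end
    return images


def save_images(images, out_dir):
    """Write each image to out_dir as page0000.jpg, page0001.jpg, and so on."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, data in enumerate(images):
        path = out_dir / f"page{n:04d}.jpg"
        path.write_bytes(data)
        paths.append(path)
    return paths


def ocr_images(paths, lang="eng"):
    """Run Tesseract on each saved image and return one text string per image."""
    # Assumption: pytesseract and Pillow are installed. Imported here so the
    # extraction helpers above work without them.
    from PIL import Image
    import pytesseract

    return [pytesseract.image_to_string(Image.open(p), lang=lang) for p in paths]
```

A typical call would be `ocr_images(save_images(extract_jpegs(Path("Link/To/PDF/File.pdf").read_bytes()), "pages"))`. Like the original script, this only finds scans embedded as JPEG data, so PDFs that store pages in another encoding will yield no images.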

Answer 1
2 | Accepted | 2018-05-26 16:19:17