Extracting scanned pages from PDF using Python

I have a lot of PDF files that are essentially scanned documents, so every page is a single scanned image. I want to perform OCR and extract the text from those files. I have tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them to a directory, and then run pytesseract on those images. Is there any way in Python to extract scanned images from PDF files? Or is there any way to perform OCR directly on PDF files?

This question has been addressed in previous Stack Overflow posts:

Converting PDF to images automatically
Converting a PDF to a series of images with Python

Here is a script that may be helpful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Another method: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick

Please check previous posts before asking a question.

EDIT:

Including a working script for future reference. The program works with Python 3.6 on Windows:

# coding=utf-8
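# Extract JPEGs from a PDF. Quick and dirty: scans the raw bytes for
# JPEG start/end markers inside PDF stream objects, so it only finds
# images that are stored in the PDF as JPEG data.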

with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
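
To finish the workflow described in the question, the files written by the script above (jpg0.jpg, jpg1.jpg, ...) could then be passed to pytesseract. The following is only a minimal sketch under some assumptions: the function name `ocr_directory` and its `ocr` parameter are hypothetical, and it assumes Pillow, pytesseract, and the Tesseract binary are installed. The `ocr` parameter exists so a different OCR callable can be swapped in.

```python
import re
from pathlib import Path

# Matches the file names produced by the extraction script: jpg0.jpg, jpg1.jpg, ...
_JPG_NAME = re.compile(r"jpg(\d+)\.jpg$")


def ocr_directory(image_dir, ocr=None):
    """Run OCR on every jpgN.jpg in image_dir, in numeric page order.

    Returns a dict mapping file name to recognized text.
    `ocr` is a hypothetical hook: any callable taking a Path and returning str.
    """
    if ocr is None:
        # Imported lazily so the function can be used with a custom `ocr`
        # callable even when pytesseract is not installed.
        from PIL import Image
        import pytesseract

        def ocr(path):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img)

    numbered = []
    for path in Path(image_dir).iterdir():
        match = _JPG_NAME.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    # Sort by the numeric suffix so jpg10.jpg comes after jpg2.jpg.
    return {path.name: ocr(path) for _, path in sorted(numbered)}
```

Calling `ocr_directory(".")` in the directory where the script wrote its output would return the text for each extracted page, keyed by file name.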

