Extracting scanned pages from PDF using Python
I have a lot of PDF files that are basically scanned documents, so every page is a single scanned image. I want to perform OCR and extract the text from those files. I have tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them in a directory, and then run OCR on those images with pytesseract.

Is there any way in Python to extract the scanned images from PDF files? Or is there any way to perform OCR directly on PDF files?
This question has been addressed in previous Stack Overflow posts:
Converting PDF to images automatically
Converting a PDF to a series of images with Python
Here is a script that may be helpful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
Another method: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick
Please check previous posts before asking a question.
EDIT:
Including a working script for future reference. The program works with Python 3.6 on Windows:
# coding=utf-8
# Extract jpg's from pdf's. Quick and dirty.

# Read the whole PDF into memory as bytes.
with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2  # include the two bytes of the end marker
i = 0

njpg = 0
while True:
    # Look for the next stream object in the file.
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPEG stream starts with the start marker right after "stream".
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The end marker is expected just before "endstream".
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
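
To connect this back to the original goal (save the images in a directory, then OCR them), here is a minimal sketch that wraps the same byte scan in a function and adds the OCR step. The names `extract_jpegs` and `save_and_ocr` and the `ocr` parameter are made up for illustration. It assumes pytesseract and the Tesseract binary are installed when `ocr` is left as `None`. Note that this scan only finds pages stored as raw JPEG streams; PDFs whose scans use another image encoding would need one of the rendering approaches from the linked posts instead.

```python
from pathlib import Path

JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the raw bytes of each JPEG stream found in pdf_bytes."""
    images = []
    pos = 0
    while True:
        istream = pdf_bytes.find(b"stream", pos)
        if istream < 0:
            break
        # A JPEG stream begins with the start marker just after "stream".
        istart = pdf_bytes.find(JPEG_START, istream, istream + 20)
        if istart < 0:
            pos = istream + 20
            continue
        iend_stream = pdf_bytes.find(b"endstream", istart)
        if iend_stream < 0:
            raise ValueError("Didn't find end of stream!")
        # The end marker is expected just before "endstream".
        iend = pdf_bytes.find(JPEG_END, iend_stream - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(JPEG_END)
        images.append(pdf_bytes[istart:iend])
        pos = iend
    return images


def save_and_ocr(images, out_dir, ocr=None):
    """Write each image to out_dir and return {filename: recognized text}."""
    if ocr is None:
        # Assumption: pytesseract and the Tesseract binary are installed.
        import pytesseract
        ocr = pytesseract.image_to_string  # accepts an image file path
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = {}
    for n, data in enumerate(images):
        path = out_dir / f"jpg{n}.jpg"
        path.write_bytes(data)
        texts[path.name] = ocr(str(path))
    return texts
```

A possible call, using the same placeholder path as the script above, would be `save_and_ocr(extract_jpegs(Path("Link/To/PDF/File.pdf").read_bytes()), "pages")`. Passing your own `ocr` callable lets you swap in a different OCR engine or test the pipeline without Tesseract.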