
How to convert a scanned PDF to text using Wand in Python

While using Wand and ImageMagick to convert a scanned PDF to text, I get the following error:

Error:

Traceback (most recent call last):
  File "C:/Users/gibin/PycharmProjects/ML/Image_PDF/.ksldwjldf.py", line 28, in <module>
    Get_text_from_image(r"C:\Users\gibin\PycharmProjects\ML\Image_PDF\536676972_image.pdf")
  File "C:/Users/gibin/PycharmProjects/ML/Image_PDF/.ksldwjldf.py", line 13, in Get_text_from_image
    pdf=wi(filename=pdf_path,resolution=300)
  File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\image.py", line 8212, in __init__
    units=units)
  File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\image.py", line 8686, in read
    self.raise_exception()
  File "C:\Users\gibin\AppData\Local\Programs\Python\Python37-32\lib\site-packages\wand\resource.py", line 240, in raise_exception
    raise e
wand.exceptions.DelegateError: FailedToExecuteCommand `"gswin32c.exe" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r300x300"  "-sOutputFile=C:/Users/GIBIN_~1./AppData/Local/Temp/magick-23476_sCYGtEq3gb-%d" "-fC:/Users/GIBIN_~1./AppData/Local/Temp/magick-234763X1vpsurlvH5" "-fC:/Users/GIBIN_~1./AppData/Local/Temp/magick-23476fUlS8Tr85dwk"' (The system cannot find the file specified.
) @ error/delegate.c/ExternalDelegateCommand/459

Code:

import io
from PIL import Image
import pytesseract
from wand.image import Image as wi
import gc

pytesseract.pytesseract.tesseract_cmd = r"C:\Users\gibin\AppData\Local\Tesseract-OCR\tesseract.exe"
def Get_text_from_image(pdf_path):
    print(pdf_path)
    pdf=wi(filename=pdf_path,resolution=300)
    pdfImg=pdf.convert('jpeg')
    imgBlobs=[]
    extracted_text=[]
    for img in pdfImg.sequence:
        page=wi(image=img)
        imgBlobs.append(page.make_blob('jpeg'))
        print(len(imgBlobs))
    for imgBlob in imgBlobs:
        im=Image.open(io.BytesIO(imgBlob))
        text=pytesseract.image_to_string(im)
        print(text)
        extracted_text.append(text)
    return ([i.replace("\n","") for i in extracted_text])
Get_text_from_image(r"C:\Users\gibin\PycharmProjects\ML\Image_PDF\536676972_image.pdf")

This works fine after installing Ghostscript and registering it in an environment variable. Start by downloading and installing Ghostscript.

After that, you need to set the environment variable. Add a new system variable:

Variable: GS_PROG

Value: the full path to your gswin64c.exe file
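
Before retrying the conversion, it can help to confirm that Python can actually see a Ghostscript executable. The following is only a minimal sketch using the standard library: the helper name find_ghostscript and the list of candidate executable names are illustrative assumptions, and the idea that GS_PROG is honored comes from the answer above rather than from anything verified here.

```python
import os
import shutil

# Executable names Ghostscript commonly installs under (assumption):
# 64-bit Windows, 32-bit Windows, and Unix-like systems.
GS_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=GS_CANDIDATES):
    """Return the path of the first Ghostscript executable found, or None.

    Hypothetical helper: it checks the GS_PROG variable suggested above,
    then searches the directories on PATH for each candidate name.
    """
    gs_prog = os.environ.get("GS_PROG")
    if gs_prog and os.path.isfile(gs_prog):
        return gs_prog
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    gs = find_ghostscript()
    if gs is None:
        print("Ghostscript not found: install it, then set GS_PROG or add it to PATH.")
    else:
        print(f"Ghostscript found at: {gs}")
```

If this prints a path, restart the IDE or terminal so the new variable is picked up, then run the script again. Note that the traceback shows ImageMagick trying to launch gswin32c.exe, so the executable name it looks for may depend on which ImageMagick build is installed.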

Have you seen the question "Imagemagick Convert PDF to JPEG: FailedToExecuteCommand `"gswin32c.exe" / PDFDelegateFailed"? Alternatively, you can convert the PDF to JPG images page by page with a different tool. I have used the pdf2img library for this; if you are free to choose any library, I would recommend pdf2img.
