
How to extract text from images in a PDF file using pytesseract

I am trying to use the code below to extract text from the images in a PDF file. The PDF is a contract document, a scanned copy of a contract, so every page in the file is an image.

When I run the code below to extract the text, I get an error saying that it cannot identify the image file.

try:
    import Image
except ImportError:
    from PIL import Image

import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

# Simple image to string (this is the call that raises the OSError shown below)
print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))

Traceback (most recent call last):

  File "C:\Users\Administrator\eclipse-workspace\tesseract\test\greetings.py", line 18, in <module>
    print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\PIL\Image.py", line 2622, in open
    % (filename if filename else fp))
OSError: cannot identify image file 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf'

Please help me with how to go about this.

You are trying to open a PDF file as an image, which is not possible. Pillow states in its docs that it does not support reading PDF files; see: https://pillow.readthedocs.io/en/5.1.x/handbook/image-file-formats.html

However, you can convert the PDF pages to images with the pdf2image library, which gives you Pillow images, and then feed those images to Tesseract.
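The conversion described above might look roughly like the sketch below. It assumes pdf2image and pytesseract are installed, that Poppler (which pdf2image relies on) and the Tesseract binary are available on the system, and that 300 DPI is a reasonable rendering resolution for a scanned contract. The helper names `ocr_pages` and `pdf_to_text` are made up for this example and are not part of either library.

```python
from typing import Callable, Iterable, List


def ocr_pages(pages: Iterable, ocr: Callable[[object], str]) -> List[str]:
    """Run `ocr` on each page image and return one text string per page."""
    return [ocr(page) for page in pages]


def pdf_to_text(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to a Pillow image, OCR it, and join the page texts."""
    # Imported inside the function so the module still loads
    # when the OCR stack is not installed.
    from pdf2image import convert_from_path  # requires Poppler on the system
    import pytesseract                       # requires the Tesseract binary

    # On Windows, point pytesseract at tesseract.exe as in the question:
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

    # convert_from_path returns a list of PIL.Image objects, one per page.
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n\n".join(ocr_pages(pages, pytesseract.image_to_string))
```

Calling `pdf_to_text` with the path to 1184.pdf should then return the recognized text of all pages as one string. Higher DPI values usually improve recognition on scans at the cost of speed and memory, so the default here is only a starting point.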
