
How to extract text from images in a PDF file using pytesseract

I am trying to use the code below to extract text from the images in a PDF file. The PDF is a scanned copy of a contract, so every page in the file is an image.

When I run the code below, I get an error saying that the file cannot be identified as an image file.

try:
    import Image
except ImportError:
    from PIL import Image

import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

# Simple image to string
print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))

Traceback (most recent call last):

  File "C:\Users\Administrator\eclipse-workspace\tesseract\test\greetings.py", line 18, in <module>
    print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\PIL\Image.py", line 2622, in open
    % (filename if filename else fp))
OSError: cannot identify image file 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf'

Please help me understand how to go about this.

You're trying to open a PDF file as an image, which is not possible. Pillow's documentation states that it does not support reading PDF files; see: https://pillow.readthedocs.io/en/5.1.x/handbook/image-file-formats.html

However, you could convert the PDF pages to images with the pdf2image library, then open them with Pillow and feed them to Tesseract.
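
A minimal sketch of that approach, assuming pdf2image (which needs Poppler installed) and pytesseract are available. The function name ocr_pdf and its convert/recognize parameters are hypothetical, introduced here only so the two libraries can be swapped out; by default it calls pdf2image's convert_from_path, which returns one Pillow image per page, and pytesseract's image_to_string.

```python
def ocr_pdf(pdf_path, convert=None, recognize=None, dpi=300):
    """Return a list with the OCR text of each page of a scanned PDF.

    ocr_pdf, convert and recognize are hypothetical names for this sketch.
    convert defaults to pdf2image.convert_from_path and recognize defaults
    to pytesseract.image_to_string.
    """
    if convert is None:
        # Imported lazily; pdf2image relies on the Poppler utilities.
        from pdf2image import convert_from_path
        convert = convert_from_path
    if recognize is None:
        import pytesseract
        recognize = pytesseract.image_to_string

    # Render every PDF page to an image (Pillow Image objects by default).
    pages = convert(pdf_path, dpi=dpi)

    # Run OCR on each rendered page, keeping the page order.
    return [recognize(page) for page in pages]
```

It could be called as ocr_pdf('1184.pdf') and the resulting list joined with newlines to get the whole document. On Windows, pytesseract.pytesseract.tesseract_cmd would still need to be set as in the question, and convert_from_path may need its poppler_path argument if Poppler is not on PATH. A higher dpi usually helps OCR on scanned text at the cost of speed; 300 is an assumption here, not a recommendation from the original answer.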

