How to extract text from images in a PDF file using pytesseract

Question

I am trying to use the code below to extract text from the images in a PDF file. The PDF is a scanned copy of a contract, so every page in the file is an image.

When I run the code, I get an error saying it cannot identify the image file.

try:
    import Image
except ImportError:
    from PIL import Image

import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

# Simple image to string (this is the call that raises the error below)
print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))

Traceback (most recent call last):

  File "C:\Users\Administrator\eclipse-workspace\tesseract\test\greetings.py", line 18, in <module>
    print(pytesseract.image_to_string(Image.open('C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf')))
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python37\lib\site-packages\PIL\Image.py", line 2622, in open
    % (filename if filename else fp))
OSError: cannot identify image file 'C:\\Users\\Administrator\\AppData\\Local\\Programs\\Python\\Python37\\Scripts\\1184.pdf'

Please help me with how to go about this.

Answer 1

You're trying to open a PDF file as an image, which is not possible. The Pillow docs state that reading PDF files is not supported; see: https://pillow.readthedocs.io/en/5.1.x/handbook/image-file-formats.html

However, you could transform the PDF into images with the pdf2image library; it returns each page as a Pillow image, which you can then feed to tesseract.
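
The approach above could be sketched roughly as follows. This is a minimal example, not a tested solution for the asker's file: it assumes pdf2image and its Poppler dependency are installed, and the helper name pdf_to_text and its convert/ocr parameters are made up for illustration (they only exist so the function can be tried without Tesseract or Poppler present).

```python
def pdf_to_text(pdf_path, dpi=300, convert=None, ocr=None):
    """Return OCR text for a scanned PDF, one string per page.

    pdf_to_text, convert and ocr are hypothetical names used for this
    sketch; by default the real pdf2image / pytesseract calls are used.
    """
    if convert is None:
        # pdf2image renders each PDF page to a PIL image (requires Poppler).
        from pdf2image import convert_from_path
        convert = convert_from_path
    if ocr is None:
        # pytesseract runs the Tesseract executable on a PIL image.
        # On Windows, set pytesseract.pytesseract.tesseract_cmd first,
        # as in the question.
        import pytesseract
        ocr = pytesseract.image_to_string

    pages = convert(pdf_path, dpi=dpi)     # list of page images
    return [ocr(page) for page in pages]   # OCR each page in order


# Example call (path shortened here; use the full path to your PDF):
# for number, text in enumerate(pdf_to_text("1184.pdf"), start=1):
#     print("--- page", number, "---")
#     print(text)
```

A higher dpi usually gives Tesseract more to work with on scanned pages, at the cost of speed and memory. On Windows, if Poppler is not on the PATH, convert_from_path also accepts a poppler_path argument pointing at Poppler's bin folder.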

solution1
1 2018-09-26 20:35:23