
pytesseract and image.tif file

I need to transcribe a multi-page image.tif to text using pytesseract. I have the following code:

> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))

The problem is that it only extracts the first page. How can I extract all of them?

I was able to fix the same problem by calling the convert() method, as below:

image = Image.open(imagePath).convert("RGBA")
text = pytesseract.image_to_string(image)
print(text)

I guess you have mentioned only one image, "camara.tif". First you have to convert all the PDF pages into images; you can see this link for doing so.

Next, use pytesseract to loop over the images one by one and extract the text from each.
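
For a multi-page TIFF, the page-by-page loop might look like the sketch below. It is not code from this thread: it assumes Pillow's ImageSequence.Iterator for stepping through the pages, and the function name ocr_all_pages and its ocr parameter are made up for illustration. The file name and language are the ones from the question.

```python
from PIL import Image, ImageSequence


def ocr_all_pages(path, lang="spa", ocr=None):
    """Return a list with the recognized text of every page in a multi-page TIFF.

    `ocr_all_pages` and its `ocr` parameter are hypothetical names for this sketch.
    """
    if ocr is None:
        # Imported lazily so the page loop can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string

    texts = []
    with Image.open(path) as img:
        # ImageSequence.Iterator seeks to each frame (page) of the file in turn.
        for page in ImageSequence.Iterator(img):
            # copy() gives the OCR call its own image, independent of the next seek.
            texts.append(ocr(page.copy(), lang=lang))
    return texts
```

Calling ocr_all_pages('CAMARA.tif') should then return one string per page, which can be printed or joined as needed.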

I just stumbled over the same problem. What you could do is call tesseract directly:

# test.py
import subprocess

in_filename = 'file_0.tiff'
out_filename = 'out'
lang = 'spa'
subprocess.call(['tesseract', in_filename, '-l', lang, out_filename ])

This would process all pages:

$ python test.py 
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
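
To get the text back into Python after a direct tesseract call, something like the following sketch could work. It assumes that tesseract writes its result to the output base name plus ".txt" and that pages are separated by a form feed character (the default page separator, as far as I know). The helper names build_command, split_pages and ocr_with_cli are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(in_filename, out_base, lang):
    """Build the tesseract command line: input image, output base name, language."""
    return ["tesseract", in_filename, out_base, "-l", lang]


def split_pages(text):
    """Split tesseract's text output into pages.

    Assumes pages are separated by a form feed ("\f"), the default separator.
    """
    pages = text.split("\f")
    # The output normally ends with a separator, which leaves one empty trailing entry.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def ocr_with_cli(in_filename, out_base="out", lang="spa"):
    """Run tesseract on all pages of the file and return a list of page texts."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(in_filename, out_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    return split_pages(Path(out_base + ".txt").read_text(encoding="utf-8"))
```

With the file from the example above, ocr_with_cli('file_0.tiff') would return a list with one entry per page.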

