
Python - Image to text enclosed in pentagon shape pytesseract

I am trying to read the Energy Efficiency Rating from an EPC certificate using Python. EPC certificates usually come in PDF format. I have already converted the PDF into an image and am using pytesseract to extract text from it. However, I am not getting the expected results.

Sample Image: [image: enter image description here]

Expected output: Current rating : 79, Potential rating : 79

What I have tried so far:

from pdf2image import convert_from_path
import pytesseract
from PIL import Image

# Render the PDF at 500 DPI; convert_from_path returns a list of PIL images
pages = convert_from_path(r'my_file.pdf', 500)
out_path = r'F:\Freelancer\EPC rating\fwdepcs\out.jpg'
pages[0].save(out_path, 'JPEG')  # save() returns None, so there is nothing to assign
# Run OCR on the whole first page
text = pytesseract.image_to_string(Image.open(out_path))

However, the extracted text does not contain 79.

I also tried cv2 pattern matching and shape detection, but those did not work for other reasons.

You say that you have already converted the PDF to an image file.

Use PIL (Image.crop()) or OpenCV to crop the picture down to the rating area, like this:

[image: enter image description here]

Then convert the crop to a 1-bit black-and-white image with PIL's Image.convert("1"); Tesseract may then be able to recognize the number. If not, I think you can use jTessBoxEditor to train Tesseract.
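A rough sketch of the crop-and-binarize idea, not a tested solution for this particular certificate. The function names, the crop box, the scale factor, and the threshold are all hypothetical and would need tuning against the real image. It thresholds explicitly before calling convert("1"), because convert("1") dithers by default, which can hurt OCR. The `--psm 7` option (treat the image as a single line of text) and the digit whitelist are assumptions about what helps here; the whitelist may be ignored by some Tesseract 4.0 builds.

```python
from PIL import Image, ImageOps


def prepare_rating_crop(page, box, scale=2, threshold=160, invert=False):
    """Crop the rating area, enlarge it, and turn it into a 1-bit image."""
    crop = page.crop(box)  # box = (left, upper, right, lower), in pixels
    crop = crop.resize((crop.width * scale, crop.height * scale), Image.LANCZOS)
    gray = crop.convert("L")
    if invert:
        # Light digits on a dark arrow may OCR better as dark on light
        gray = ImageOps.invert(gray)
    # Hard threshold first so convert("1") has nothing left to dither
    return gray.point(lambda p: 255 if p > threshold else 0).convert("1")


def read_rating(page, box, **kwargs):
    """Run Tesseract on the prepared crop and return the stripped text."""
    import pytesseract  # imported here so the preprocessing works without it

    img = prepare_rating_crop(page, box, **kwargs)
    return pytesseract.image_to_string(
        img, config="--psm 7 -c tessedit_char_whitelist=0123456789"
    ).strip()
```

With the question's code this could be called as `read_rating(pages[0], (left, upper, right, lower))`, once for the current-rating arrow and once for the potential-rating arrow, using coordinates measured from the rendered page.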
