
Given an image like the one shown, how would you suggest improving character recognition using pytesseract?

The image I am testing with is shown below.

[image: enter image description here]

I am very new to OCR and wondered what sort of techniques I could apply to try and improve the accuracy of the method in Python, probably using PIL, but I am open to suggestions. With the raw image, no characters are recognised at all.

Apologies if the question is a little open-ended but, as I mentioned, I am very new to OCR in general.

Edit 1: as per the suggestion, here is the code I have so far:

from PIL import Image
import cv2
import pytesseract

# Open the test image and convert it to 1-bit black and white
image_file = Image.open('rsTest.jpg')
image_file = image_file.convert('1')
image_file.save('PostPro.jpg', dpi=(400, 400))
image_file.show()

# Run tesseract on the preprocessed image
new_image = Image.open('PostPro.jpg')
print(pytesseract.image_to_string(new_image))

How consistent are your images? If they all look like the one you posted, the first thing you need to do is crop the image:

# Since you are importing cv2, read the image as a numpy array
full_image = cv2.imread('rsTest.jpg')
# start_y, end_y, start_x, end_x are the bounds of the region containing the text
crop_image = full_image[start_y:end_y, start_x:end_x]

Then you can keep only the white pixels (which are the letters) and turn everything else black:

# Set every pixel that is not pure white to black
crop_image[np.where((crop_image != [255, 255, 255]).all(axis=2))] = [0, 0, 0]

Then apply OCR with tesseract:

# Convert the numpy array back to a PIL image and run tesseract on it
img = Image.fromarray(crop_image)
captchaText = pytesseract.image_to_string(img)

You would need to import cv2, numpy, pytesseract and PIL.
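
Putting the steps together, here is a minimal end-to-end sketch (not from the original answer); the crop bounds start_y, end_y, start_x, end_x are placeholder values that you would need to adjust to the region of your own images that contains the text:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Read the image as a numpy array (BGR channel order)
full_image = cv2.imread('rsTest.jpg')

# Placeholder crop bounds -- adjust these to your image
start_y, end_y, start_x, end_x = 0, 100, 0, 300
crop_image = full_image[start_y:end_y, start_x:end_x]

# Keep only pure white pixels (the letters); turn everything else black
crop_image[np.where((crop_image != [255, 255, 255]).all(axis=2))] = [0, 0, 0]

# Convert back to a PIL image and run tesseract
img = Image.fromarray(crop_image)
captchaText = pytesseract.image_to_string(img)
print(captchaText)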
