I need help in single uppercase character recognition using pytesseract

Question

I have this picture of characters evenly separated:

and using cv2 I inverted it to this:

I also drew contours around the letters to help the OCR. But when I run image_to_string, some lines of the resulting text are almost completely missing:

E
IN IA
ES
RVMARABILLARRBAGAZ
EARAVARGQNGUESUSAV
ANNA
AQCOOLLEMREVVCEGAO
ZUVAGOLEBONNABAL XL
REOORMOBILEJAHABAQ
IE II
VRBAONVTVFORÑEBIEP
O00EGREELOVCAVRDLA
A
IN A
EOLREBELAROSBTLVAS
TI
A |

To get the output I'm calling data = pytesseract.image_to_string(invimage, lang='spa', config='--psm 6'), with the Spanish language data so that the "Ñ" character is recognized. Any tips on what I'm doing wrong?

Answer 1

I too am a new contributor, so please forgive anything misleading or incorrect in this answer. I tried extracting the text from your image and the results were pretty good. Here is the output image with bounding boxes:

I used the image_to_data function instead of image_to_string so that I also get the confidence value of each line of text.

Output:

QCCOVARDECRATOBHÍv
CHIBOVZINREVÁVRWTOI # recognized an extra O at the end
VULTOOGCONVOIBORGO
RVMARABILLARRBAGAZ
EARAVARGOQONGUESUSA
V
BSVKOZNAVARAGVÚCTL
AQCOOLLEMREVVCEGAO
ZUVAGOLEBONNABALXL
REODORMOBILEJAHABACQ
EIBBTAODORVICAAOSVR
VRBAONVTVFORÑEBIER
OO0ODEGREELOVCAVRDLA
GBCBTOTBLEOOATXMIAQ
SVALAVANELVOILOVNJ
EOLREBELAROSBTLVAS
VASTORETAVALEARTYW
ADOVNGRAVATAMJREÓ
Í

Still, there are a few misrecognitions, such as the Spanish U in the 5th line of the image, and Tesseract even added a few extra characters.
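Some of these stray characters can be tidied up after recognition. The following is only a minimal sketch, assuming the grid contains nothing but the uppercase letters A to Z plus Ñ; the helper name clean_row and the "0" to "O" substitution are my own assumptions, not part of pytesseract. It strips accents that Tesseract added, maps the digit zero to the letter O, and drops anything else that is not in the alphabet.

```python
import unicodedata

# Assumed alphabet of the puzzle: A-Z plus N-with-tilde (U+00D1).
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00d1")
# Assumed look-alike substitution; extend only if it matches your data.
LOOKALIKES = {"0": "O"}


def clean_row(raw):
    """Return raw with only characters from ALLOWED, accents stripped."""
    cleaned = []
    # NFC first, so a decomposed N + tilde is treated as one character.
    for ch in unicodedata.normalize("NFC", raw).upper():
        ch = LOOKALIKES.get(ch, ch)
        if ch not in ALLOWED:
            # Decompose and keep the base letter, e.g. accented I becomes I.
            ch = unicodedata.normalize("NFD", ch)[0]
        if ch in ALLOWED:
            cleaned.append(ch)
    return "".join(cleaned)


if __name__ == "__main__":
    print(clean_row("ZUVAGOLEBONNABAL XL"))
```

This does not repair rows where Tesseract inserted or dropped a real letter; comparing the cleaned row length with the expected grid width would at least flag those rows for a manual check.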

Here is the Python code I used:

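import cv2
import pytesseract
from pytesseract import Output

# Assumed preprocessing: "image.png" is a placeholder path, and otsu is taken
# to be the Otsu-thresholded grayscale image.
img = cv2.imread("image.png")
img_copy = img.copy()
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

custom_oem_psm_config = r'--oem 3 --psm 6'
ocr = pytesseract.image_to_data(otsu, output_type=Output.DICT, config=custom_oem_psm_config, lang='spa')

texts = []
for i in range(len(ocr['text'])):
    # conf is -1 for entries that are not recognized words (blocks, paragraphs, lines).
    # float() is used because some pytesseract versions report conf as '95.5'.
    if float(ocr['conf'][i]) != -1:
        x, y, w, h = ocr['left'][i], ocr['top'][i], ocr['width'][i], ocr['height'][i]
        cv2.rectangle(img_copy, (x, y), (x + w, y + h), (255, 0, 0), 2)
        texts.append(ocr['text'][i])

# Print one recognized word per line
string = "\n".join(texts)
print("String: ", string)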
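Since image_to_data reports one entry per word, a per-line confidence has to be computed from those entries. Here is a rough sketch of how that could be done from the dictionary returned with output_type=Output.DICT. The helper name lines_with_confidence is hypothetical; the keys block_num, par_num, line_num, conf and text are the ones present in that dictionary.

```python
from collections import defaultdict


def lines_with_confidence(ocr):
    """Group word entries of an image_to_data dict into text lines.

    Returns a list of (line_text, mean_confidence) tuples, ordered by
    block, paragraph and line number.
    """
    grouped = defaultdict(list)
    for i, word in enumerate(ocr["text"]):
        conf = float(ocr["conf"][i])
        # conf is -1 for block/paragraph/line rows; skip those and empty words.
        if conf < 0 or not word.strip():
            continue
        key = (ocr["block_num"][i], ocr["par_num"][i], ocr["line_num"][i])
        grouped[key].append((word, conf))

    result = []
    for key in sorted(grouped):
        words = grouped[key]
        text = " ".join(w for w, _ in words)
        mean_conf = sum(c for _, c in words) / len(words)
        result.append((text, mean_conf))
    return result
```

Calling lines_with_confidence(ocr) on the dictionary from the code above gives one tuple per text line, so lines with a low mean confidence can be singled out and checked by hand.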

Thank you
