
Tesseract showing gibberish

I'm using the pytesseract library to create an OCR translation Discord bot, but the output from tesseract is about 90% gibberish, and I do not understand why.

The image I'm using is already cropped to the area I want to read. I have tried converting the image to grayscale via PIL, but then pytesseract outputs nothing at all.

I'm using the latest versions of both pytesseract (0.2.7) and tesseract (v5 alpha).

I use the following code to fetch the image from the internet, pass it through tesseract, and later (commented out) translate the text.

from PIL import Image
import requests
import pytesseract
from io import BytesIO

from translate import Translator

translator = Translator(from_lang="autodetect", to_lang="en")

response = requests.get('https://image.prntscr.com/image/acqm3LDeSJOHtUZEMfA9eA.png')

#image = Image.open(BytesIO(response.content)).convert('LA')
image = Image.open(BytesIO(response.content))
string = pytesseract.image_to_string(image, lang='fra')
#image.save('greyscale.png')

print(string.format())

#translation = translator.translate(string)

#print(translation)

The output I get from tesseract can be found here: https://pastebin.com/kDYuTE4Q

I'm entirely new to both tesseract and Python, so I may be doing something fundamentally wrong, or asking something of tesseract that just isn't possible at the moment.

You get a lot of benefit just by inverting the image: Tesseract seems to prefer black text on a white background. I also got some improvement by increasing the contrast.

from PIL import Image, ImageOps, ImageEnhance
import requests
import pytesseract
from io import BytesIO


pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

response = requests.get('https://image.prntscr.com/image/acqm3LDeSJOHtUZEMfA9eA.png')

image = Image.open(BytesIO(response.content))

if image.mode == 'RGBA':
    r, g, b, a = image.split()
    image = Image.merge('RGB', (r, g, b))

image = ImageOps.invert(image)

contrast = ImageEnhance.Contrast(image)
image = contrast.enhance(2)

config = '--psm 6'  # assume a single uniform block of text

txt = pytesseract.image_to_string(image, config = config, lang='fra')

print(txt)

A little discussion: the original PNG has an alpha channel that trips up the invert operation, so we use a trick of splitting the image into its individual channels and merging them back into an RGB image. The ImageEnhance module has somewhat awkward syntax, but it works, and you can get the idea of how to use it from the code above.
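As a minimal, self-contained illustration of the alpha-channel handling (this snippet is mine, not from the original answer): Pillow's `convert('RGB')` simply drops the alpha channel while keeping the RGB values unchanged, so it is a one-line alternative to the `split()`/`merge()` trick.

```python
from PIL import Image

# Build a tiny RGBA image: fully red with 50% alpha.
rgba = Image.new('RGBA', (2, 2), (255, 0, 0, 128))

# convert('RGB') discards the alpha channel; the RGB values survive intact.
rgb = rgba.convert('RGB')
print(rgb.mode, rgb.getpixel((0, 0)))  # → RGB (255, 0, 0)
```

Either approach leaves you with a plain RGB image that `ImageOps.invert` can handle.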

Here's the output:

Je savais niveau 28
lly a Tl heures
Bonne nuit
lly a Tl heures
Je sors mon chien je reviens après
Ilya 1l heures
Je t en reprendrai un demain écris moi quand tu es en
ligne
Ilya 1l heures
Bonne nuit
y a f heures
Oki
Ily a T1 heures

Not bad. The timestamps aren't great, but if you look at the original image, the resolution of those letters isn't great either. If you experiment with the image some more (contrast, thresholding, etc.), you may get better results.
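The thresholding idea can be sketched with Pillow's `point()` method; this is a hypothetical example of mine, and the cutoff value of 128 is an arbitrary assumption you would tune per image.

```python
from PIL import Image

# A tiny 4-pixel grayscale image standing in for the real screenshot.
img = Image.new('L', (4, 1))
img.putdata([10, 100, 150, 250])

# Binarize: anything above the (assumed) cutoff becomes white, the rest black.
binary = img.point(lambda p: 255 if p > 128 else 0)
print(list(binary.getdata()))  # → [0, 0, 255, 255]
```

Feeding a cleanly binarized image to tesseract often helps with low-contrast text like those timestamps.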
