
Tesseract showing gibberish

I'm using the pytesseract library to build a Discord bot that OCRs and translates images, but about 90% of Tesseract's output is gibberish, and I don't understand why.

The image I'm using is already cropped to the area I care about. I tried converting it to grayscale with PIL, but then pytesseract outputs nothing at all.

I'm using the latest versions of both pytesseract (0.2.7) and Tesseract (v5 alpha).

I use the following code to download the image, pass it through Tesseract, and later (currently commented out) translate the text.

from PIL import Image
import requests
import pytesseract
from io import BytesIO

from translate import Translator

translator = Translator(from_lang="autodetect", to_lang="en")

response = requests.get('https://image.prntscr.com/image/acqm3LDeSJOHtUZEMfA9eA.png')

#image = Image.open(BytesIO(response.content)).convert('LA')
image = Image.open(BytesIO(response.content))
string = pytesseract.image_to_string(image, lang='fra')
#image.save('greyscale.png')

print(string)

#translation = translator.translate(string)

#print(translation)

The output I get from Tesseract can be found here: https://pastebin.com/kDYuTE4Q

I'm entirely new to both Tesseract and Python, so I may be doing something fundamentally wrong, or asking Tesseract for something that just isn't possible at the moment.

You get a lot of benefit from simply inverting the image: Tesseract seems to prefer black text on a white background. I also got some improvement by increasing the contrast.

from PIL import Image, ImageOps, ImageEnhance
import requests
import pytesseract
from io import BytesIO


# Only needed when tesseract is not on PATH; adjust to your install location
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

response = requests.get('https://image.prntscr.com/image/acqm3LDeSJOHtUZEMfA9eA.png')

image = Image.open(BytesIO(response.content))

# ImageOps.invert has trouble with the alpha channel, so keep only R, G and B
if image.mode == 'RGBA':
    r, g, b, a = image.split()
    image = Image.merge('RGB', (r, g, b))

image = ImageOps.invert(image)

contrast = ImageEnhance.Contrast(image)
image = contrast.enhance(2)

# Page segmentation mode 6: assume a single uniform block of text
config = '--psm 6'

txt = pytesseract.image_to_string(image, config=config, lang='fra')

print(txt)

A little discussion. The original PNG has an alpha channel, which causes problems for the invert operation, so the trick is to split the image into its individual channels and merge only the R, G, and B channels back into an RGB image. The ImageEnhance module has a slightly awkward syntax (you create an enhancer object, then call enhance() with a factor), but it works, and the code above shows the idea.

Here's the output:

Je savais niveau 28
lly a Tl heures
Bonne nuit
lly a Tl heures
Je sors mon chien je reviens après
Ilya 1l heures
Je t en reprendrai un demain écris moi quand tu es en
ligne
Ilya 1l heures
Bonne nuit
y a f heures
Oki
Ily a T1 heures

Not bad. The timestamps aren't great, but if you look at the original image, the resolution of those characters is poor. If you experiment with the image some more (contrast, thresholding, etc.), you may be able to get better results.
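As a starting point for that kind of experimenting, here is a minimal sketch of two preprocessing steps using only PIL, run on a small synthetic image so it needs neither the network nor Tesseract. The helper names `drop_alpha_and_invert` and `upscale_and_threshold` are made up for illustration, and the `scale` and `cutoff` defaults are guesses to tune, not values I tested on the screenshot. It uses `convert('RGB')` in place of the split/merge trick; for an RGBA image both simply discard the alpha channel.

```python
from PIL import Image, ImageOps


def drop_alpha_and_invert(image):
    """Discard any alpha channel, then invert colors (hypothetical helper)."""
    if image.mode != 'RGB':
        # For RGBA input, convert('RGB') keeps the color values and drops alpha
        image = image.convert('RGB')
    return ImageOps.invert(image)


def upscale_and_threshold(image, scale=2, cutoff=128):
    """Enlarge a grayscale copy and binarize it.

    scale and cutoff are illustrative guesses, not tuned values.
    """
    gray = image.convert('L')
    bigger = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels at or above the cutoff become white, everything else black
    return bigger.point(lambda p: 255 if p >= cutoff else 0)


# Synthetic stand-in for the screenshot: a light block on a dark RGBA background
sample = Image.new('RGBA', (20, 12), (20, 20, 20, 255))
sample.paste((240, 240, 240, 255), (6, 3, 14, 9))

inverted = drop_alpha_and_invert(sample)
prepared = upscale_and_threshold(inverted)
print(inverted.mode, inverted.getpixel((0, 0)), prepared.size)
```

On a real image, `prepared` could then be passed to `pytesseract.image_to_string(prepared, config='--psm 6', lang='fra')` in the same way as in the script above. Whether it actually helps with the timestamps on this particular screenshot is something you would have to try.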
