Python - Improving Tesseract OCR to recognize a list of names

Question

I'm working on a project that recognizes the teams in a game (Overwatch) and records which players were on which team. It has a predefined list of who is playing; it only needs to recognize which image each player appears in. So far I have successfully captured an image for each team and gotten a rough output of each player's name, but the OCR is confusing several letters.

My input images:

And the output I get from OCR:

    W THEMIGHTVMRT
    ERSVZENVRTTR
    ERSVLUCID
    ERSVZRRVR
    ERSVMEI
    EFISVSDMBRR

    ERSV RNR
    ERSVZENVRTTR
    EFISVZHRVR
    ERSVMCCREE
    ERSVMEI
    EHSVRDRDHDG

From this, you can see that the OCR confuses "A" with "R" and "Y" with "V". I was able to get the font file that Overwatch uses and generate a .traineddata file using Train Your Tesseract. I'm aware that there is probably a better way of generating this file, though I'm not sure how.

My code:

    from pytesseract import *
    import pyscreenshot

    pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
    tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

    team1 = pyscreenshot.grab(bbox=(50,450,530,810)) # X1, Y1, X2, Y2
    team1.save("team1screenshot.png")
    team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')

    team2 = pyscreenshot.grab(bbox=(800,450,1280,810)) # X1, Y1, X2, Y2
    team2.save("team2screenshot.png")
    team2text = pytesseract.image_to_string(team2, config=tessdata_dir_config, lang='owf')

    print(team1text)
    print("------------------")
    print(team2text)

How should I improve the recognition of these characters? Do I need a better .traineddata file, or is it a matter of better image processing?

Thanks for any help!

Answer 1

As @FlorianBrucker mentioned, running a similarity test on the strings can (with some fine-tuning) find the correct string after the OCR stage.
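A minimal sketch of that idea, using the standard library's difflib. The roster below is made up for illustration (the real one is the project's predefined player list), and match_name, canonical and the 0.6 cutoff are hypothetical names and values, not anything from the original code. Because the question reports that "A" is read as "R" and "Y" as "V", the sketch collapses each pair to one symbol on both sides before comparing:

```python
import difflib

# Hypothetical roster for illustration; the real list comes from the project.
ROSTER = ["THEMIGHTYMAT", "EASYZENYATTA", "EASYLUCIO",
          "EASYZARYA", "EASYMEI", "EASYSOMBRA"]

# Collapse the letter pairs the question reports as confused (A/R and Y/V)
# so that both sides of the comparison use the same symbol.
CONFUSIONS = str.maketrans({"A": "R", "Y": "V"})


def canonical(text):
    """Uppercase, drop whitespace, and collapse the confused letter pairs."""
    return "".join(text.upper().split()).translate(CONFUSIONS)


def match_name(ocr_line, roster=ROSTER, cutoff=0.6):
    """Return the roster name closest to ocr_line, or None if nothing is close."""
    by_canonical = {canonical(name): name for name in roster}
    hits = difflib.get_close_matches(
        canonical(ocr_line), list(by_canonical), n=1, cutoff=cutoff
    )
    return by_canonical[hits[0]] if hits else None
```

With this roster, match_name("ERSVMEI") returns "EASYMEI" and match_name("ERSVLUCID") returns "EASYLUCIO". The cutoff is the part that needs tuning: too low and unrelated lines get matched, too high and badly garbled names are dropped.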

Answer 2

You could try a custom OCR config to do a sparse text search: "Find as much text as possible in no particular order."

Set psm to 11 in the Tesseract config.

See if you can do this:

    tessdata_dir_config = "--oem 3 --psm 11"

To see a complete list of supported page segmentation modes (psm), use tesseract -h (newer releases print the list with tesseract --help-psm). Here's the list as of 3.21:

0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

I'm using the Python wrapper for Tesseract, pytesseract: https://github.com/madmaze/pytesseract

Here you can configure Tesseract as:

    custom_oem_psm_config = r'--oem 3 --psm 6'
    pytesseract.image_to_string(image, config=custom_oem_psm_config)
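Note that replacing the question's tessdata_dir_config string with only the --oem/--psm flags drops the --tessdata-dir option it originally carried. The sketch below shows one way to combine them. build_config and read_team are hypothetical helper names, and the defaults (psm 6, lang 'owf' from the question) are assumptions to adjust; only the --oem, --psm and --tessdata-dir flags and pytesseract.image_to_string are taken from the thread.

```python
def build_config(psm, oem=3, tessdata_dir=None):
    """Assemble a Tesseract command-line config string for pytesseract."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if tessdata_dir:
        # Quote the path so directories containing spaces survive parsing.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


def read_team(image, psm=6, lang="owf", tessdata_dir=None):
    """Run OCR on one team image with the chosen page segmentation mode."""
    import pytesseract  # imported lazily so build_config works without it

    config = build_config(psm, tessdata_dir=tessdata_dir)
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

Calling read_team(team1, psm=11) and read_team(team1, psm=6) on the same screenshot is a quick way to compare the sparse-text and single-block modes on the actual images.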

Answer 1: 2017-07-13 11:51:29

Answer 2: 2021-01-15 10:55:20
