Python - Improving Tesseract OCR to recognize list of names
I'm working on a project that will recognize teams in a game (Overwatch) and record which players were on which team. It has a predefined list of who is playing; it only needs to work out which image each player appears in. So far I have managed to capture the images for each team and get a rough output for each player's name; however, the OCR is confusing several letters.
My input images:
And the output I get from OCR:
W THEMIGHTVMRT
ERSVZENVRTTR
ERSVLUCID
ERSVZRRVR
ERSVMEI
EFISVSDMBRR
ERSV RNR
ERSVZENVRTTR
EFISVZHRVR
ERSVMCCREE
ERSVMEI
EHSVRDRDHDG
From this, you can see that the OCR confuses "A" with "R" and "Y" with "V". I was able to get the font file that Overwatch uses and generate a .traineddata file using Train Your Tesseract. I'm aware that there is probably a better way of generating this file, though I'm not sure how.
My code:
import pyscreenshot
import pytesseract

# Point pytesseract at the Tesseract executable and the custom tessdata directory.
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'

# Grab each team's region of the screen and OCR it with the custom 'owf' language.
team1 = pyscreenshot.grab(bbox=(50, 450, 530, 810))  # X1, Y1, X2, Y2
team1.save("team1screenshot.png")
team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')

team2 = pyscreenshot.grab(bbox=(800, 450, 1280, 810))  # X1, Y1, X2, Y2
team2.save("team2screenshot.png")
team2text = pytesseract.image_to_string(team2, config=tessdata_dir_config, lang='owf')

print(team1text)
print("------------------")
print(team2text)
How should I improve the recognition of these characters? Do I need a better .traineddata file, or is it a matter of better image processing?
Thanks for any help!
As @FlorianBrucker mentioned, running a similarity test on the strings can (with some fine-tuning) find the correct string after the OCR stage.
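Here is a minimal sketch of that idea using the standard library's difflib. The roster below is hypothetical (a guess at what some of the names might be, for illustration only), and the helper names normalize and best_match are mine, not part of any library. Since the question reports that "A" is read as "R" and "Y" as "V", the sketch folds each confusable pair into one letter on both sides before comparing, so those specific mistakes no longer count against a match.

```python
import difflib

# Hypothetical roster; replace with the real predefined list of players.
ROSTER = ["EASY ZENYATTA", "EASY LUCIO", "EASY ZARYA", "EASY MEI"]

# Confusions reported in the question: "A" read as "R", "Y" read as "V".
# Folding both letters of a pair into one makes them compare as equal.
FOLD = str.maketrans({"A": "R", "Y": "V"})


def normalize(text):
    """Upper-case, drop spaces and punctuation, fold confusable letters."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum())
    return kept.translate(FOLD)


def best_match(ocr_line, roster=ROSTER, cutoff=0.6):
    """Return the roster name closest to ocr_line, or None if none is close."""
    folded = {normalize(name): name for name in roster}
    hits = difflib.get_close_matches(
        normalize(ocr_line), list(folded), n=1, cutoff=cutoff
    )
    return folded[hits[0]] if hits else None


print(best_match("ERSVZENVRTTR"))  # EASY ZENYATTA
print(best_match("ERSVLUCID"))     # EASY LUCIO
```

The cutoff is the part that needs fine-tuning: too low and unrelated text gets assigned to a player, too high and noisy lines are dropped. Note also that folding can make two different roster names collide, so check your real list for that before relying on it.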
You could try a custom OCR config that does a sparse text search: "Find as much text as possible in no particular order."
Set psm to 11 in the Tesseract config.
See if you can do this:
tessdata_dir_config = "--oem 3 --psm 11"
To see a complete list of supported page segmentation modes (psm), use tesseract -h. Here's the list as of 3.21:
I'm using the Python wrapper for Tesseract, pytesseract (https://github.com/madmaze/pytesseract).
Here you can configure tesseract as:
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)
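One thing to watch when combining these suggestions with the code in the question: the config argument is a single string, so assigning only "--oem 3 --psm 11" to it drops the --tessdata-dir option that the custom 'owf' language presumably depends on. A small sketch of a helper that keeps the options together follows; build_tesseract_config is a hypothetical name of mine, not a pytesseract function, and the path in the usage line is a placeholder.

```python
def build_tesseract_config(psm=6, oem=3, tessdata_dir=None):
    """Assemble one config string for pytesseract's `config` argument.

    psm 6 assumes a single uniform block of text; psm 11 is sparse text
    ("find as much text as possible in no particular order").
    """
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if tessdata_dir:
        # Quote the directory so a path containing spaces stays one argument.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


# Placeholder path; forward slashes avoid backslash-escaping issues on Windows.
config = build_tesseract_config(psm=11, tessdata_dir="C:/Tesseract-OCR/tessdata")
print(config)
```

The resulting string would then be passed as pytesseract.image_to_string(team1, config=config, lang='owf'). Which psm works best for these screenshots is something to test on your own images; 6 and 11 are just the two modes suggested in this thread.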