
How to improve handwritten text recognition using pytesseract in small images?

I want to use the pytesseract library for handwritten text recognition, reading a single numerical character from images that average 43 * 45 pixels. Here are some sample images:
[Figure 1] [Figure 2] [Figure 3]

Expected result:

9
1
4

I want to get a single numerical character from each image.

I tried the code below:

import pytesseract

# loop through images
print(pytesseract.image_to_string("text.jpg", config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789'))

In practice, though, the accuracy is below 50%, sometimes much lower: some digits are read correctly, some images are read as 2 characters, and some are not read at all.
When I remove the -c tessedit_char_whitelist=0123456789 option, I get the characters 4, \, and the letter g.
How can I make pytesseract treat each image as a single numerical character only, instead of relying on a whitelist that still reads the text as alphanumeric?

PS: I know that OCR can't be 100% accurate, but the accuracy can at least be improved.

According to this GitHub issue, Tesseract 4.0 does not support character whitelists with the LSTM model. You can fix this by upgrading Tesseract to version 4.1, rather than switching to the legacy model (i.e., via the --oem flag).
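As a rough sketch of that version check, the snippet below picks a config string based on the installed Tesseract version. The helper names (parse_version, digit_config, read_digit) are hypothetical. Falling back to the legacy engine (--oem 0) on 4.0 is an assumption about what you might do when upgrading is not an option, and it only works if your traineddata file includes the legacy model.

```python
import re

# Same whitelist option as in the question.
WHITELIST = "-c tessedit_char_whitelist=0123456789"


def parse_version(version_text):
    """Return (major, minor) from a string such as '4.1.1' or 'tesseract 5.3.0'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def digit_config(version_text):
    """Build a single-character, digits-only config for the given Tesseract version."""
    if parse_version(version_text) == (4, 0):
        # Assumption: on 4.0 the LSTM engine ignores the whitelist,
        # so this sketch falls back to the legacy engine instead.
        return f"--psm 10 --oem 0 {WHITELIST}"
    # Otherwise keep the default engine selection used in the question.
    return f"--psm 10 --oem 3 {WHITELIST}"


def read_digit(image_path):
    """Run OCR on one image using a config chosen for the installed Tesseract."""
    import pytesseract  # imported lazily so the helpers above work without it

    version_text = str(pytesseract.get_tesseract_version())
    text = pytesseract.image_to_string(image_path, config=digit_config(version_text))
    return text.strip()
```

Calling read_digit("text.jpg") would then use the whitelist with the default engine on 4.1 or later, and with the legacy engine on 4.0.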

Alternatively, you could try the config='digits' option, as proposed by Robert Harris in this answer, to force pytesseract to return only digits.
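A minimal sketch of that option follows, assuming the digits config file that normally ships in Tesseract's tessdata/configs directory is present on your system. The function name read_digits and its ocr parameter are hypothetical. The parameter exists only so the function can be tried with a stand-in when Tesseract is not installed. Combining --psm 10 with digits in one config string is also an assumption carried over from the question, not something the linked answer specifies.

```python
def read_digits(image_path, ocr=None):
    """OCR a single-character image with Tesseract's 'digits' config file."""
    if ocr is None:
        import pytesseract  # imported lazily; requires Tesseract to be installed

        ocr = pytesseract.image_to_string
    # '--psm 10' treats the image as a single character (as in the question);
    # 'digits' names the config file that restricts output to numeric characters.
    raw = ocr(image_path, config="--psm 10 digits")
    # Strip trailing whitespace such as the newline Tesseract appends.
    return raw.strip()
```

For example, read_digits("text.jpg") runs the real OCR, while passing a stand-in callable as ocr lets you exercise the function without Tesseract.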

This blog article proposes writing a Python function that uses a simple regex to extract all numbers from the OCR output, instead of juggling several flags and versions.
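The regex idea could look roughly like the sketch below. It is not the blog's actual code: the function names extract_digits and first_digit are hypothetical, and the input is whatever string pytesseract.image_to_string returned.

```python
import re


def extract_digits(ocr_text):
    """Return every ASCII digit found in the OCR output, in order."""
    return re.findall(r"[0-9]", ocr_text)


def first_digit(ocr_text):
    """Return the first digit in the OCR output, or None if there is none."""
    digits = extract_digits(ocr_text)
    return digits[0] if digits else None
```

With the unfiltered output described in the question, first_digit("4\\g") would keep the 4 and discard the stray backslash and letter. This only filters the text after recognition; it does not make Tesseract any better at recognizing the digit in the first place.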

