Tesseract unable to read the digits in an image (Captcha)

I have this image (Unprocessed Image). With the code below, I was able to convert it to this (Processed Image).

The image contains the number 8276, but my code reads it as 776.

How can I make my code read it correctly as 8276? I am very new to image processing, cv2, and pytesseract, and only got this far after a lot of searching.

import cv2
import pytesseract
from PIL import Image

# Point pytesseract at the local Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\hamza.rana\AppData\Local\Tesseract-OCR\tesseract.exe'

# Load the captcha and convert it to grayscale
image = cv2.imread('captcha.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarize with Otsu's threshold, smooth with a median blur, then apply an adaptive threshold
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
gray = cv2.medianBlur(gray, 3)
gray = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
# Save the processed image, then OCR it as a single raw line (--psm 13) restricted to digits
filename = "{}.png".format("temp")
cv2.imwrite(filename, gray)
text = pytesseract.image_to_string(Image.open(filename), config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)

Tesseract's out-of-the-box training works best on printed typefaces. In my experience it does poorly on hand printing, and you can forget about it for longhand script.

One thing that helps slightly when the characters sit tight against the image edges is to expand the border by a few pixels. But starting with a messy captcha... that's something you might have to train a model for.
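As a rough sketch of the border idea, the snippet below pads a binarized image with a white margin before it would be handed to Tesseract. The helper name add_border and its pad and value parameters are made up for illustration, and it assumes dark digits on a white background (use value=0 for the inverse). It uses numpy.pad, which works directly on the arrays returned by cv2.imread and cv2.threshold. cv2.copyMakeBorder with cv2.BORDER_CONSTANT should give the same result. A synthetic array stands in for the real captcha.

```python
import numpy as np

def add_border(binary_img, pad=10, value=255):
    """Pad a 2-D grayscale/binary image with a constant-colored margin.

    Hypothetical helper: `pad` is the margin width in pixels on every side,
    `value` is the fill color (255 = white, assuming dark text on white).
    """
    return np.pad(binary_img, pad_width=pad, mode="constant", constant_values=value)

# Stand-in for a thresholded captcha: a white image with black "ink"
# touching the left edge, which is the tight case the padding is meant to help.
img = np.full((20, 60), 255, dtype=np.uint8)
img[0:20, 0:5] = 0

padded = add_border(img, pad=10)
print(padded.shape)  # height and width each grow by 2 * pad
```

In the question's pipeline, the padded array could be written out with cv2.imwrite in place of gray. Whether this alone is enough to recover the missing digit is not something the padding guarantees.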
