
Why does pytesseract fail to recognise digits in an image with a darker background?

I have this Python code, which I use to convert text in a picture to a string. It works for some images with large characters, but not for the one I'm trying now, which contains only digits.

This is the picture:

[image: digits]

This is my code:

import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
img = Image.open('img.png')
result = pytesseract.image_to_string(img)
print(result)

Why does it fail to recognise this specific image, and how can I solve this problem?

I have two suggestions.

First, and by far the most important: in OCR, preprocessing the image is key to good results. Your images look very clean, so in principle they shouldn't cause problems, but since they do, I suggest binarizing them:

import cv2
from PIL import Image

# cv2.imread returns a 3-channel BGR array by default,
# so convert to grayscale before thresholding
img = cv2.imread('gradient.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 180  # to be determined for your image
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)

Then run the OCR again on the binarized image.

The grayscale conversion is there because cv2.imread loads images as three-channel BGR by default, even when the file itself is grayscale.

This is simple (global) thresholding. Adaptive thresholding also exists, but it tends to produce noisy output and adds nothing in your case.

Binarized images are much easier for Tesseract to handle. Tesseract already binarizes internally ( https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality ), but that step can go wrong, and doing your own preprocessing is very often useful.

You can check whether the threshold value is right by looking at the images:

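import matplotlib.pyplot as plt
# Show the original and the binarized image side by side
fig, (ax_orig, ax_bin) = plt.subplots(1, 2)
ax_orig.imshow(img, cmap='gray')
ax_bin.imshow(img_binarized, cmap='gray')
plt.show()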
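As a rough numeric complement to eyeballing the plots, you can also look at how the share of white pixels changes with the threshold. This is only a sketch: the synthetic `gray` array, the helper name `white_fraction`, and the candidate values are all assumptions for illustration, and the comparison `gray > threshold` mirrors the rule `cv2.THRESH_BINARY` applies.

```python
import numpy as np

def white_fraction(gray, threshold):
    """Share of pixels that simple thresholding would turn white."""
    # Same rule as cv2.THRESH_BINARY: strictly greater than the threshold
    return float((gray > threshold).mean())

# Synthetic stand-in for a grayscale image (assumption): dark background (40)
# with a bright block (200) covering one quarter of the area.
gray = np.full((40, 40), 40, dtype=np.uint8)
gray[:20, :20] = 200

for t in (30, 120, 180, 220):  # hypothetical candidate thresholds
    print(t, white_fraction(gray, t))
```

A threshold that turns nearly everything white or nearly everything black is probably too low or too high; a range where the fraction barely moves is usually a reasonable place to pick from.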

Second, if the above still doesn't work, I suggest you try tesserocr. I know this doesn't answer "why doesn't pytesseract work here", but it is a maintained Python wrapper for Tesseract.

You could try:

import tesserocr
text_from_ocr = tesserocr.image_to_text(pil_img)

Here is the documentation for tesserocr on PyPI: https://pypi.org/project/tesserocr/

And for OpenCV: https://pypi.org/project/opencv-python/

As a side note, black and white are treated symmetrically in Tesseract, so having white digits on a black background is not a problem.
