
How to improve OCR with Pytesseract text recognition?

Hi, I am looking to improve my digit-recognition results with pytesseract.

I take my raw image and split it into parts that look like this:

[Image 1]

The size can vary.

To this I apply some pre-processing methods, like so:

image = cv2.imread(im, cv2.IMREAD_GRAYSCALE)
image = cv2.GaussianBlur(image, (1, 1), 0)
kernel = np.ones((5, 5), np.uint8)
result_img = cv2.blur(image, (2, 2), 0)
result_img = cv2.dilate(result_img, kernel, iterations=1)
result_img = cv2.erode(result_img, kernel, iterations=1)

and I get this:

[Image 2]

I then pass this to pytesseract:

num = pytesseract.image_to_string(result_img, lang='eng',
                                     config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

However, this is not good enough for me, and it often gets numbers wrong.

I am looking for ways to improve. I have tried to keep this minimal and self-contained, but let me know if I've not been clear and I will elaborate.

Thank you.

You're on the right track in preprocessing the image before performing OCR, but the approach is incorrect. There is no reason to dilate or erode the image, since these operations are mainly used for removing small noise particles. In addition, your current output is not a binary image. It may look like it only contains black and white pixels, but it is actually a 3-channel BGR image, which is probably why you're getting incorrect OCR results.

If you look at Tesseract improve quality, you will notice that for Pytesseract to perform optimal OCR, the image needs to be preprocessed so that the text to detect is in black with the background in white. To do this, we can apply Otsu's threshold to obtain a binary image, then invert it so the text is in the foreground. The resulting preprocessed image can then be passed to image_to_string. We use the --psm 6 configuration option to assume a single uniform block of text. Take a look at the configuration options for more settings. Here are the results:

Input image -> Binary -> Invert

[Images: input, binary, inverted]

Result from Pytesseract OCR

8

Code

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image, grayscale, Otsu's threshold, invert
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
invert = 255 - thresh

# OCR
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('invert', invert)
cv2.waitKey()
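
To make the thresholding step less of a black box, here is a minimal NumPy-only sketch of what Otsu's method computes and how the result can be normalised to black text on a white background. The function names otsu_threshold and black_text_on_white are hypothetical, and the polarity check assumes the background covers more pixels than the text. In practice cv2.threshold with cv2.THRESH_OTSU does the thresholding for you; this is only meant to illustrate the idea.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return an Otsu threshold t for a uint8, single-channel image.

    Pixels <= t form one class and pixels > t the other; t is chosen to
    maximise the between-class variance of the two classes.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256, dtype=np.float64)

    weight_low = np.cumsum(hist)               # number of pixels with value <= t
    weight_high = hist.sum() - weight_low      # number of pixels with value > t
    sum_low = np.cumsum(hist * levels)         # intensity sum of pixels <= t
    sum_all = sum_low[-1]

    # Class means are only defined where both classes are non-empty.
    valid = (weight_low > 0) & (weight_high > 0)
    mean_low = np.zeros(256)
    mean_high = np.zeros(256)
    mean_low[valid] = sum_low[valid] / weight_low[valid]
    mean_high[valid] = (sum_all - sum_low[valid]) / weight_high[valid]

    # Between-class variance (up to a constant factor); 0 where invalid.
    between = weight_low * weight_high * (mean_low - mean_high) ** 2
    return int(np.argmax(between))


def black_text_on_white(gray: np.ndarray) -> np.ndarray:
    """Binarise a grayscale image so the minority class (text) is black."""
    t = otsu_threshold(gray)
    binary = np.where(gray > t, 255, 0).astype(np.uint8)
    # Assumption: the background occupies more than half of the pixels,
    # so if white is the minority the image is inverted.
    if np.count_nonzero(binary == 255) < binary.size / 2:
        binary = 255 - binary
    return binary


if __name__ == "__main__":
    # Synthetic example: a light stroke on a dark background.
    demo = np.full((20, 20), 30, dtype=np.uint8)
    demo[8:12, 5:15] = 220
    out = black_text_on_white(demo)
    print(out.shape, np.unique(out))
```

The array returned by black_text_on_white is single-channel and contains only 0 and 255, so it could be passed to image_to_string in place of invert in the code above.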
