Python: Pytesseract extracts incorrect text from an image

Question

I am using the following Python code to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)

    #  Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...

    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove template file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

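But the output is incorrect. For the input files:

The output received is "0001" instead of "D001"

The output received is "3001" instead of "B001"

What code changes are needed to retrieve the correct characters from the image, and how can pytesseract be trained to return the correct characters for all font types in the image (including bold characters)?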

Answer 1

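@Maaaaa pointed out the exact reason why Tesseract fails to recognize the text correctly.

However, you can still improve the final result by applying some post-processing steps to the Tesseract output. If it helps, here are a few points you can consider:

Try disabling the dictionary check through Tesseract's input parameters.
Use heuristics based on your dataset. From the sample images in the question, I guess the first character of every word/sequence is a letter, so you can replace a leading digit in the output with the most likely letter for your dataset. For example, "0" can be replaced with "D", so "0001" -> "D001", and likewise for other cases.
Tesseract also provides character-level recognition confidence values, so use that information to replace a character with the candidate that has the highest confidence.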
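
The first two points can be sketched roughly as follows. This is a minimal illustration, not a tested fix for the asker's images: the function name `fix_leading_char` and the table `LEADING_DIGIT_TO_LETTER` are hypothetical, and only the "0" -> "D" and "3" -> "B" pairs come from the question. The config string uses Tesseract's `load_system_dawg` and `load_freq_dawg` variables; how much they help depends on the Tesseract version and engine in use.

```python
# Assumed config for turning off the dictionary word lists; it would be passed
# as pytesseract.image_to_string(image, config=NO_DICT_CONFIG).
NO_DICT_CONFIG = "-c load_system_dawg=0 -c load_freq_dawg=0"

# Digit -> letter substitutions applied to the first character only.
# '0' -> 'D' and '3' -> 'B' are the two confusions reported in the question;
# extend the table only with confusions observed in your own data.
LEADING_DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_char(token, table=LEADING_DIGIT_TO_LETTER):
    """Replace a misread leading digit with its most likely letter."""
    token = token.strip()
    if token and token[0] in table:
        return table[token[0]] + token[1:]
    return token


if __name__ == "__main__":
    for raw in (" 0001", " 3001", "D001"):
        print(repr(raw), "->", fix_leading_char(raw))
```

The substitution is deliberately limited to the first character, since the remaining characters in the samples are genuine digits and must not be rewritten.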

Answer 2

Try different configuration parameters in the line below:

result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

like this:

result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')

Try changing the psm value and compare the results.
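
One way to do that comparison is a small loop over several page segmentation modes. This is only a sketch: the helper name `compare_psm_modes`, the default list of modes, and the injectable `ocr` argument are assumptions made for illustration, and which mode works best depends on the image.

```python
def compare_psm_modes(image, psm_values=(6, 7, 8, 10, 13), ocr=None):
    """Run OCR once per page segmentation mode and collect the outputs."""
    if ocr is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_values:
        # Same config format as in the line above, with only the psm value varied.
        config = "--psm {} --oem 3".format(psm)
        results[psm] = ocr(image, config=config).strip()
    return results


# Usage, assuming the image from the question:
#   from PIL import Image
#   for psm, text in compare_psm_modes(Image.open(img_path)).items():
#       print(psm, repr(text))
```

For short codes such as "D001", modes 7 (single text line) and 8 (single word) are usually worth trying first.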

- Good luck -
