
Python - Pytesseract extracts incorrect text from image

I used the code below in Python to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)

    #  Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...

    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove template file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

But the output is incorrect. The input file is:

[image]

The output received is '0001' instead of 'D001'.

[image]

The output received is '3001' instead of 'B001'.

What code changes are required to retrieve the right characters from the image, and how can pytesseract be trained to return the right characters for all font types in an image (including bold characters)?

@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.

You can still improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider and use if they help:

  1. Try disabling the dictionary check feature in the Tesseract input parameters.
  2. Use heuristic information from your dataset. From the sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace a leading digit in the output with the most probable letter based on your dataset. For example, '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
  3. Tesseract also provides a character-level recognition confidence value, so use that information to replace characters with the candidate that has the highest confidence value.
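
Points 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the names DIGIT_TO_LETTER, fix_leading_digit and low_confidence_words are hypothetical, the lookup table contains only the two confusions reported in the question, and the confidence helper assumes the dictionary shape that pytesseract.image_to_data returns with output_type=pytesseract.Output.DICT (parallel 'text' and 'conf' lists, which are word-level rather than character-level confidences).

```python
# Hypothetical lookup table: digits that Tesseract returned in place of a
# visually similar capital letter. Only '0' -> 'D' and '3' -> 'B' come from
# the question; extend the table from your own dataset.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_digit(text, table=DIGIT_TO_LETTER):
    """Replace a leading digit with its look-alike letter, if one is known."""
    text = text.strip()
    if text and text[0] in table:
        return table[text[0]] + text[1:]
    return text


def low_confidence_words(data, threshold=60.0):
    """Return (word, confidence) pairs whose confidence is below threshold.

    `data` is assumed to look like the output of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT):
    parallel 'text' and 'conf' lists, with conf == -1 for non-text rows.
    The threshold of 60 is an arbitrary starting point, not a recommendation.
    """
    flagged = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions return strings or floats
        if word.strip() and 0 <= conf < threshold:
            flagged.append((word, conf))
    return flagged


print(fix_leading_digit("0001"))
print(fix_leading_digit("3001"))
```

The substitution is only safe when the dataset guarantees that the first character is a letter, and a digit such as '0' may also stand for 'O' in other data, so the table has to be built from real samples. The confidence helper only flags words that deserve a second look; it does not choose a replacement by itself.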

Try different config parameters in the line below:

result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

For example:

result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')

[image]

Try changing the psm value and compare the results.
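
One way to compare several psm values side by side is a small loop like the sketch below. The helper name compare_psm and its ocr parameter are hypothetical; by default it calls pytesseract.image_to_string, and the chosen modes (6 block, 7 single line, 8 single word, 10 single character, 13 raw line) are just a plausible starting set for short codes such as 'D001'.

```python
def compare_psm(image, psm_values=(6, 7, 8, 10, 13), ocr=None):
    """Run OCR once per page segmentation mode and collect the results.

    `ocr` is any callable with the signature ocr(image, config=...); it
    defaults to pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_values:
        config = "--psm {} --oem 3".format(psm)
        results[psm] = ocr(image, config=config).strip()
    return results


# Example usage (requires Pillow, pytesseract and a Tesseract install):
# from PIL import Image
# for psm, text in compare_psm(Image.open("test.jpg")).items():
#     print(psm, repr(text))
```

Printing the results with repr makes stray whitespace and empty outputs visible, which helps when deciding which mode suits the images.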

-- Good Luck --
