Python: Pytesseract extracts incorrect text from an image

Question

I am using the following code in Python to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write the image after noise removal
    cv2.imwrite(src_path + "removed_noise.png", img)

    # Apply a threshold to get a pure black-and-white image (currently disabled)
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after applying OpenCV to do some ...

    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with Tesseract (note: this reads the original file, not thres.png)
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove temporary file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

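However, the output is incorrect. The input files are:

The output received is "0001" instead of "D001".

The output received is "3001" instead of "B001".

What code changes are needed to retrieve the correct characters from the image, and how can pytesseract be trained to return the correct characters for all font types in the image (including bold characters)?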

Answer 1

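@Maaaaa has pointed out the exact reason why Tesseract fails to recognize the text correctly.

However, you can still improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points you could consider, if they help: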

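Try disabling the dictionary check in the Tesseract input parameters.
Use heuristics based on your dataset. From the sample images in the question, I guess that the first character of every word/sequence is a letter, so you can replace a leading digit in the output with the most likely letter for your dataset. For example, "0" can be replaced with "D", so '0001' -> 'D001', and likewise for the other cases.
Tesseract also provides character-level recognition confidence values, so use that information to replace a character with the candidate that has the highest confidence.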
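
As a rough sketch of the first two points: the block below turns off Tesseract's word-list dictionaries through the `load_system_dawg` and `load_freq_dawg` config variables, then swaps a leading digit for a letter. The mapping `DIGIT_TO_LETTER` and the helper names `fix_leading_char` and `ocr_with_postprocessing` are made up for illustration. The mapping contains only the two confusions reported in the question, and how much the dictionary switches help may depend on the Tesseract version and engine in use.

```python
# Assumed mapping from digits that Tesseract outputs to the letters they most
# likely stand for in this dataset. Only the two pairs reported in the
# question are included; extend it from your own data.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}

# Tesseract config variables that switch off the word-list dictionaries.
NO_DICT_CONFIG = "-c load_system_dawg=0 -c load_freq_dawg=0"


def fix_leading_char(text, mapping=DIGIT_TO_LETTER):
    """Replace a leading digit with its mapped letter, token by token.

    Tokens are split on whitespace and re-joined with single spaces, so the
    original spacing and line breaks are not preserved.
    """
    fixed = []
    for token in text.split():
        if token[0] in mapping:
            token = mapping[token[0]] + token[1:]
        fixed.append(token)
    return " ".join(fixed)


def ocr_with_postprocessing(img_path):
    """Run Tesseract without dictionaries, then apply the heuristic."""
    # Imported here so fix_leading_char can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_string(Image.open(img_path), config=NO_DICT_CONFIG)
    return fix_leading_char(raw)
```

This only makes sense if every sequence in the data really does start with a letter; a genuine leading "0" or "3" would be corrupted by the same rule.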

Answer 2

Try different configuration parameters in the following line:

result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

For example:

result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')

Try changing the psm value and compare the results.
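
One way to do that comparison is to loop over a few page segmentation modes and collect the output of each. This is only a sketch: the helper name `compare_psm`, its `ocr` parameter, and the default list of modes (6 = single uniform block, 7 = single text line, 8 = single word, 10 = single character, 13 = raw line) are choices made for illustration, not something taken from the question.

```python
def compare_psm(image, psm_values=(6, 7, 8, 10, 13), oem=3, ocr=None):
    """Run OCR once per page segmentation mode and return {psm: text}.

    `ocr` is a hypothetical hook so the loop can be tried without Tesseract;
    by default it is pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the function can be exercised with a stand-in.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_values:
        config = "--psm {} --oem {}".format(psm, oem)
        results[psm] = ocr(image, config=config).strip()
    return results
```

With Pillow installed, something like `compare_psm(Image.open(img_path))` returns a dictionary that can be printed mode by mode to see which setting reads "D001" and "B001" correctly.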

- Good luck -

Answer 1
2 2018-04-13 06:41:57

Answer 2
0 2018-08-22 10:27:30