pytesseract: digits-only recognition not working with tesseract 4.0

Question

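Has anyone managed to get digits-only output from the latest tesseract 4.0 when calling it from Python?

The code below works in 3.05, but in 4.0 it still returns letters. I tried deleting every config file except the digits file, but it still doesn't work. Any help would be great:

im is an image of a date, black text on a white background: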

import pytesseract
im =  imageOfDate
im = pytesseract.image_to_string(im, config='outputbase digits')
print(im)

Answer 1

You can specify the digits in tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng', boxes=False, \
           config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Hope this helps.
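The config string can also be assembled by a small helper, which makes it easier to experiment with the page segmentation mode. This is only a sketch: build_digits_config and read_digits are hypothetical names, not part of pytesseract. Note that --psm 10 treats the image as a single character, so for an image holding a whole date, --psm 7 (a single text line) may be a better starting point; that default is an assumption, not something the answer prescribes.

```python
def build_digits_config(whitelist="0123456789", psm=7, oem=3):
    """Build a tesseract config string that restricts output to `whitelist`.

    Hypothetical helper. The whitelist must not contain spaces, because
    the config string is split on whitespace before it reaches tesseract.
    """
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def read_digits(image, **config_options):
    """Run OCR on `image` (PIL image or numpy array) with a digit whitelist."""
    # Imported lazily so build_digits_config can be used without pytesseract.
    import pytesseract

    config = build_digits_config(**config_options)
    return pytesseract.image_to_string(image, lang="eng", config=config)


if __name__ == "__main__":
    print(build_digits_config())
    print(build_digits_config(psm=10))
```

Calling read_digits(image, psm=10) reproduces the config used in the answer above.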

Answer 2

Using the tessedit_char_whitelist flag in pytesseract did not work for me. One workaround, however, is to use a flag that does work, namely config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

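Here pixels is a numpy array of the image (a PIL image should work too). This should force pytesseract to return only digits. To customize what it returns, locate your digits config file. On Windows, mine is located here:

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want. After you save it and run pytesseract again, it should return only those custom characters.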

Answer 3

You can specify the digits in tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Answer 4

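As you can see in this GitHub issue, the blacklist and whitelist do not work in tesseract version 4.0.

There are 3 possible solutions to this problem, as I describe in this blog post:

Update tesseract to version 4.1 or later
Use the legacy mode as described in @thewaywewere's answer

Create a python function that uses a simple regular expression to extract all the digits: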
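To decide at runtime whether a workaround is needed, the installed version can be checked first. The following is a minimal sketch, assuming the version is available as text such as the first line printed by tesseract --version; parse_version and whitelist_needs_workaround are hypothetical helper names.

```python
import re


def parse_version(version_text):
    """Extract (major, minor) from text such as 'tesseract 4.0.0-beta.1'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def whitelist_needs_workaround(version_text):
    """Return True for the 4.0 series, where the whitelist is reported broken."""
    return parse_version(version_text) == (4, 0)


if __name__ == "__main__":
    for text in ("tesseract 3.05.01", "tesseract 4.0.0-beta.1", "tesseract 4.1.1"):
        print(text, whitelist_needs_workaround(text))
```

With pytesseract installed, str(pytesseract.get_tesseract_version()) should give a suitable version string to pass in.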

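import re
import pytesseract

def replace_chars(text):
    # Collect every run of digits and join them into a single string
    list_of_numbers = re.findall(r'\d+', text)
    result_number = ''.join(list_of_numbers)
    return result_number

# Run OCR without a whitelist, then filter the result
result_number = replace_chars(pytesseract.image_to_string(im))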
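Because the image in the question holds a date, stripping everything but digits also discards the separators. A possible variant is to apply the whitelist in Python after OCR. This is a sketch only: keep_allowed is a hypothetical helper, and the sample string stands in for real OCR output.

```python
def keep_allowed(text, allowed="0123456789"):
    """Return `text` with every character not in `allowed` removed."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


if __name__ == "__main__":
    raw = "Date: 05/10/2017\n"  # hypothetical OCR output
    print(keep_allowed(raw))                  # 05102017
    print(keep_allowed(raw, "0123456789/"))   # 05/10/2017
```

Unlike a real whitelist, this cannot stop the engine from misreading a digit as a letter; it only drops unwanted characters afterwards.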


Answer 1
15 2017-10-05 15:38:09

Answer 2
11 2019-03-06 19:31:27

Answer 3
4 2020-06-02 21:35:27

Answer 4
3 2020-03-29 21:24:52
