Pytesseract does not recognize decimal points

Question

I am trying to read the text in this image, which also contains decimal points and decimal numbers, like this:

import cv2
import pytesseract
img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

What I get is:

73-82
Primo: 50 —

I also tried specifying Italian as the language, but the result is very similar:

73-82 _
Primo: 50

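While searching other questions on Stack Overflow, I found that the reading of decimal numbers can be improved with a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any ideas on how to improve the reading of decimal numbers?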
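As a rough sketch of the whitelist idea mentioned above, the config string can allow letters as well as digits and the decimal point. The helper name build_whitelist_config and its parameters are made up for illustration; only the tessedit_char_whitelist variable and the --psm flag come from Tesseract itself, and whether the whitelist is honoured may depend on the Tesseract version and OCR engine in use.

```python
import string


def build_whitelist_config(extra_chars="-:.", psm=None):
    """Return a Tesseract config string that whitelists letters, digits and extra_chars.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    allowed = string.ascii_letters + string.digits + extra_chars
    config = f"-c tessedit_char_whitelist={allowed}"
    if psm is not None:
        # Optionally append a page segmentation mode
        config += f" --psm {psm}"
    return config


config = build_whitelist_config()
# Assumed usage, with `img` loaded via cv2.imread as in the question:
# text = pytesseract.image_to_string(img, config=config)
```

Keeping the letters in the whitelist means the words in the image are not filtered out along with the unwanted symbols.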

Answer 1

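I suggest passing each line of text to Tesseract as a separate image.
For some reason, that seems to solve the decimal point problem...

Convert the image from grayscale to black and white using cv2.threshold.
Apply the cv2.dilate morphological operation with a very long horizontal kernel (merging blobs along the horizontal direction).
Find contours with cv2.findContours: each merged row ends up in a separate contour.
Find the bounding boxes of the contours.
Sort the bounding boxes by their y coordinate.
Iterate over the bounding boxes and pass each slice to pytesseract.

Here is the code: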

import numpy as np
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows

path_to_image = 'image.png'

img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale

# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))


# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]  # Use index [-2] to be compatible to OpenCV 3 and 4

# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]

# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])


# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b

    if (h > 10) and (w > 10):
        # Crop a padded slice and invert black and white (tesseract prefers black text).
        # Named row_img to avoid shadowing the built-in `slice`.
        row_img = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]

        text = pytesseract.image_to_string(row_img, config="-c tessedit"
                                                            "_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                                                            " --psm 3")

        print(text)

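I know this is not the most general solution, but it manages to solve the sample you posted.
Please treat the answer as a conceptual solution: finding a robust solution can be very challenging.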
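The clamped slicing used when cropping each row can be factored into a small helper, which makes the padding easier to adjust. This is only a sketch: the name padded_bounds and the sample box and image size are made up for illustration and are not part of the answer's code.

```python
def padded_bounds(box, pad, image_shape):
    """Expand an (x, y, w, h) box by `pad` pixels, clamped to the image borders.

    Returns (top, bottom, left, right), suitable for NumPy-style slicing.
    """
    x, y, w, h = box
    img_h, img_w = image_shape[:2]
    top = max(y - pad, 0)
    bottom = min(y + h + pad, img_h)
    left = max(x - pad, 0)
    right = min(x + w + pad, img_w)
    return top, bottom, left, right


# Made-up example: a 100x200 (rows x columns) image and a box near the top-left corner
top, bottom, left, right = padded_bounds((5, 3, 50, 20), pad=10, image_shape=(100, 200))
print(top, bottom, left, right)  # 0 33 0 65
```

With the names from the code above, the crop would then read 255 - thresh[top:bottom, left:right].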

Results:

Thresholded image after dilation:

First slice:

Second slice:

Third slice:

Output text:

7.3-8.2

Primo:50
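A quick way to check whether the decimal points survived OCR is to search the returned text for digit-dot-digit patterns. This is only an illustrative sketch: the helper name find_decimals is hypothetical, and the sample strings are the outputs quoted in this thread.

```python
import re


def find_decimals(text):
    """Return all substrings that look like decimal numbers (digits, a dot, digits)."""
    return re.findall(r"\d+\.\d+", text)


print(find_decimals("7.3-8.2\n\nPrimo:50"))  # ['7.3', '8.2']
print(find_decimals("73-82\nPrimo: 50"))     # []
```

An empty list for an image known to contain decimals suggests the dots were dropped, as in the original output from the question.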

Answer 2

You can recognize the text easily by down-sampling the image.

If you down-sample by a factor of 0.5, the result will be:

Now, if you read it:

7.3 - 8.2
Primo: 50

I got this result using pytesseract version 0.3.7 (the current version).

Code:

# Load the libraries
import cv2
import pytesseract

# Load the image
img = cv2.imread("s9edQ.png")

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)

# OCR
txt = pytesseract.image_to_string(gry)
print(txt)

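Explanation:

The input image contains a small artifact; you can see it on the right side of the image. The down-sampled image, on the other hand, is very well suited to OCR. You need to use preprocessing methods when the data in the image is not visible or is corrupted. Please read the following:

Image processing
Page segmentation modes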
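To experiment with page segmentation modes, one option is to run the OCR once per mode and compare the outputs. The sketch below is an assumption-laden illustration: the helper name ocr_with_psm_modes, its ocr parameter, and the choice of modes 3, 6 and 7 are made up for this example; only pytesseract.image_to_string and the --psm flag are real.

```python
def ocr_with_psm_modes(image, psm_modes=(3, 6, 7), ocr=None):
    """Run OCR once per page segmentation mode and return {psm: text}.

    Hypothetical helper. `ocr` is any callable taking (image, config=...);
    it defaults to pytesseract.image_to_string.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_modes:
        # 3 = automatic page segmentation, 6 = single uniform block, 7 = single text line
        results[psm] = ocr(image, config=f"--psm {psm}").strip()
    return results


# Assumed usage, with `gry` being the down-sampled grayscale image from the code above:
# for psm, text in ocr_with_psm_modes(gry).items():
#     print(psm, repr(text))
```

Passing the OCR function as a parameter is only a convenience for trying the loop with a stub; in normal use the default is enough.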

Answer 1: score 3, accepted, 2021-03-06 22:58:27
Answer 2: score 1, 2021-03-07 12:43:28
