Pytesseract (Tesseract OCR) is not picking up some numbers

Question

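I have been developing a program that uses optical character recognition to read financial statements, and for the life of me I cannot figure out why the open-source module I am using still fails to read certain numbers.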

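I create an output file that draws a green box around each piece of text detected in the original input. In this case the line containing "381" is picked up, but the line below it, which has exactly the same formatting, is ignored.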
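For anyone wanting to reproduce this kind of debug overlay, here is a minimal sketch of one way it could be done. It is not the exact code used for the output described above. The helper names `confident_boxes` and `draw_boxes`, the `min_conf` threshold, and the default `psm` value are all assumptions made for illustration.

```python
def confident_boxes(data, min_conf=0.0):
    """Return (left, top, width, height, text) for each non-empty word in a
    pytesseract image_to_data dict whose confidence is at least min_conf."""
    boxes = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as int, float, or str depending on the pytesseract
        # version; rows that are not words carry a confidence of -1.
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i], text))
    return boxes


def draw_boxes(image_path, out_path, psm=3):
    """Run Tesseract on image_path and save a copy with green word boxes."""
    # Imported lazily so confident_boxes() can be used without these packages.
    import cv2
    import pytesseract
    from pytesseract import Output

    img = cv2.imread(image_path)
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
    data = pytesseract.image_to_data(rgb, config=f"--psm {psm}",
                                     output_type=Output.DICT)
    for x, y, w, h, _ in confident_boxes(data):
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite(out_path, img)
```

Any number that ends up without a box was either never segmented as text or was returned as an empty string, which helps separate a layout-analysis miss from a recognition miss.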

I preprocess the image with the following code before extracting data. The miss rate used to be as high as 20%, and it is now closer to 5%.

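import cv2
import numpy as np

img = cv2.imread(filename)  # filename: path to the scanned statement
# Upscale by 20% with bicubic interpolation
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Dilate, then erode; note that a 1x1 kernel leaves the pixels unchanged
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)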

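After this preprocessing I also run an algorithm that removes solid lines above a certain size from the document. In this case, though, neither "35" nor "381" is underlined in the original file, so I doubt that is what is causing the problem. I have also verified that the line-detection algorithm is not removing the top of the 5.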

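I am not an expert in OCR or CV; my background is more in data and general programming. I really just need this library to do the job it advertises so that I can move on and finish the program. Does anyone know what might be causing this?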

Answer 1

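I suggest setting your configuration to a specific page segmentation mode (PSM), such as 11, since you are looking for sparse text. For example, my code has: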

results = pytesseract.image_to_data(Image.open(tempFile), lang='eng', config='--psm 11', output_type=Output.DICT)

The page segmentation modes are:

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
                        bypassing hacks that are Tesseract-specific.
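If it is not obvious which mode suits a given page, one option is to run several of them and compare how many words each one reports. The following is only a rough sketch: the helper names `psm_config` and `count_words_per_psm` and the default list of modes are my own choices, not part of pytesseract.

```python
def psm_config(psm, extra=""):
    """Build a Tesseract config string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm} {extra}".strip()


def count_words_per_psm(image_path, psms=(3, 4, 6, 11)):
    """Return {psm: number of non-empty words reported} for one image."""
    # Imported lazily so psm_config() can be used without these packages.
    import pytesseract
    from PIL import Image
    from pytesseract import Output

    image = Image.open(image_path)
    counts = {}
    for psm in psms:
        data = pytesseract.image_to_data(image, lang="eng",
                                         config=psm_config(psm),
                                         output_type=Output.DICT)
        counts[psm] = sum(1 for text in data["text"] if text.strip())
    return counts
```

A higher word count does not guarantee better results, so it is still worth checking the modes that look promising against a page where the correct numbers are known.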

There is also a way to search for digits only rather than general text, which may help as well.
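One way to do the digits-only search is Tesseract's `tessedit_char_whitelist` variable, passed with `-c`. Treat the following as a sketch under assumptions: the helper names `digits_config` and `read_numbers` are hypothetical, the extra characters are a guess at what a financial statement needs, and as far as I know the LSTM engine in Tesseract 4.0 ignores the whitelist while 4.1 and later honor it.

```python
def digits_config(psm=11, extra_chars=".,-"):
    """Build a config string that restricts recognition to digits plus
    a few extra characters (decimal point, thousands separator, minus)."""
    # pytesseract splits the config string shell-style, so whitespace or
    # quote characters inside the whitelist would break the argument.
    if any(ch.isspace() or ch in "\"'\\" for ch in extra_chars):
        raise ValueError("extra_chars must not contain whitespace, quotes, or backslashes")
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789{extra_chars}"


def read_numbers(image_path, psm=11):
    """Return the whitespace-separated tokens Tesseract finds in the image."""
    # Imported lazily so digits_config() can be used without these packages.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open(image_path),
                                       config=digits_config(psm))
    return text.split()
```

Restricting the character set can force letter-like shapes into digits, so it is safest on regions that are known to contain only numbers.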

Solution 1 (4, accepted, 2021-10-27 16:08:44)
