
Pytesseract doesn't recognize decimal points

I'm trying to read the text in this image, which also contains decimal points and decimal numbers:

[image]

in this way:

img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

and what I get is:

73-82
Primo: 50 —

I've also tried specifying the Italian language, but the result is pretty similar:

73-82 _
Primo: 50

Searching through other questions on Stack Overflow, I found that the reading of decimal numbers can be improved by using a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any idea how to improve the reading of decimal numbers?
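
For reference, such a whitelist is passed through pytesseract's config string. A minimal sketch of that digits-only approach (the image path here is just a placeholder) would look like this:

import cv2
import pytesseract

# Placeholder path - substitute the actual image
img = cv2.imread('image.png')

# Restrict Tesseract to digits and the decimal point only.
# Note that this also drops all letters, which is why the words in the image get lost.
config = "-c tessedit_char_whitelist=0123456789."
print(pytesseract.image_to_string(img, config=config))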

I would suggest passing tesseract every row of text as a separate image.
For some reason it seems to solve the decimal point issue...

  • Convert the image from grayscale to black and white using cv2.threshold.
  • Use the cv2.dilate morphological operation with a very long horizontal kernel (to merge blocks across the horizontal direction).
  • Use find contours - each merged row is going to end up in a separate contour.
  • Find the bounding boxes of the contours.
  • Sort the bounding boxes according to the y coordinate.
  • Iterate over the bounding boxes, and pass the slices to pytesseract.

Here is the code:

import numpy as np
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows

path_to_image = 'image.png'

img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale

# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))


# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]  # Use index [-2] to be compatible to OpenCV 3 and 4

# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]

# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])


# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b

    if (h > 10) and (w > 10):
        # Crop a slice, and invert black and white (tesseract prefers black text on a white background).
        roi = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]

        text = pytesseract.image_to_string(roi, config="-c tessedit"
                                                        "_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                                                        " --psm 3"
                                                        " ")

        print(text)

I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.


Results:

Thresholded image after dilate:
[image]

First slice:
[image]

Second slice:
[image]

Third slice:
[image]

Output text:

7.3-8.2

Primo:50

You can recognize the text easily by down-sampling the image.

If you down-sample by a factor of 0.5, the result will be:

[image]

Now if you read it:

7.3 - 8.2
Primo: 50

I got the result using pytesseract version 0.3.7 (the current version at the time of writing).

Code:


# Load the libraries
import cv2
import pytesseract

# Load the image
img = cv2.imread("s9edQ.png")

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)

# OCR
txt = pytesseract.image_to_string(gry)
print(txt)

Explanation:


The input image contains a small artifact; you can see it on the right part of the image. On the other hand, the current image is still perfectly suitable for OCR recognition. You need to use pre-processing methods when the data in the image is not visible or is corrupted.
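
For example, a minimal pre-processing sketch, assuming a noisier grayscale input (the median-blur kernel size and the Otsu thresholding here are illustrative choices, not taken from the image above), could be:

import cv2
import pytesseract

# Load the image and convert to grayscale
img = cv2.imread("s9edQ.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Suppress small artifacts with a median blur (kernel size 3 is an assumption)
gry = cv2.medianBlur(gry, 3)

# Binarize with Otsu's automatic threshold
thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# OCR the cleaned-up image
txt = pytesseract.image_to_string(thr)
print(txt)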
