
Pytesseract doesn't recognize decimal points

I'm trying to read the text in this image, which also contains decimal points and decimal numbers [image: enter image description here]

in this way:

img = cv2.imread(path_to_image)
print(pytesseract.image_to_string(img))

and what I get is:

73-82
Primo: 50 —

I've also tried specifying the Italian language, but the result is pretty similar:

73-82 _
Primo: 50

Searching through other questions on Stack Overflow, I found that reading decimal numbers can be improved with a whitelist, in this case tessedit_char_whitelist='0123456789.', but I also want to read the words in the image. Any ideas on how to improve the reading of decimal numbers?
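As a rough sketch (not taken from the original posts), a whitelist does not have to be digits only: it can list letters, digits and the punctuation expected in the image. The helper name build_whitelist_config and its defaults are made up for illustration, and whether Tesseract honours the whitelist may depend on its version and engine mode.

```python
import string


def build_whitelist_config(extra_chars: str = "-:.", psm: int = 3) -> str:
    """Build a Tesseract config string that whitelists letters, digits and extra_chars.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    Spaces are deliberately left out of the whitelist, since the config string
    is split on whitespace.
    """
    whitelist = string.ascii_letters + string.digits + extra_chars
    return f"-c tessedit_char_whitelist={whitelist} --psm {psm}"


config = build_whitelist_config()
print(config)

# Possible usage, with img loaded as in the question:
#   text = pytesseract.image_to_string(img, config=config)
```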

I would suggest passing every row of text to Tesseract as a separate image.
For some reason, it seems to solve the decimal point issue...

  • Convert the image from grayscale to black and white using cv2.threshold.
  • Apply the cv2.dilate morphological operation with a very long horizontal kernel (this merges blocks along the horizontal direction).
  • Use cv2.findContours: each merged row ends up in a separate contour.
  • Find the bounding boxes of the contours.
  • Sort the bounding boxes by their y coordinate.
  • Iterate over the bounding boxes, and pass the slices to pytesseract.

Here is the code:

import numpy as np
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # I am using Windows

path_to_image = 'image.png'

img = cv2.imread(path_to_image, cv2.IMREAD_GRAYSCALE)  # Read input image as Grayscale

# Convert to binary using automatic threshold (use cv2.THRESH_OTSU)
ret, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate thresh for uniting text areas into blocks of rows.
dilated_thresh = cv2.dilate(thresh, np.ones((3,100)))


# Find contours on dilated_thresh
cnts = cv2.findContours(dilated_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]  # Index [-2] works with both OpenCV 3 and OpenCV 4

# Build a list of bounding boxes
bounding_boxes = [cv2.boundingRect(c) for c in cnts]

# Sort bounding boxes from "top to bottom"
bounding_boxes = sorted(bounding_boxes, key=lambda b: b[1])


# Iterate bounding boxes
for b in bounding_boxes:
    x, y, w, h = b

    if (h > 10) and (w > 10):
        # Crop a slice with 10 px of padding, and invert black and white (Tesseract prefers black text).
        # Named row_img to avoid shadowing the built-in `slice`.
        row_img = 255 - thresh[max(y-10, 0):min(y+h+10, thresh.shape[0]), max(x-10, 0):min(x+w+10, thresh.shape[1])]

        text = pytesseract.image_to_string(
            row_img,
            config="-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-:."
                   " --psm 3")

        print(text)

I know it's not the most general solution, but it manages to solve the sample you have posted.
Please treat the answer as a conceptual solution - finding a robust solution might be very challenging.
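
The padded, clamped crop used in the loop can also be written as a small helper, which makes the index arithmetic easier to check on its own. This is only a sketch: the name padded_crop_bounds and the sample boxes and image size are made up for illustration, and the boxes are assumed to be in the (x, y, w, h) form returned by cv2.boundingRect.

```python
def padded_crop_bounds(box, image_shape, pad=10):
    """Return (top, bottom, left, right) for `box` grown by `pad` pixels
    on every side and clamped to the image borders.

    Hypothetical helper; `box` is (x, y, w, h) and `image_shape` is (rows, cols, ...).
    """
    x, y, w, h = box
    height, width = image_shape[:2]
    top = max(y - pad, 0)
    bottom = min(y + h + pad, height)
    left = max(x - pad, 0)
    right = min(x + w + pad, width)
    return top, bottom, left, right


# Made-up boxes and image size, only to exercise the helper.
boxes = [(5, 120, 200, 30), (5, 20, 180, 30)]
image_shape = (160, 220)

for box in sorted(boxes, key=lambda b: b[1]):  # top to bottom
    print(padded_crop_bounds(box, image_shape))
```

With such a helper, the crop in the loop would read thresh[top:bottom, left:right].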


Results:

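Thresholded image after dilation:
[image: enter image description here]

First slice:
[image: enter image description here]

Second slice:
[image: enter image description here]

Third slice:
[image: enter image description here]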

Output text:

7.3-8.2

Primo:50

You can easily recognize the text by down-sampling the image.

If you down-sample by a factor of 0.5, the result will be:

[image: enter image description here]

Now, if you run OCR on it, you get:

7.3 - 8.2
Primo: 50

I got this result using pytesseract version 0.3.7 (the current version at the time of writing).

Code:


# Load the libraries
import cv2
import pytesseract

# Load the image
img = cv2.imread("s9edQ.png")

# Convert to the gray-scale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Down-sample
gry = cv2.resize(gry, (0, 0), fx=0.5, fy=0.5)

# OCR
txt = pytesseract.image_to_string(gry)
print(txt)
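
If it is not obvious which scale factor works best, one option is to run OCR at a few scales and compare the outputs. The sketch below is not part of the original answer: it assumes a simple heuristic (prefer the output with the most decimal-looking numbers), the names count_decimals and best_scale are made up, and the sample strings are adapted from the outputs quoted in this thread.

```python
import re

# Matches tokens such as "7.3" or "12.05".
DECIMAL = re.compile(r"\d+\.\d+")


def count_decimals(text: str) -> int:
    """Count the tokens in `text` that look like decimal numbers."""
    return len(DECIMAL.findall(text))


def best_scale(results: dict) -> float:
    """Return the scale whose OCR text contains the most decimal-looking numbers.

    Assumed heuristic for illustration only; ties go to the first scale listed.
    """
    return max(results, key=lambda scale: count_decimals(results[scale]))


# OCR outputs keyed by the scale factor that produced them.
results = {
    1.0: "73-82\nPrimo: 50",
    0.5: "7.3 - 8.2\nPrimo: 50",
}
print(best_scale(results))

# With OpenCV and pytesseract available, `results` could be filled like this:
#   for s in (1.0, 0.75, 0.5):
#       small = cv2.resize(gry, (0, 0), fx=s, fy=s, interpolation=cv2.INTER_AREA)
#       results[s] = pytesseract.image_to_string(small)
```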

Explanation:


The input image contains a small artifact, which you can see on its right side. Apart from that, the image is already well suited to OCR. You only need pre-processing methods when the data in the image is not visible or is corrupted. Please read the following:
