
OpenCV OCR: improving data extraction from a color image with a background

I am trying to extract some information from mobile screenshots. My code retrieves some of it, but not all. I read the image, convert it to grey, crop away the parts that are not required, and apply an adaptive Gaussian threshold. Still, not all of the text is read.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Installs\Tools\Tesseract-OCR\tesseract.exe'

image = "C:\\Workspace\\OCR\\tesseract\\rpstocks1 - Copy (2).png"
img = cv2.imread(image)
# cv2.imread returns the channels in BGR order, so convert with COLOR_BGR2GRAY
img_grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

height, width, channels = img.shape
print (height, width, channels)


rec_img=cv2.rectangle(img_grey,(30,100),(1040,704),(0,255,0),3).copy()

crop_img = rec_img[105:1945, 35:1035].copy()
cv2.medianBlur(img,5)  # note: the return value is discarded, so this call has no effect
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_GREY.jpg",crop_img)

img_gauss = cv2.adaptiveThreshold(crop_img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,cv2.THRESH_BINARY,11,12)
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_Guass.jpg",img_gauss)
text = pytesseract.image_to_string(img_gauss, lang='eng')
text.encode('utf-8')  # note: the encoded bytes are discarded; text itself is unchanged
print(text)

Output

Image Dimensions 704 1080 3

Investing

$9,712.99 
ASRT _ 0
500.46 shares  ......... ..  /0 
GNUS 
25169 Shares  """"" " ‘27.98%

[Images: rpstocks1 - Copy (2).png (original screenshot), Cropped_GREY.jpg, Cropped_Guass.jpg]

Have a look at Tesseract's page segmentation modes, which pytesseract passes through its config argument, cf. this Q&A. For example, using config='--psm 12' (spelled -psm in Tesseract 3.x) already gives you all the desired texts. However, the graphs are then also interpreted as text.
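As a rough sketch of how the mode could be selected, the helper below builds the config string. The names PSM_MODES and psm_config are hypothetical, the table lists only a few of the modes, and the 0..13 range and the `--psm` spelling assume Tesseract 4.x.

```python
# A few page segmentation modes as described in Tesseract's help output
# (not the full list; assumes Tesseract 4.x).
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (default)",
    6: "assume a single uniform block of text",
    7: "treat the image as a single text line",
    11: "sparse text: find as much text as possible in no particular order",
    12: "sparse text with OSD",
}


def psm_config(mode):
    """Build the config string for a page segmentation mode (hypothetical helper)."""
    if not 0 <= mode <= 13:
        raise ValueError(f"PSM must be in 0..13, got {mode}")
    return f"--psm {mode}"


# Intended use (needs Tesseract installed, so it is left as a comment here):
# text = pytesseract.image_to_string(img_grey, config=psm_config(12))
print(psm_config(12), "->", PSM_MODES[12])
```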

That's why I would preprocess the image to get single boxes (the actual texts, the graphs, the information at the top, etc.) and keep only those boxes that contain content of interest. The filtering can use

  • the y coordinate of the bounding rectangle (not in the upper 5 % of the image, that's the mobile phone status bar),
  • the width w of the bounding rectangle (not wider than 50 % of the image's width, these are the horizontal lines),
  • the x coordinate of the bounding rectangle (not in the middle third of the image, these are the graphs).
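The three rules can be tried out on their own before any image is involved. This is only a sketch: keep_rect is a hypothetical helper, and the rectangles and the 1080 x 704 image size are made-up values for illustration. As in the rules above, only the left edge x of a box is tested against the middle third.

```python
def keep_rect(rect, img_w, img_h):
    """Return True if an (x, y, w, h) box passes all three filter rules."""
    x, y, w, h = rect
    below_status_bar = y > 0.05 * img_h          # not in the upper 5 %
    not_a_line = w <= 0.5 * img_w                # not wider than half the image
    outside_middle = x < img_w / 3 or x > 2 * img_w / 3  # left edge not in middle third
    return below_status_bar and not_a_line and outside_middle


# Hypothetical boxes on a hypothetical 1080 x 704 image
img_w, img_h = 1080, 704
rects = [
    (40, 10, 200, 20),    # status bar: y is within the top 5 % (35.2 px)
    (40, 120, 300, 40),   # text on the left
    (0, 200, 1080, 2),    # horizontal line: too wide
    (400, 250, 250, 60),  # graph: x lies in the middle third (360..720)
    (850, 260, 150, 30),  # percentage on the right
]
kept = [r for r in rects if keep_rect(r, img_w, img_h)]
print(kept)  # [(40, 120, 300, 40), (850, 260, 150, 30)]
```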

What's left is to run pytesseract on each cropped image, for example with config='--psm 6' (assume a single uniform block of text; Tesseract 3.x spells the option -psm), and to strip any line breaks from the resulting texts.
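Two small details of that step can be sketched without running Tesseract: padding a crop by one pixel without leaving the image, and tidying the returned string. Both crop_bounds and clean_text are hypothetical helpers. Note that clean_text collapses all whitespace into single spaces, which is an assumption of this sketch and differs slightly from simply deleting the line breaks.

```python
def crop_bounds(rect, img_w, img_h, pad=1):
    """Return (y0, y1, x0, x1) for an (x, y, w, h) box, padded and clamped to the image."""
    x, y, w, h = rect
    x0 = max(x - pad, 0)
    y0 = max(y - pad, 0)
    x1 = min(x + w + pad, img_w)
    y1 = min(y + h + pad, img_h)
    return y0, y1, x0, x1


def clean_text(raw):
    """Collapse line breaks and other whitespace runs into single spaces."""
    return " ".join(raw.split())


# Intended use (needs OpenCV and Tesseract, so it is left as a comment here):
# y0, y1, x0, x1 = crop_bounds(rect, wi, hi)
# text = clean_text(pytesseract.image_to_string(img_grey[y0:y1, x0:x1], config='--psm 6'))
print(crop_bounds((0, 50, 100, 20), 1080, 704))  # (49, 71, 0, 101)
print(clean_text("500.46 shares\n\x0c"))         # 500.46 shares
```

Clamping matters for boxes that touch the left or top border: a start index of -1 would be read by NumPy as "last row or column" and could produce an empty crop.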

That'd be my code:

import cv2
import pytesseract

# Read image
img = cv2.imread('cUcby.png')
hi, wi = img.shape[:2]

# Convert to grayscale for Tesseract
img_grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Mask single boxes by thresholding and morphological closing in x direction
mask = cv2.threshold(img_grey, 248, 255, cv2.THRESH_BINARY_INV)[1]
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                        cv2.getStructuringElement(cv2.MORPH_RECT, (51, 1)))

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Get bounding rectangles
rects = [cv2.boundingRect(cnt) for cnt in cnts]

# Filter bounding rectangles:
# - not in the upper 5 % of the image (mobile phone status bar)
# - not wider than 50 % of the image's width (horizontal lines)
# - not being in the middle third of the image (graphs)
rects = [(x, y, w, h) for x, y, w, h in rects if
         (y > 0.05 * hi) and
         (w <= 0.5 * wi) and
         ((x < 0.3333 * wi) or (x > 0.6666 * wi))]

# Sort bounding rectangles first by y coordinate, then by x coordinate
rects = sorted(rects, key=lambda x: (x[1], x[0]))

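# Get texts from bounding rectangles from pytesseract
# (slice starts are clamped at 0, so a box touching the border is not cropped to an empty image)
texts = [pytesseract.image_to_string(
    img_grey[max(y-1, 0):y+h+1, max(x-1, 0):x+w+1], config='--psm 6')
    for x, y, w, h in rects]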

# Remove line breaks
texts = [text.replace('\n', '') for text in texts]

# Output
print(texts)

And that's the output:

['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares', 'GNUS', '-27.98%', '251.69 shares']

Since you have the locations of the bounding rectangles, you could also re-arrange the whole text using that information.
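One possible way to do that re-arrangement is to group boxes whose vertical centres lie close together into one row and to order each row from left to right. This is only a sketch: group_into_rows, the tolerance of 10 pixels, and the coordinates below are hypothetical and not taken from the actual image.

```python
def group_into_rows(items, tol=10):
    """Group ((x, y, w, h), text) pairs into rows by vertical centre.

    A box joins the current row if its centre is within `tol` pixels of the
    centre of the first box in that row; rows are read left to right.
    """
    def centre_y(item):
        rect = item[0]
        return rect[1] + rect[3] / 2

    rows = []
    for item in sorted(items, key=centre_y):
        cy = centre_y(item)
        if rows and abs(cy - rows[-1]["cy"]) <= tol:
            rows[-1]["items"].append(item)
        else:
            rows.append({"cy": cy, "items": [item]})

    return [" ".join(text for _, text in sorted(row["items"], key=lambda it: it[0][0]))
            for row in rows]


# Hypothetical boxes paired with some of the recognised texts
items = [
    ((850, 305, 150, 30), "-27.64%"),
    ((40, 120, 300, 40), "Investing"),
    ((40, 300, 120, 30), "ASRT"),
    ((40, 180, 400, 60), "$9,712.99"),
    ((40, 340, 200, 24), "500.46 shares"),
]
lines = group_into_rows(items)
print("\n".join(lines))
```

With these made-up coordinates, 'ASRT' and '-27.64%' end up on the same line, because their vertical centres are only 5 pixels apart.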

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
PyCharm:       2021.1.1
OpenCV:        4.5.1
pytesseract:   4.00.00alpha
----------------------------------------
