
OpenCV OCR: improve data extraction from color image with background

I am trying to extract some information from mobile screenshots. My code retrieves some of it, but not all. I read the image, convert it to grayscale, remove the parts I don't need, and apply an adaptive Gaussian threshold. However, not all of the text gets read.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\\Installs\\Tools\\Tesseract-OCR\\tesseract.exe'

image = "C:\\Workspace\\OCR\\tesseract\\rpstocks1 - Copy (2).png"
img = cv2.imread(image)
img_grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)

height, width, channels = img.shape
print (height, width, channels)


rec_img=cv2.rectangle(img_grey,(30,100),(1040,704),(0,255,0),3).copy()

crop_img = rec_img[105:1945, 35:1035].copy()
cv2.medianBlur(img,5)
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_GREY.jpg",crop_img)

img_gauss = cv2.adaptiveThreshold(crop_img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,cv2.THRESH_BINARY,11,12)
cv2.imwrite("C:\\Workspace\\OCR\\tesseract\\Cropped_Guass.jpg",img_gauss)
text = pytesseract.image_to_string(img_gauss, lang='eng')
text.encode('utf-8')
print(text)

Output

Image Dimensions 704 1080 3

Investing

$9,712.99 
ASRT _ 0
500.46 shares  ......... ..  /0 
GNUS 
25169 Shares  """"" " ‘27.98%

Images: rpstocks1 - Copy (2).png, Cropped_GREY.jpg, Cropped_Guass.jpg

Have a look at the page segmentation modes of pytesseract (cf. this Q&A). For example, using config='-psm 12' will already give you all the desired texts (recent Tesseract releases spell this flag --psm). Nevertheless, the graphs are then also interpreted as text.
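As a rough sketch of how a page segmentation mode is passed along: the mode number simply goes into the config string handed to pytesseract. The helper names psm_config and ocr_with_psm below are made up for illustration, and the default flag spelling --psm is an assumption about the installed Tesseract release (the alpha build used further down accepted -psm).

```python
# A few Tesseract page segmentation modes, as listed by `tesseract --help-psm`.
PSM_MODES = {
    3: "Fully automatic page segmentation, but no OSD (default)",
    6: "Assume a single uniform block of text",
    11: "Sparse text. Find as much text as possible in no particular order",
    12: "Sparse text with OSD",
}


def psm_config(mode, flag="--psm"):
    """Build the config string for a given mode (hypothetical helper).

    `flag` is "--psm" on current Tesseract releases; pass "-psm" for an
    old build that still expects the single-dash spelling.
    """
    if not 0 <= mode <= 13:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    return f"{flag} {mode}"


def ocr_with_psm(image, mode):
    """Run pytesseract on `image` with the given mode (hypothetical helper)."""
    # Imported here so the config helper above can be used on its own;
    # this call needs pytesseract and a Tesseract binary to be installed.
    import pytesseract
    return pytesseract.image_to_string(image, lang="eng", config=psm_config(mode))


print(psm_config(12))  # --psm 12
```

With the question's thresholded image, ocr_with_psm(img_gauss, 12) would be the sparse-text variant and ocr_with_psm(img_gauss, 6) the single-block variant.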

That's why I would preprocess the image to get single boxes (the actual texts, the graphs, the information at the top, etc.) and keep only the boxes with content of interest. The filtering can be done using

  • the y coordinate of the bounding rectangle (not in the upper 5 % of the image, that's the mobile phone status bar),
  • the width w of the bounding rectangle (not wider than 50 % of the image's width, these are the horizontal lines),
  • the x coordinate of the bounding rectangle (not in the middle third of the image, these are the graphs).
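To make the three rules concrete, here is a small standalone sketch that applies them to a handful of rectangles. The image size, the rectangle coordinates, and the name keep_rect are all invented for illustration; they are not measured from the screenshot.

```python
def keep_rect(rect, img_w, img_h):
    """Return True if an (x, y, w, h) box passes all three filter rules."""
    x, y, w, h = rect
    below_status_bar = y > 0.05 * img_h            # not in the upper 5 %
    not_a_line = w <= 0.5 * img_w                  # not wider than 50 %
    outside_middle_third = x < img_w / 3 or x > 2 * img_w / 3
    return below_status_bar and not_a_line and outside_middle_third


# Hypothetical image size and boxes, chosen only to exercise each rule
img_w, img_h = 1080, 2000
candidates = {
    "status bar clock": (40, 20, 100, 40),    # rejected: y is in the top 5 %
    "separator line": (0, 380, 1080, 2),      # rejected: spans the full width
    "ticker": (40, 400, 120, 36),             # kept: left third
    "graph": (420, 395, 300, 60),             # rejected: starts in middle third
    "percentage": (800, 404, 180, 36),        # kept: right third
}

kept = [name for name, rect in candidates.items()
        if keep_rect(rect, img_w, img_h)]
print(kept)  # ['ticker', 'percentage']
```

Note that the middle-third rule only looks at the left edge x of each box, so it relies on the graphs starting inside the middle third, as they do in this layout.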

What's left is to run pytesseract on each cropped image, for example with config='-psm 6' (assume a single uniform block of text), and to strip any line breaks from the texts.

This is my code:

import cv2
import pytesseract

# Read image
img = cv2.imread('cUcby.png')
hi, wi = img.shape[:2]

# Convert to grayscale for tesseract
img_grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Mask single boxes by thresholding and morphological closing in x direction
mask = cv2.threshold(img_grey, 248, 255, cv2.THRESH_BINARY_INV)[1]
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                        cv2.getStructuringElement(cv2.MORPH_RECT, (51, 1)))

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Get bounding rectangles
rects = [cv2.boundingRect(cnt) for cnt in cnts]

# Filter bounding rectangles:
# - not in the upper 5 % of the image (mobile phone status bar)
# - not wider than 50 % of the image's width (horizontal lines)
# - not being in the middle third of the image (graphs)
rects = [(x, y, w, h) for x, y, w, h in rects if
         (y > 0.05 * hi) and
         (w <= 0.5 * wi) and
         ((x < 0.3333 * wi) or (x > 0.6666 * wi))]

# Sort bounding rectangles first by y coordinate, then by x coordinate
rects = sorted(rects, key=lambda x: (x[1], x[0]))

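# Get texts from bounding rectangles from pytesseract
# (each crop is padded by 1 px; the start indices are clamped at 0 so a box
# touching the left or top edge does not yield a negative, wrap-around slice)
texts = [pytesseract.image_to_string(
    img_grey[max(y-1, 0):y+h+1, max(x-1, 0):x+w+1], config='-psm 6')
    for x, y, w, h in rects]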

# Remove line breaks
texts = [text.replace('\n', '') for text in texts]

# Output
print(texts)

And that's the output:

['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares', 'GNUS', '-27.98%', '251.69 shares']

Since you have the locations of the bounding rectangles, you could also re-arrange the whole text using that information.
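One possible way to do that re-arranging, sketched under assumptions: boxes whose top edges lie within a few pixels of each other are treated as one visual row, and each row is joined from left to right. The function name arrange_rows, the tolerance y_tol, and the coordinates below are invented for illustration; only the texts are taken from the output above.

```python
def arrange_rows(rects, texts, y_tol=10):
    """Group (x, y, w, h) boxes into visual rows and join their texts.

    A box joins the current row if its top edge is within `y_tol` pixels
    of the first box in that row; otherwise it starts a new row.
    """
    # Process boxes top to bottom, then left to right
    items = sorted(zip(rects, texts), key=lambda it: (it[0][1], it[0][0]))
    rows = []
    for (x, y, w, h), text in items:
        if rows and abs(y - rows[-1]["y"]) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by their x coordinate
    return ["  ".join(t for _, t in sorted(row["cells"])) for row in rows]


# Made-up coordinates paired with the texts from the output above
rects = [(40, 120, 200, 40), (40, 180, 260, 60),
         (40, 400, 120, 36), (800, 404, 180, 36), (40, 450, 220, 30),
         (40, 600, 130, 36), (800, 603, 180, 36), (40, 650, 220, 30)]
texts = ['Investing', '$9,712.99', 'ASRT', '-27.64%', '500.46 shares',
         'GNUS', '-27.98%', '251.69 shares']

for line in arrange_rows(rects, texts):
    print(line)
```

With these coordinates, each ticker ends up on the same line as its percentage, followed by a line with the share count. A suitable y_tol depends on the font size and resolution of the screenshot.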

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
PyCharm:       2021.1.1
OpenCV:        4.5.1
pytesseract:   4.00.00alpha
----------------------------------------
