
Remove text boxes for OCR with OpenCV

I am trying to run OCR (using Google's Tesseract) on a document with the following format:

[image]

However, Tesseract reads the short bars between the characters as letters or digits (l, i, or 1).

As a pre-processing measure I tried to remove vertical and horizontal lines using the following code:

import cv2

from pdf2image import convert_from_path

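# Render the PDF at 500 DPI; every page is saved to the same file, so only the last page is kept
pages = convert_from_path('..\\app\\1.pdf', 500)
for page in pages:
    page.save('..\\app\\out.jpg', 'JPEG')

# Read back the file that was just written (same path as in page.save)
image = cv2.imread('..\\app\\out.jpg')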
result = image.copy()
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Remove horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40,1))
remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(result, [c], -1, (255,255,255), 5)

# Remove vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,40))
remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(result, [c], -1, (255,255,255), 5)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.imwrite('result.png', result)
cv2.waitKey()

The problem is that this removes most of the vertical and horizontal lines in the document, including the start and end lines on the left and right sides of the image, but not the small bars in between.

I'm wondering if I am going about this wrong by trying to pre-process and remove lines. Is there a better way to pre-process or another way to solve this problem?

Since the form fields are not connected to the characters, you can filter by contour area to isolate the text characters. The idea is to apply a Gaussian blur, then Otsu's threshold to obtain a binary image. From there, we find contours and filter them by contour area against a predetermined threshold value. Filling the large contours in with cv2.drawContours effectively removes the lines.


Binary image

[image]

Removed lines

[image]

Invert ready for OCR

[image]

OCR result using Pytesseract

HELLO

Code

import cv2
import pytesseract

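# Windows: point pytesseract at the Tesseract executable (adjust or remove this line on other systems)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"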

# Load image, grayscale, blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Find contours and filter using contour area
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    area = cv2.contourArea(c)
    if area > 500:
        cv2.drawContours(thresh, [c], -1, 0, -1)

# Invert image and OCR
invert = 255 - thresh
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('invert', invert)
cv2.waitKey()

Note: If you still want to use the horizontal/vertical line removal approach, you need to reduce the vertical kernel size, for instance from (1,40) to (1,10). This helps remove shorter lines, but it may also remove some of the vertical strokes in the text, such as in the letter L.
