简体   繁体   中英

Read text from image using OCR for the image which have two columns or three columns of data using python

In the example image (just a reference, my images will be of same pattern) a page which have full horizontal text and other have two horizontal column of text.

在此处输入图片说明

How to automatically detect the pattern of the document and read one after the other column of data in python?.

I am using Tesseract OCR with Psm 6, where it is reading horizontally which is wrong.

One way to accomplish this is using morphological operations and contour detection.

With the former you essentially "bleed" all characters into a big chunky blob. With the latter, you locate these blobs in your image and extract the ones that seem interesting (meaning: big enough). 提取的轮廓

Script used:

import cv2
import sys

SCALE = 4
AREA_THRESHOLD = 427505.0 / 2

def show_scaled(name, img):
    try:
        h, w  = img.shape
    except ValueError:
        h, w, _  = img.shape
    cv2.imshow(name, cv2.resize(img, (w // SCALE, h // SCALE)))

def main():
    img = cv2.imread(sys.argv[1])
    img = img[10:-10, 10:-10] # remove the border, it confuses contour detection
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    show_scaled("original", gray)

    # black and white, and inverted, because
    # white pixels are treated as objects in
    # contour detection
    thresholded = cv2.adaptiveThreshold(
                gray, 255,
                cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV,
                25,
                15
            )
    show_scaled('thresholded', thresholded)
    # I use a kernel that is wide enough to connect characters
    # but not text blocks, and tall enough to connect lines.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (13, 33))
    closing = cv2.morphologyEx(thresholded, cv2.MORPH_CLOSE, kernel)

    im2, contours, hierarchy = cv2.findContours(closing, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    show_scaled("closing", closing)

    for contour in contours:
        convex_contour = cv2.convexHull(contour)
        area = cv2.contourArea(convex_contour)
        if area > AREA_THRESHOLD:
            cv2.drawContours(img, [convex_contour], -1, (255,0,0), 3)

    show_scaled("contours", img)
    cv2.imwrite("/tmp/contours.png", img)
    cv2.waitKey()

if __name__ == '__main__':
    main()

Then all you need is to compute the bounding box of the contour, and cut it from the original image. Add a bit of a margin and feed the whole thing to tesseract.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM