
Finding text between two lines using Python OpenCV

I want to identify and highlight / crop the text between two lines using Python (cv2).

One line is a wavy line at the top of the page. The second line can appear at any height on the page, from just below the first line of text to just above the last line.

An example:

[Image: Page 1]

I believe I need to use HoughLinesP() with suitable parameters for this. I've tried some examples involving a combination of erode + dilate + HoughLinesP.

For example:


    import cv2
    import numpy as np

    # `image` is the path to a scanned page
    img = cv2.imread(image)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    kernel_size = 5
    blur_gray = cv2.GaussianBlur(gray, (kernel_size, kernel_size), 0)

    # erode / dilate
    erode_kernel_param = (5, 200)   # (5, 50)
    dilate_kernel_param = (5, 5)  # (5, 75)

    img_erode = cv2.erode(blur_gray, np.ones(erode_kernel_param))
    img_dilate = cv2.dilate(img_erode, np.ones(dilate_kernel_param))

    # %% Second, run Canny edge detection.

    low_threshold = 50
    high_threshold = 150
    edges = cv2.Canny(img_dilate, low_threshold, high_threshold)

    # %% Then, use HoughLinesP to get the lines.
    # Adjust the parameters for better performance.

    rho = 1  # distance resolution in pixels of the Hough grid
    theta = np.pi / 180  # angular resolution in radians of the Hough grid
    threshold = 15  # min number of votes (intersections in Hough grid cell)
    min_line_length = 600  # min number of pixels making up a line
    max_line_gap = 20  # max gap in pixels between connectable line segments
    line_image = np.zeros_like(img)  # creating a blank to draw lines on

    # %%  Run Hough on edge detected image
    # Output "lines" is an array containing endpoints of detected line segments

    lines = cv2.HoughLinesP(edges, rho, theta, threshold, np.array([]),
                            min_line_length, max_line_gap)

    if lines is not None:
        for line in lines:
            for x1, y1, x2, y2 in line:
                cv2.line(line_image, (x1, y1), (x2, y2), (255, 0, 0), 5)

    # %% Draw the lines on the  image

    lines_edges = cv2.addWeighted(img, 0.8, line_image, 1, 0)

However, in many cases the lines don't get identified properly. Some examples of the errors:

  1. Too many lines being identified (ones in the text as well)
  2. Lines not being identified completely
  3. Lines not being identified at all

Am I on the right track? Do I just need to hit the correct combination of parameters, or is there a simpler way / trick which will let me reliably crop the text between these two lines?

In case it's relevant, I need to do this for ~450 pages. Here's the link to the book, in case someone wants to examine more examples of pages. https://archive.org/details/in.ernet.dli.2015.553713/page/n13/mode/2up

Thank you.


Solution

I've made minor modifications to the answer by Ari (thank you) and made the code a bit more comprehensible for my own sake. Here's my code.

The core idea is:

  • Find contours and their bounding rectangles.
  • The two widest contours represent the two lines.
  • Take the lower side of the top rectangle and the upper side of the bottom rectangle to bound the area (text) we are interested in.

import cv2

# `images` is assumed to be a list of paths to the scanned .jpg pages
for image in images:
    base_img = cv2.imread(image)
    height, width, channels = base_img.shape

    img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
    ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    img = cv2.bitwise_not(img)

    contours, hierarchy = cv2.findContours(
        img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE
    )

    # Get rectangle bounding contour
    rects = [cv2.boundingRect(contour) for contour in contours]

    # Rectangle is (x, y, w, h)
    # Top-Left point of the image is (0, 0), rightwards X, downwards Y

    # Sort the contours bigger width first
    rects.sort(key=lambda r: r[2], reverse=True)

    # Get the 2 "widest" rectangles
    line_rects = rects[:2]
    line_rects.sort(key=lambda r: r[1])

    # If at least two rectangles (contours) were found
    if len(line_rects) >= 2:
        top_x, top_y, top_w, top_h = line_rects[0]
        bot_x, bot_y, bot_w, bot_h = line_rects[1]

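        # Cropping the img
        # Crop between bottom y of the upper rectangle (i.e. top_y + top_h)
        # and the top y of lower rectangle (i.e. bot_y).
        # Copy the slice so the rectangle drawn on base_img below
        # does not show up in the crop (a plain slice is a view).
        crop_img = base_img[top_y+top_h:bot_y].copy()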

        # Highlight the area by drawing the rectangle
        # For full width, 0 and width can be used, while
        # For exact width (erroneous) top_x and bot_x + bot_w can be used
        rect_img = cv2.rectangle(
            base_img,
            pt1=(0, top_y + top_h),
            pt2=(width, bot_y),
            color=(0, 255, 0),
            thickness=2
        )
        cv2.imwrite(image.replace('.jpg', '.rect.jpg'), rect_img)
        cv2.imwrite(image.replace('.jpg', '.crop.jpg'), crop_img)
    else:
        print(f"Insufficient contours in {image}")
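One caveat with the `image.replace('.jpg', ...)` calls in the loop above: if a file name does not contain `.jpg` (for example `.png` or `.JPG`), `str.replace` returns the name unchanged and `cv2.imwrite` would overwrite the source file. A minimal sketch of a safer naming helper follows. The name `tagged_path` and the example file names are hypothetical, not part of the original code.

```python
from pathlib import Path


def tagged_path(image_path, tag):
    """Insert `tag` before the file extension: page.jpg -> page.<tag>.jpg."""
    p = Path(image_path)
    # with_name keeps the directory and swaps only the final component
    return str(p.with_name(f"{p.stem}.{tag}{p.suffix}"))


# Hypothetical file names, for illustration only
rect_name = tagged_path("scans/page_013.jpg", "rect")
crop_name = tagged_path("scans/page_013.JPG", "crop")
```

In the loop, `cv2.imwrite(tagged_path(image, 'rect'), rect_img)` would then work for any extension that `cv2.imwrite` supports.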

You can find the contours, and then take the two with the greatest width.

import cv2

base_img = cv2.imread('a.png')

img = cv2.cvtColor(base_img, cv2.COLOR_BGR2GRAY)
ret, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
img = cv2.bitwise_not(img)

cnts, hierarchy = cv2.findContours(img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

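# sort the contours, widest first; sorted() also works when
# findContours returns a tuple (newer OpenCV) instead of a list
cnts = sorted(cnts, key=lambda c: cv2.boundingRect(c)[2], reverse=True)

# get the 2 big lines
lines = [cv2.boundingRect(cnts[0]), cv2.boundingRect(cnts[1])]
# higher line first
lines.sort(key=lambda c: c[1])
# cropping the img from the top of the upper line to the top of the lower line
# (start at lines[0][1] + lines[0][3] to leave out the upper line itself)
crop_img = base_img[lines[0][1]:lines[1][1]]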
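
Both snippets above take the two widest contours on trust. On a page where one of the lines is broken or missing, the second "widest" contour may be a word or a speck, and the crop will be wrong without any warning. The following is a rough sketch of the same selection step with a width guard added. It works on plain `(x, y, w, h)` tuples, so it does not need OpenCV itself. The function name `text_band`, the `min_width_frac` parameter and its default of 0.5 are assumptions for illustration and would need tuning against the real scans.

```python
def text_band(rects, page_width, min_width_frac=0.5):
    """Return (y_start, y_end) of the rows between the two widest rectangles.

    rects: iterable of (x, y, w, h) tuples, e.g. from cv2.boundingRect.
    Returns None when fewer than two rectangles are wide enough.
    """
    # Keep only rectangles wide enough to plausibly be one of the lines
    wide = [r for r in rects if r[2] >= min_width_frac * page_width]
    if len(wide) < 2:
        return None
    # Two widest candidates, then order them top to bottom by y
    widest_two = sorted(wide, key=lambda r: r[2], reverse=True)[:2]
    top, bottom = sorted(widest_two, key=lambda r: r[1])
    y_start = top[1] + top[3]  # bottom edge of the upper line
    y_end = bottom[1]          # top edge of the lower line
    if y_start >= y_end:
        return None            # the two rectangles overlap vertically
    return y_start, y_end


# Made-up rectangles for a 1000 px wide page: two lines and two words
rects = [(50, 40, 900, 12), (60, 100, 80, 20),
         (55, 700, 880, 4), (300, 400, 120, 22)]
band = text_band(rects, page_width=1000)
```

With OpenCV, `rects` would be `[cv2.boundingRect(c) for c in contours]`, and when `band` is not `None` the crop is `base_img[band[0]:band[1]]`. Pages for which the function returns `None` can be logged and checked by hand.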
