
How to merge the bounding boxes into one

I have some sketched images where my goal is to segment the subfigures or objects. I have the original image and the corresponding mask image. My goal is to detect the contour of the mask image and use that contour information to draw the bounding box on the original image. My code works well for most of the images, but not for all of them.

Success scenario (attached image 1): [image]

Failed scenario (attached image 2): [image]

I have attached images where my code fails to produce a clean bounding box:

attached image 3: unsuccessful original image

attached image 4: unsuccessful original mask

attached image 5: unsuccessful original image

attached image 6: unsuccessful original mask

My code:

import cv2
import os

os.chdir(r'D:\job\LAL\data\data\400_figures\test')
img = cv2.imread('9.jpg')          # original image
img1 = cv2.imread('output_9.jpg')  # masked image

# Detect edges on the mask and find the external contours.
# The [-2:] slice keeps the unpacking compatible with both OpenCV 3 and 4:
gray = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
canny_get_edge = cv2.Canny(gray, 40, 250)
contours, hierarchy = cv2.findContours(canny_get_edge, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2:]
cv2.drawContours(img, contours, -1, (0, 255, 0), 4)
cv2.imshow('Contours', img)
cv2.imwrite('result9.jpg', img)

# Load a fresh copy of the original image and draw one box per contour:
im = cv2.imread('9.jpg')
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Skip tiny contours (noise):
    if w < 50 or h < 50:
        continue
    cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(im, 'Detected', (x + w + 10, y + h), 0, 0.3, (0, 255, 0))
cv2.imshow('cc', im)
cv2.imwrite('box9.jpg', im)
cv2.waitKey(0)

Here's a possible solution. However, be aware that I cannot guarantee this will work with complex images you haven't posted yet. Also, you are receiving free help from strangers on the internet; don't expect a full solution that will solve your problems without any effort on your part. It's cool to help, but please, set your expectations accordingly and reasonably.

The approach involves getting the bounding box for the biggest object in the image. These are the assumptions:

  1. You process one sketch and caption PER image; this approach won't help if you have multiple figures in one image. You will have to cut them apart manually. For example, the image of the BBQ Grill must be separated into two images.

  2. Some figures and their captions cannot be separated with a rectangle. That's because enclosing the figure with a quadrilateral with four right angles will also enclose the caption if the latter is located within the area of said quadrilateral. (You would have to filter out the caption beforehand, extend this approach, or crop using a polygon, and that's a different problem.)

The approach involves reducing the image to its horizontal and vertical projections. We just need the starting and ending point of each projection and we can construct a bounding rectangle. Each projection is (ideally) just a line; if the caption of the figure does not overlap the projection of the figure, we can filter it out by keeping only the biggest projection. It's a nice approach that fits well for images like the BBQ Grill.

These are the steps:

  1. Convert the input to grayscale
  2. Resize the grayscale image, because the input images are very large and we don't need all that information
  3. Apply some morphology – A small closing to join little bits of the figure will do
  4. Reduce the image to its horizontal and vertical projections using the cv2.reduce function
  5. Filter the biggest/largest projection
  6. Get the starting and ending points (actually just a number) of the projection
  7. Construct the bounding rectangle
  8. Upscale the bounding rectangle

I've manually separated the grill image here: Part 1 and Part 2. Let's see the code:

# Imports
import cv2
import numpy as np

# Read image:
imagePath = "D://opencvImages//"
inputImage = cv2.imread(imagePath + "sketch03.png")

# Convert BGR to grayscale:
grayscaleImage = cv2.cvtColor(inputImage, cv2.COLOR_BGR2GRAY)

# Get image dimensions
originalImageHeight, originalImageWidth = grayscaleImage.shape[:2]

# Resize at a fixed scale:
resizePercent = 30
resizedWidth = int(originalImageWidth * resizePercent / 100)
resizedHeight = int(originalImageHeight * resizePercent / 100)

# resize image
resizedImage = cv2.resize(grayscaleImage, (resizedWidth, resizedHeight))

# Threshold via Otsu:
_, binaryImage = cv2.threshold(resizedImage, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

The first bit resizes the image to 30% of its original size. That's enough for the images you posted. The procedure is fairly straightforward and yields this (down-scaled) binary image:

We can apply a little bit of morphology to join the smaller parts of the figure into one solid component. Let's apply a closing (dilation followed by erosion) with a 3 x 3 rectangular structuring element:

# Perform a little bit of morphology:
# Set kernel (structuring element) size:
kernelSize = (3, 3)
# Set operation iterations:
opIterations = 1
# Get the structuring element:
morphKernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernelSize)
# Perform closing (dilation followed by erosion):
binaryImage = cv2.morphologyEx(binaryImage, cv2.MORPH_CLOSE, morphKernel, None, None, opIterations, cv2.BORDER_REFLECT101)

This is the result:

Ok, let's reduce the image. We first get the horizontal projection by reducing the rows, and then the vertical projection by reducing the columns. We will use the MAX mode, where each output value is the maximum intensity found along the corresponding row/column of the image.

Immediately after computing each projection, we can filter out the smaller lines. We compute the contours, get each one's "bounding rectangle" (in fact the rectangle is just a starting point and a length, since the projection is just a line) and keep the largest one. During this step we also store that starting point and length:

# Set number of reductions (dimensions):
dimensions = 2
# Store the data of both reductions here:
boundingRectsList = []

# Reduce the image:
for i in range(dimensions):

    # Reduce image, first horizontal, then vertical:
    reducedImg = cv2.reduce(binaryImage, i, cv2.REDUCE_MAX)

    # Get biggest line (biggest blob) and its start/ending coordinate,
    # set initial values for the largest contour:
    largestArea = 0

    # Find the contours on the binary image:
    contours, hierarchy = cv2.findContours(reducedImg, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)

    # Create a temporary tuple to store the rectangle data:
    tempRect = ()

    # Get the largest contour in the contours list:
    for j, c in enumerate(contours):
        boundRect = cv2.boundingRect(c)

        # Get the dimensions of the bounding rect:
        rectX = boundRect[0]
        rectY = boundRect[1]
        rectWidth = boundRect[2]
        rectHeight = boundRect[3]

        # Get the bounding rect area:
        area = rectWidth * rectHeight

        # Store the info of the largest contour:
        if area > largestArea:
            largestArea = area
            # Store the bounding rectangle data:
            if i == 0:
                # the first dimension is horizontal
                tempRect = (rectX, rectWidth)
            else:
                # the second dimension is vertical:
                tempRect = (rectY, rectHeight)

    # Got the biggest contour:
    boundingRectsList.append(tempRect)

And that's pretty much the meat of the process. These images show the horizontal and vertical projections of the first image.

Horizontal projection:

Vertical projection:

Note the second (smaller) line in the horizontal projection. It corresponds to the caption, which is ignored by our "biggest area" filter. All the relevant info is stored in the boundingRectsList variable. Let's construct the bounding rectangle, scale it back up and draw it on the original, full-size input:

# Compute resize factors:
horizontalFactor = originalImageWidth/resizedWidth
verticalFactor = originalImageHeight/resizedHeight

# Create bounding box:
boundingRectX = boundingRectsList[0][0] * horizontalFactor
boundingRectY = boundingRectsList[1][0] * verticalFactor

boundingRectWidth = boundingRectsList[0][1] * horizontalFactor
boundingRectHeight = boundingRectsList[1][1] * verticalFactor

# Draw the bounding rectangle on the original input:
color = (0, 0, 255)
cv2.rectangle(inputImage, (int(boundingRectX), int(boundingRectY)),
              (int(boundingRectX + boundingRectWidth), int(boundingRectY + boundingRectHeight)), color, 2)

# Show image:
cv2.imshow("Rectangle", inputImage)
cv2.waitKey(0)

This yields:

The second image of the grill:

The first image of the shoe:

