
How do I accurately retrieve the bounding box of an object detected using Tensorflow Object Detection API?

I'm trying to understand how to find the location of the bounding box when an object is detected. I used the Tensorflow Object Detection API to detect a mouse in a box. Just for testing purposes of how to retrieve the bounding box coordinates, when the mouse is detected, I want to print "THIS IS A MOUSE" right above its head. However, mine currently prints several inches off-kilter. For example, here is a screenshot from a video of my object detection.

[Screenshot from the detection video]

Here is the relevant code snippet:

with detection_graph.as_default(), tf.Session(graph=detection_graph) as sess:
    start = time.time()
    while True:

        # Read frame from camera
        ret, image_np = cap.read()

        cv2.putText(image_np, "Time Elapsed: {}s".format(int(time.time() - start)), (50,50),cv2.FONT_HERSHEY_PLAIN,3, (0,0,255),3)
        # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
        image_np_expanded = np.expand_dims(image_np, axis=0)
        # Extract image tensor
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        # Extract detection boxes
        boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
        # Extract detection scores
        scores = detection_graph.get_tensor_by_name('detection_scores:0')
        # Extract detection classes
        classes = detection_graph.get_tensor_by_name('detection_classes:0')
        # Extract number of detections
        num_detections = detection_graph.get_tensor_by_name(
            'num_detections:0')
        # Actual detection.
        (boxes, scores, classes, num_detections) = sess.run(
            [boxes, scores, classes, num_detections],
            feed_dict={image_tensor: image_np_expanded})
        # Visualization of the results of a detection.
        vis_util.visualize_boxes_and_labels_on_image_array(
            image_np,
            np.squeeze(boxes),
            np.squeeze(classes).astype(np.int32),
            np.squeeze(scores),
            category_index,
            use_normalized_coordinates=True,
            line_thickness=8)

        for i, b in enumerate(boxes[0]):
            if classes[0][i] == 1:
                if scores[0][i] >= .5:
                    mid_x = (boxes[0][i][3] + boxes[0][i][1]) / 2
                    mid_y = (boxes[0][i][2] + boxes[0][i][0]) / 2


                    cv2.putText(image_np, 'FOUND A MOUSE', (int(mid_x*600), int(mid_y*800)), cv2.FONT_HERSHEY_PLAIN, 2, (0,255,0), 3)

        # Display output
        cv2.imshow(vid_name, cv2.resize(image_np, (800, 600)))

        #Write to output
        video_writer.write(image_np)

        if cv2.waitKey(25) & 0xFF == ord('q'):
            cv2.destroyAllWindows()
            break


    cap.release()
    cv2.destroyAllWindows()

It's not really clear to me how boxes works. Can someone explain this line to me: mid_x = (boxes[0][i][3] + boxes[0][i][1]) / 2 ? I understand that indices 1 and 3 represent x_min and x_max, but I'm not sure why I'm iterating through boxes[0] only and what i represents.

Solution Just as ievbu suggested, I needed to convert the midpoint from its normalized values to pixel values for the frame. I used cv2 capture properties to get the frame width and height and used those to convert my midpoint to a pixel location.

frame_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
frame_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))

...
cv2.putText(image_np, '.', (int(mid_x*frame_w), int(mid_y*frame_h)), cv2.FONT_HERSHEY_PLAIN, 2, (0,255,0), 3)

Boxes are returned with an extra leading dimension because the model accepts a batch of images, and that dimension indexes each image in the batch (for a single input image you added it yourself with np.expand_dims). You can see that the visualization removes it with np.squeeze; if you process only one image, you can remove it manually just by taking boxes[0]. i is the index of a box within the boxes array; you need that index to access the class and score of the box you are analyzing.
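As a sketch of the shapes involved (the array values below are made up, not real model output):

```python
import numpy as np

# Simulated detection output for a batch of 1 image with 2 detections.
# Each box is [ymin, xmin, ymax, xmax] in normalized (0..1) coordinates.
boxes = np.array([[[0.10, 0.20, 0.50, 0.60],
                   [0.30, 0.40, 0.70, 0.80]]])
scores = np.array([[0.9, 0.3]])
classes = np.array([[1.0, 17.0]])

print(boxes.shape)               # (1, 2, 4): (batch, detection index i, coords)
print(np.squeeze(boxes).shape)   # (2, 4): batch dimension removed
print(boxes[0][0])               # coordinates of detection i = 0
print(scores[0][0])              # score of that same detection
```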

The text is not in the correct position because the returned box coordinates are normalized, and you have to scale them to the full image size. Here is an example of how you can convert them:

(im_height, im_width, _) = frame.shape
ymin, xmin, ymax, xmax = box
(xmin, xmax, ymin, ymax) = (xmin * im_width, xmax * im_width,
                            ymin * im_height, ymax * im_height)
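Putting both pieces together, here is a small self-contained helper (the name to_pixel_coords is mine, not from the API) that converts one normalized [ymin, xmin, ymax, xmax] box to pixel coordinates and also returns the midpoint you would pass to cv2.putText:

```python
def to_pixel_coords(box, frame_w, frame_h):
    """Convert a normalized [ymin, xmin, ymax, xmax] box to pixel values.

    Returns ((left, top, right, bottom), (mid_x, mid_y)), all in pixels.
    """
    ymin, xmin, ymax, xmax = box
    left, right = int(xmin * frame_w), int(xmax * frame_w)
    top, bottom = int(ymin * frame_h), int(ymax * frame_h)
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    return (left, top, right, bottom), (mid_x, mid_y)

# Example: a box covering the central half of an 800x600 frame
box_px, mid = to_pixel_coords([0.25, 0.25, 0.75, 0.75], 800, 600)
print(box_px)  # (200, 150, 600, 450)
print(mid)     # (400, 300)
```

With this, the label can be drawn at (mid_x, mid_y) on the original frame before any cv2.resize, so the text lines up with the box.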
