Object detection inference with multiple GPUs and multithreading in PyTorch

Question

I am trying to detect objects in a video using multiple GPUs. I want to distribute frames across the GPUs for inference to reduce the total processing time. I succeeded in running inference on a single GPU, but failed to run it on multiple GPUs.

I thought that dividing the frames by the number of GPUs and running inference on each part would reduce the time. If there is another way to reduce the running time, I would be glad to receive suggestions.

I am using a pre-trained model provided by PyTorch. What I tried is as follows:

1. I read the video and divide the frames by the number of GPUs I have (currently two NVIDIA GeForce GTX 1080 Ti cards).

2. Then I distribute the frames to the GPUs and run object detection inference.
(Later I plan to use multiple threads to distribute frames to the GPUs dynamically, but for now the split is static.)
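The dynamic distribution mentioned in step 2 could look roughly like the sketch below: one worker thread per device pulls frames from a shared queue, so a faster GPU simply takes more frames. The names `run_workers` and `infer_fns` are hypothetical and not part of the code in this question; each callable in `infer_fns` is assumed to wrap its own model copy living on its own device. The stand-in functions at the bottom only mimic inference so the sketch runs without a GPU.

```python
import queue
import threading


def run_workers(frames, infer_fns):
    """Dispatch frames dynamically to one worker thread per device.

    infer_fns is a hypothetical list of callables, one per device; each is
    assumed to wrap its own model replica on that device.
    Results are returned in the original frame order.
    """
    tasks = queue.Queue()
    for index, frame in enumerate(frames):
        tasks.put((index, frame))

    results = [None] * len(frames)

    def worker(infer):
        while True:
            try:
                index, frame = tasks.get_nowait()
            except queue.Empty:
                return  # no frames left for this worker
            # Each index is written by exactly one thread.
            results[index] = infer(frame)

    threads = [threading.Thread(target=worker, args=(fn,)) for fn in infer_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-ins for per-GPU inference functions (assumptions for illustration).
    def fake_gpu0(frame):
        return frame * 2

    def fake_gpu1(frame):
        return frame * 2

    print(run_workers(list(range(6)), [fake_gpu0, fake_gpu1]))
```

Keeping a separate callable (and therefore a separate model copy) per device avoids sharing one model object between GPUs, which is the situation the static version below runs into.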

The same approach worked well in TensorFlow using with tf.device(), and I am trying to make it work in PyTorch as well.

pytorch_multithread.py

...
    def detection_gpu(frame_list, device_name, device, detect, model):
        model.to(device)
        model.eval()

        for frame in frame_list:
            start = time.time()
            detect.bounding_box_rcnn(frame, model=model)
            end = time.time()

            cv2.putText(frame, '{:.2f}ms'.format((end - start) * 1000), (40, 40), cv2.FONT_HERSHEY_SIMPLEX, 0.75,
                        (255, 0, 0),
                        2)

            cv2.imshow(str(device_name), frame)

            if cv2.waitKey(1) & 0xFF == ord('q'):
                break


    def main():
        args = arg_parse()

        VIDEO_PATH = args.video

        print("Loading network.....")
        model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
        print("Network successfully loaded")

        num_gpus = torch.cuda.device_count()
        if torch.cuda.is_available() and num_gpus > 1:
            device = ["cuda:{}".format(i) for i in range(num_gpus)]
        elif num_gpus == 1:
            device = "cuda"
        else:
            device = "cpu"
        # class names, e.g. person, car, truck, etc.
        PATH_TO_LABELS = "labels/mscoco_labels.names"

        # load detection class, default confidence threshold is 0.5
        if num_gpus>1:
            detect = [DetectBoxes(PATH_TO_LABELS, device[i], conf_threshold=args.confidence) for i in range(num_gpus)]
        else:
            detect = [DetectBoxes(PATH_TO_LABELS, device, conf_threshold=args.confidence) for i in range(1)]

        cap = cv2.VideoCapture(VIDEO_PATH)

        # total number of frames in the video
        frame_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

        # TODO: take the CPU-only environment into account
        # divide frames of video by number of gpus
        div = frame_length // num_gpus
        divide_point = [i for i in range(frame_length) if i != 0 and i % div == 0]
        divide_point.pop()

        frame_list = []
        fragments = []
        count = 0
        while cap.isOpened():
            hasFrame, frame = cap.read()
            if not hasFrame:
                frame_list.append(fragments)
                break
            if count in divide_point:
                frame_list.append(fragments)
                fragments = []
            fragments.append(frame)
            count += 1
        cap.release()


        detection_gpu(frame_list[0], 0, device[0], detect[0], model)
        detection_gpu(frame_list[1], 1, device[1], detect[1], model)
        # Process object detection using threading
        # thread_detection = [ThreadWithReturnValue(target=detection_gpu,
        #                                           args=(frame_list[i], i, detect, model))
        #                     for i in range(num_gpus)]
        #
        #
        # final_list = []
        # # Begin operating threads
        # for th in thread_detection:
        #     th.start()
        #
        # # Once tasks are completed get return value (frames) and put to new list
        # for th in thread_detection:
        #     final_list.extend(th.join())
        cv2.destroyAllWindows()

detection_boxes_pytorch.py

    def bounding_box_rcnn(self, frame, model):
            print(self.device)
            # Image is converted to image Tensor
            transform = transforms.Compose([transforms.ToTensor()])
            img = transform(frame).to(self.device)
            with torch.no_grad():
                # The image is passed through model to get predictions
                pred = model([img])

            # classes, bounding boxes and confidence scores are extracted;
            # only detections scoring above confThreshold are passed to draw_boxes
            pred_class = [self.classes[i] for i in list(pred[0]['labels'].cpu().clone().numpy())]
            pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in list(pred[0]['boxes'].detach().cpu().clone().numpy())]
            pred_score = list(pred[0]['scores'].detach().cpu().clone().numpy())
            pred_t = [pred_score.index(x) for x in pred_score if x > self.confThreshold][-1]
            pred_colors = [i for i in list(pred[0]['labels'].cpu().clone().numpy())]
            pred_boxes = pred_boxes[:pred_t + 1]
            pred_class = pred_class[:pred_t + 1]

            for i in range(len(pred_boxes)):
                left = int(pred_boxes[i][0][0])
                top = int(pred_boxes[i][0][1])
                right = int(pred_boxes[i][1][0])
                bottom = int(pred_boxes[i][1][1])

                color = STANDARD_COLORS[pred_colors[i] % len(STANDARD_COLORS)]

                self.draw_boxes(frame, pred_class[i], pred_score[i], left, top, right, bottom, color)

The error I get is as follows:

    Traceback (most recent call last):
      File "C:/Users/username/Desktop/Object_Detection_Video_AllInOne/pytorch_multithread.py", line 133, in <module>
        main()
      File "C:/Users/username/Desktop/Object_Detection_Video_AllInOne/pytorch_multithread.py", line 113, in main
        detection_gpu(frame_list[1], 1, device[1], detect[1], model)
      File "C:/Users/username/Desktop/Object_Detection_Video_AllInOne/pytorch_multithread.py", line 39, in detection_gpu
        detect.bounding_box_rcnn(frame, model=model)
      File "C:\Users\username\Desktop\Object_Detection_Video_AllInOne\p_utils\detection_boxes_pytorch.py", line 64, in bounding_box_rcnn
        pred = model([img])
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torchvision\models\detection\generalized_rcnn.py", line 51, in forward
        proposals, proposal_losses = self.rpn(images, features, targets)
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torchvision\models\detection\rpn.py", line 409, in forward
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torchvision\models\detection\_utils.py", line 168, in decode
        rel_codes.reshape(sum(boxes_per_image), -1), concat_boxes
      File "C:\Users\username\AppData\Local\Programs\Python\Python37\lib\site-packages\torchvision\models\detection\_utils.py", line 199, in decode_single
        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
    RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:1 and input b is on cuda:0

Answer 1

PyTorch provides the DataParallel module to run a model on multiple GPUs. Detailed documentation of DataParallel and a toy example can be found in the official PyTorch documentation and tutorials.
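As a rough illustration of how DataParallel is typically used, here is a minimal sketch with a toy network. `TinyNet` and `build_parallel_model` are hypothetical names made up for this example, not part of the question's code, and the sketch falls back to a single device when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a real network; takes a batched tensor."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)


def build_parallel_model():
    model = TinyNet()
    if torch.cuda.device_count() > 1:
        # Replicates the module on every visible GPU and splits each
        # input batch along dimension 0.
        model = nn.DataParallel(model)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return model.to(device).eval(), device


if __name__ == "__main__":
    model, device = build_parallel_model()
    batch = torch.randn(16, 8, device=device)
    with torch.no_grad():
        out = model(batch)
    print(out.shape)  # one output row per input row
```

Note that this toy model takes a single batched tensor. The torchvision detection models take a list of image tensors instead, so, as far as I know, they may not be split across GPUs the same way; that is worth checking before relying on DataParallel for Faster R-CNN.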

1 2019-07-30 05:51:39
