
Why does cv2.dnn work faster when I use CPU rather than GPU?

I am new to OpenCV-CUDA, so I have been testing the simplest case, which is loading a model on the GPU rather than the CPU to see how fast the GPU is, and I am horrified at the result I get.

-----------------------------------------------------------------
---            GPU             vs             CPU             ---
-----------------------------------------------------------------
--- 21.913758993148804 seconds --- 3.0586464405059814 seconds ---
--- 22.379303455352783 seconds --- 3.1384341716766357 seconds ---
--- 21.500431060791016 seconds --- 2.9400241374969482 seconds ---
--- 21.292986392974854 seconds --- 3.3738017082214355 seconds ---
--- 20.88358211517334 seconds  --- 3.388749599456787 seconds  ---

I will give my code snippet in case I may be doing something wrong that causes the GPU time to spike so high.

def loadYolo():
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

    classes = []
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]

    layer_names = net.getLayerNames()
    output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    return net,classes,layer_names,output_layers


@socketio.on('image')
def image(data_image):

    sbuf = StringIO()
    sbuf.write(data_image)
    
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        pass
    else:
        pimg = Image.open(b)
    
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
    
        net,classes,layer_names,output_layers=loadYolo()
        height, width, channels = frame.shape

        
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)

       
        net.setInput(blob)
        outs = net.forward(output_layers)
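        # start_time is assumed to be set earlier in the handler; it is not defined in this snippet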
        print("--- %s seconds ---" % (time.time() - start_time))
        
        
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    
        imgencode = cv2.imencode('.jpg', frame)[1]

        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)

My GPU is an Nvidia 1050 Ti and my CPU is a 9th-gen i5, in case someone needs the specifications. Can someone please enlighten me, as I am super confused right now? Thank you very much.

EDIT 1: I tried to use cv2.dnn.DNN_TARGET_CUDA instead of cv2.dnn.DNN_TARGET_CUDA_FP16, but the time is still terrible compared to the CPU. Below is the GPU result:

--- 10.91195559501648 seconds ---
--- 11.344025135040283 seconds ---
--- 11.754926204681396 seconds ---
--- 12.779674530029297 seconds ---

Below is the CPU result:

--- 4.780993223190308 seconds ---
--- 4.910650253295898 seconds ---
--- 4.990436553955078 seconds ---
--- 5.246175050735474 seconds ---

It is still slower than the CPU.

EDIT 2: OpenCV is 4.5.0, with CUDA 11.1 and cuDNN 8.0.1.

DNN_TARGET_CUDA_FP16 refers to 16-bit floating point. Since your GPU is a 1050 Ti, it does not seem to work well with FP16; you can check FP16 support and your card's compute capability on NVIDIA's site. I think you should change this line:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

into:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
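
If you want to confirm what the CUDA backend actually sees before switching targets, a minimal sketch along these lines can help. This is not from the original answer; it assumes an OpenCV build compiled with the CUDA module and uses only the standard cv2.cuda device-query bindings:

import cv2

# Requires an OpenCV build compiled with CUDA support.
if cv2.cuda.getCudaEnabledDeviceCount() > 0:
    # Prints the device name, compute capability and related details,
    # which you can compare against NVIDIA's FP16 capability tables.
    cv2.cuda.printCudaDeviceInfo(0)
else:
    print("No CUDA-enabled device is visible to OpenCV")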

You should definitely only load YOLO once. Recreating it for every image that comes through the socket is slow for both CPU and GPU, but the GPU takes longer to initialize, which is why you're seeing it run slower than the CPU.
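
To see that initialization cost in isolation, here is a minimal, hypothetical timing sketch (it assumes the same yolov4.weights / yolov4.cfg files as the question and feeds a dummy input): the first forward pass on the CUDA backend pays the one-time setup cost, and only the later passes reflect steady-state GPU speed.

import time

import cv2
import numpy as np

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# getUnconnectedOutLayersNames() sidesteps the index-format differences
# between OpenCV versions.
output_layers = net.getUnconnectedOutLayersNames()

# Dummy 416x416 input, preprocessed the same way as in the question.
dummy = np.zeros((416, 416, 3), dtype=np.uint8)
blob = cv2.dnn.blobFromImage(dummy, 1 / 255.0, (416, 416), swapRB=True, crop=False)

for run in range(3):
    start = time.time()
    net.setInput(blob)
    net.forward(output_layers)
    print("run %d: %.3f seconds" % (run, time.time() - start))
# Expect the first run to be much slower than the following ones.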

I don't understand what you mean by using an LRU cache for your YOLO model. Without seeing the rest of your code structure I can't make any real suggestions, but can you try at least temporarily putting the network into the global scope just to see if it runs faster? (Remove the function altogether and put its body in the global scope.)

Something like this:

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")

net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]


@socketio.on('image')
def image(data_image):

    sbuf = StringIO()
    sbuf.write(data_image)
    
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        pass
    else:
        pimg = Image.open(b)
    
        frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
        frame = resize(frame, width=700)
        frame = cv2.flip(frame, 1)
    
        height, width, channels = frame.shape

        
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)

       
        net.setInput(blob)
        outs = net.forward(output_layers)
        print("--- %s seconds ---" % (time.time() - start_time))
        
        
        class_ids = []
        confidences = []
        boxes = []
        for out in outs:
            for detection in out:
                scores = detection[5:]
                class_id = np.argmax(scores)
                confidence = scores[class_id]
                if confidence > 0.5:
                    # Object detected
                    center_x = int(detection[0] * width)
                    center_y = int(detection[1] * height)
                    w = int(detection[2] * width)
                    h = int(detection[3] * height)

                    # Rectangle coordinates
                    x = int(center_x - w / 2)
                    y = int(center_y - h / 2)

                    boxes.append([x, y, w, h])
                    confidences.append(float(confidence))
                    class_ids.append(class_id)

        indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
        font = cv2.FONT_HERSHEY_PLAIN
        colors = np.random.uniform(0, 255, size=(len(classes), 3))
        for i in range(len(boxes)):
            if i in indexes:
                x, y, w, h = boxes[i]
                label = str(classes[class_ids[i]])
                color = colors[class_ids[i]]
                cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
                cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    
        imgencode = cv2.imencode('.jpg', frame)[1]

        stringData = base64.b64encode(imgencode).decode('utf-8')
        b64_src = 'data:image/jpg;base64,'
        stringData = b64_src + stringData
        emit('response_back', stringData)

From the previous two answers I managed to get the solution. Changing:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)

into:

net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

helped to double the GPU speed, since my GPU type is not compatible with FP16; this is thanks to Amir Karami. And although Ian Chu's answer did not solve my problem by itself, it gave me the basis to force all the images to use only one net instance, which lowered the processing time dramatically, from about 10 seconds per frame to 0.03-0.04 seconds, thus surpassing the CPU speed many times over. The reason I did not accept either answer is that neither really solved my problem on its own, but both became a strong basis for my solution, so I still upvoted them. I just leave my answer here in case anyone encounters this problem like me.
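
For anyone who lands here later, a condensed sketch of the final setup described above (network loaded once at module level, DNN_TARGET_CUDA instead of DNN_TARGET_CUDA_FP16, and only the per-frame forward pass timed) could look like this; the handler body is trimmed to the inference part and the file names are the ones from the question:

import time

import cv2

# Loaded once at import time and reused for every incoming frame.
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)  # plain CUDA; the 1050 Ti is slow at FP16
output_layers = net.getUnconnectedOutLayersNames()

def detect(frame):
    # Per-frame work only: preprocess, run the network, and time the forward pass.
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    start_time = time.time()
    net.setInput(blob)
    outs = net.forward(output_layers)
    print("--- %s seconds ---" % (time.time() - start_time))
    return outs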
