
Memory copy is taking more time on GPU compared to CPU

I have source and destination pointers for an image that I want to copy. When I run the copy on the CPU, it takes 2 ms. I then ran the copy with OpenCL, creating the buffers like this:

clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,src_ptr,errcode_ret)
clCreateBuffer(context,CL_MEM_USE_HOST_PTR|CL_MEM_READ_WRITE,size,dst_ptr,errcode_ret)

and wrote a kernel with a global work size of (W, H), so each work-item copies one pixel. It takes about 20 ms.

Can someone please help me: how do I copy memory efficiently in OpenCL when we have image pointers to global memory? What is the proper work-group size to use for this?

Can you help clarify what you're trying to accomplish? Are you trying to compare the time it takes to memcpy a host buffer to the time it takes to copy a device buffer using a GPU kernel?

If so, try allocating the buffer without the CL_MEM_USE_HOST_PTR flag. From the first response here it seems like some implementations map that buffer to system memory instead of device memory, which could slow down the copy kernel.

how to efficiently do memory copy on open cl when we have image pointers to global memory

The efficient way is to use memcpy() on the host pointers. In other words, use the CPU.
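As a rough sketch of what that host-side copy could look like in C: the function names, the `bytes_per_pixel` parameter, and the stride parameters below are illustrative assumptions, not something taken from the question. The second function only matters if the image rows are padded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a tightly packed image (no row padding) in a single memcpy call.
 * width, height and bytes_per_pixel are illustrative parameters. */
void copy_image_packed(uint8_t *dst, const uint8_t *src,
                       size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, width * height * bytes_per_pixel);
}

/* Copy row by row when source and destination rows use different
 * strides (in bytes). Only row_bytes of pixel data are copied per row;
 * padding bytes in the destination are left untouched. */
void copy_image_strided(uint8_t *dst, size_t dst_stride,
                        const uint8_t *src, size_t src_stride,
                        size_t row_bytes, size_t height)
{
    assert(dst != NULL && src != NULL);
    assert(row_bytes <= dst_stride && row_bytes <= src_stride);
    for (size_t y = 0; y < height; ++y) {
        memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
    }
}
```

For a packed image, a call such as `copy_image_packed(dst_ptr, src_ptr, W, H, 4)` would cover the whole frame, assuming 4 bytes per pixel.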

when we use CL_MEM_USE_HOST_PTR, GPU can access the image directly from global memory instead of copying from global memory

That's not strictly true. It's true for integrated GPUs (if the host_ptr pointer is properly aligned). Discrete GPUs will still copy host memory to their own memory over the PCI Express bus. If you read the documentation for clCreateBuffer, it says:

CL_MEM_USE_HOST_PTR ... OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.

Discrete GPUs cannot directly "work" on host memory. Even if they could, it would be so slow as to be pointless.

In fact using CL_MEM_USE_HOST_PTR with a discrete GPU may result in worse performance, because the GPU will have to keep the host copy in sync with its own copy, which will result in a lot of PCIe transfers. CL_MEM_USE_HOST_PTR only makes sense with integrated GPUs to save unnecessary transfers and memory copies.
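On the "properly aligned" point for integrated GPUs, here is a minimal sketch of allocating a host buffer that could then be passed as host_ptr. The 4096-byte alignment is an assumption: the required value is implementation-specific and is not given in this thread, so check your vendor's documentation. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* ASSUMPTION: page-sized (4096-byte) alignment. The alignment a given
 * OpenCL implementation needs for zero-copy is vendor-specific. */
#define ZERO_COPY_ALIGNMENT 4096u

/* Round size up to the next multiple of ZERO_COPY_ALIGNMENT, because
 * C11 aligned_alloc expects the size to be a multiple of the alignment. */
size_t round_up_to_alignment(size_t size)
{
    return (size + ZERO_COPY_ALIGNMENT - 1) / ZERO_COPY_ALIGNMENT
           * ZERO_COPY_ALIGNMENT;
}

/* Allocate an aligned host image buffer. Returns NULL on failure.
 * Release the buffer with free(). */
void *alloc_host_image(size_t size)
{
    assert(size > 0);
    return aligned_alloc(ZERO_COPY_ALIGNMENT, round_up_to_alignment(size));
}
```

Whether the implementation then actually avoids a copy is still up to the driver; the aligned allocation only removes one common reason for it to fall back to copying.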

Generally the way you work with GPUs is to minimize memory transfers: you create the buffers once (with clCreateBuffer), launch the kernels you need on them, and then either transfer the result back to the host (via clEnqueueReadBuffer, since these are buffer objects rather than image objects) or display it with OpenGL interop. You'll have to clarify what you're doing if you want more useful advice.
