
Why is data exchange between CPU and GPU memory so slow?

This is my first time using OpenCL on ARM (CPU: Qualcomm Snapdragon MSM8930, GPU: Adreno(TM) 305).

I find OpenCL really effective, but exchanging data between the CPU and the GPU takes far more time than I would have imagined.

Here is an example:

cv::Mat mat(640,480,CV_8UC3,cv::Scalar(0,0,0));
cv::ocl::oclMat mat_ocl;

//cpu->gpu
mat_ocl.upload(mat);
//gpu->cpu
mat = (cv::Mat)mat_ocl;

Even for a small image like this, the upload operation takes 10 ms and the download operation takes 20 ms! That is far too long.

Can anyone tell me whether this is normal, or is something going wrong here?

Thank you in advance!

Added:

My measuring method is:

clock_t start,end;
start=clock();
mat_ocl.upload(mat);
end = clock();
__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);

Actually, I'm not using OpenCL directly but the ocl module in OpenCV (although it is said to be equivalent). The OpenCV documentation only tells us to convert cv::Mat to cv::ocl::oclMat (which uploads the data from the CPU to the GPU) in order to do GPU computation; I haven't found any memory-mapping method in the ocl module documentation.

Well, I found a useful explanation in the OpenCV docs:

In a heterogeneous device environment, there may be cost associated with data transfer. This would be the case, for example, when data needs to be moved from host memory (accessible to the CPU), to device memory (accessible to a discrete GPU). in the case of integrated graphics chips, there may be performance issues, relating to memory coherency between access from the GPU “part” of the integrated device, or the CPU “part.” For best performance, in either case, it is recommended that you do not introduce data transfers between CPU and the discrete GPU, except in the beginning and the end of the algorithmic pipeline.

So this seems to explain why data transfer between the CPU and the GPU is so slow. But I still don't know how to fix the issue.

Please provide your exact measuring method and results.

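From my experience with OpenCL development on ARM platforms (not Qualcomm, though), you shouldn't expect much from read/write operations. The memory bus is usually only about 64 bits wide, and DDR3 isn't that fast.

Use shared memory to your advantage: go for mapping/unmapping (clEnqueueMapBuffer / clEnqueueUnmapMemObject) instead of read/write (clEnqueueReadBuffer / clEnqueueWriteBuffer). On an SoC the CPU and GPU share the same physical RAM, so mapping a buffer can avoid the extra copy.

PS: the actual operation time can be measured using cl_event profiling: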

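// Returns the device-side execution time of the command behind `event`, in nanoseconds.
// The command queue must have been created with CL_QUEUE_PROFILING_ENABLE.
cl_ulong getTimeNanoSeconds(cl_event event)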
{
    cl_ulong start = 0, end = 0;

    cl_int ret = clWaitForEvents(1, &event);
    if (ret != CL_SUCCESS)
        throw(ret);

    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_START,
              sizeof(cl_ulong),
              &start,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_END,
              sizeof(cl_ulong),
              &end,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    return (end - start);
}
