
CUDA Zero Copy vs. CudaMemcpy on Jetson TK1

My Question: I am looking for someone to either point out a mistake in the way I am attempting to implement zero-copy in CUDA, or reveal a more 'behind the scenes' perspective on why the zero-copy method would not be faster than the memcpy method. By the way, I am performing my tests on NVIDIA's TK1 processor, running Ubuntu.

My problem has to do with efficiently using the NVIDIA TK1's (physically) unified memory architecture with CUDA. There are two methods NVIDIA provides for GPU/CPU memory transfer abstraction (a short sketch contrasting them follows the list):

  1. Unified Memory abstraction (using cudaHostAlloc & cudaHostGetDevicePointer)
  2. Explicit copies to the device and back from the device (using cudaMalloc() & cudaMemcpy)
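
For reference, a minimal sketch of what each model looks like for a hypothetical buffer of n floats (h_buf, d_buf, h_src and d_src are just illustrative names, not from my test code):

size_t n = 640 * 480;

//Method 1: zero-copy / mapped host memory. The device pointer aliases host RAM,
//so the kernel reads and writes system memory directly, with no cudaMemcpy.
float *h_buf = NULL, *d_buf = NULL;
cudaSetDeviceFlags(cudaDeviceMapHost);              //must be set before the CUDA context is created
cudaHostAlloc((void **)&h_buf, n*sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&d_buf, (void *)h_buf, 0);
//kernel<<<grid, block>>>(d_buf, ...);              //works on host memory through d_buf

//Method 2: explicit copies. A separate device allocation plus a cudaMemcpy each way.
float *h_src = (float *)malloc(n*sizeof(float));
float *d_src = NULL;
cudaMalloc((void **)&d_src, n*sizeof(float));
cudaMemcpy(d_src, h_src, n*sizeof(float), cudaMemcpyHostToDevice);
//kernel<<<grid, block>>>(d_src, ...);
cudaMemcpy(h_src, d_src, n*sizeof(float), cudaMemcpyDeviceToHost);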

Short description of my test code: I test the same CUDA kernel using both methods 1 and 2. I expected method 1 to be faster, given that there is no copy of the source data to the device and no copy of the result data back from the device. However, the result is the opposite of my assumption (method 1 is 50% slower). Below is my code for this test:

#include <libfreenect/libfreenect.hpp>
#include <iostream>
#include <vector>
#include <cmath>
#include <pthread.h>
#include <cxcore.h>
#include <time.h>
#include <sys/time.h>
#include <memory.h>
///CUDA///
#include <cuda.h>
#include <cuda_runtime.h>

 ///OpenCV 2.4
#include <highgui.h>
#include <cv.h>
#include <opencv2/gpu/gpu.hpp>

using namespace cv;
using namespace std;

///The Test Kernel///
__global__ void cudaCalcXYZ( float *dst, float *src, float *M, int height, int width, float scaleFactor, int minDistance)
{
    float nx,ny,nz, nzpminD, jFactor;
    int heightCenter = height / 2;
    int widthCenter = width / 2;
    //int j = blockIdx.x;   //Represents which row we are in
    int index = blockIdx.x*width;
    jFactor = (blockIdx.x - heightCenter)*scaleFactor;
    for(int i= 0; i < width; i++)
    {
        nz = src[index];
        nzpminD = nz + minDistance;
        nx = (i - widthCenter )*(nzpminD)*scaleFactor;      
        ny = (jFactor)*(nzpminD);   
        //Solve for only Y matrix (height values)           
         dst[index++] = nx*M[4] + ny*M[5] + nz*M[6];
        //dst[index++] = 1 + 2 + 3;
    }
}

//Function fwd declarations
double getMillis();
double getMicros();
void runCudaTestZeroCopy(int iter, int cols, int rows);
void runCudaTestDeviceCopy(int iter, int cols, int rows);

int main(int argc, char **argv) {

    //ZERO COPY FLAG (allows runCudaTestZeroCopy to run without fail)
    cudaSetDeviceFlags(cudaDeviceMapHost);

    //Runs kernel using explicit data copy to 'device' and back from 'device'
    runCudaTestDeviceCopy(20, 640,480);
    //Uses 'unified memory' cuda abstraction so device can directly work from host data
    runCudaTestZeroCopy(20,640, 480);

    std::cout << "Stopping test" << std::endl;

    return 0;
}

void runCudaTestZeroCopy(int iter, int cols, int rows)
{
    cout << "CUDA Test::ZEROCOPY" << endl;
        int src_rows = rows;
        int src_cols = cols;
        int m_rows = 4;
        int m_cols = 4;
        int dst_rows = src_rows;
        int dst_cols = src_cols;
        //Create and allocate memory for host mats pointers
        float *psrcMat;
        float *pmMat;
        float *pdstMat;
        cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
        //Create mats using host pointers
        Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
        Mat m_mat   = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
        Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);

        //configure src and m mats
        for(int i = 0; i < src_rows*src_cols; i++)
        {
            psrcMat[i] = (float)i;
        }
        for(int i = 0; i < m_rows*m_cols; i++)
        {
            pmMat[i] = 0.1234;
        }
        //Create pointers to dev mats
        float *d_psrcMat;
        float *d_pmMat;
        float *d_pdstMat;
        //Map device to host pointers
        cudaHostGetDevicePointer((void **)&d_psrcMat, (void *)psrcMat, 0);
        //cudaHostGetDevicePointer((void **)&d_pmMat, (void *)pmMat, 0);
        cudaHostGetDevicePointer((void **)&d_pdstMat, (void *)pdstMat, 0);
        //Copy matrix M to device
        cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
        cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);

        //Additional Variables for kernels
        float scaleFactor = 0.0021;
        int minDistance = -10;

        //Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
        int blocks = src_rows;
        const int numTests = iter;
        double perfStart = getMillis();

        for(int i = 0; i < numTests; i++)
        {           
            //cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
            cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
            cudaDeviceSynchronize();
        }
        double perfStop = getMillis();
        double perfDelta = perfStop - perfStart;
        cout << "Ran " << numTests << " iterations totaling " << perfDelta << "ms" << endl;
        cout << " Average time per iteration: " << (perfDelta/(float)numTests) << "ms" << endl;

        //Copy result back to host
        //cudaMemcpy(pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
        //cout << "Printing results" << endl;
        //for(int i = 0; i < 16*16; i++)
        //{
        //  cout << "src[" << i << "]= " << psrcMat[i] << " dst[" << i << "]= " << pdstMat[i] << endl;
        //}

        //d_psrcMat and d_pdstMat are device aliases of the mapped host buffers,
        //so only the cudaMalloc'd matrix needs cudaFree; the host buffers are
        //released below with cudaFreeHost
        cudaFree(d_pmMat);
        cudaFreeHost(psrcMat);
        cudaFreeHost(pmMat);
        cudaFreeHost(pdstMat);
}

void runCudaTestDeviceCopy(int iter, int cols, int rows)
{
        cout << "CUDA Test::DEVICE COPY" << endl;
        int src_rows = rows;
        int src_cols = cols;
        int m_rows = 4;
        int m_cols = 4;
        int dst_rows = src_rows;
        int dst_cols = src_cols;
        //Create and allocate memory for host mats pointers
        float *psrcMat;
        float *pmMat;
        float *pdstMat;
        cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
        cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
        //Create pointers to dev mats
        float *d_psrcMat;
        float *d_pmMat;
        float *d_pdstMat;
        cudaMalloc( (void **)&d_psrcMat, sizeof(float)*src_rows*src_cols ); 
        cudaMalloc( (void **)&d_pdstMat, sizeof(float)*src_rows*src_cols );
        cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
        //Create mats using host pointers
        Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
        Mat m_mat   = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
        Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);

        //configure src and m mats
        for(int i = 0; i < src_rows*src_cols; i++)
        {
            psrcMat[i] = (float)i;
        }
        for(int i = 0; i < m_rows*m_cols; i++)
        {
            pmMat[i] = 0.1234;
        }

        //Additional Variables for kernels
        float scaleFactor = 0.0021;
        int minDistance = -10;

        //Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
        int blocks = src_rows;

        double perfStart = getMillis();
        for(int i = 0; i < iter; i++)
        {           
            //Copy from host to device
            cudaMemcpy( d_psrcMat, psrcMat, sizeof(float)*src_rows*src_cols, cudaMemcpyHostToDevice);
            cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);
            //Run Kernel
            //cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
            cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
            //Copy from device to host
            cudaMemcpy( pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
        }
        double perfStop = getMillis();
        double perfDelta = perfStop - perfStart;
        cout << "Ran " << iter << " iterations totaling " << perfDelta << "ms" << endl;
        cout << " Average time per iteration: " << (perfDelta/(float)iter) << "ms" << endl;

        cudaFree(d_psrcMat);
        cudaFree(d_pmMat);
        cudaFree(d_pdstMat);
        cudaFreeHost(psrcMat);
        cudaFreeHost(pmMat);
        cudaFreeHost(pdstMat);
}

//Timing functions for performance measurements
double getMicros()
{
    timespec ts;
    //double t_ns, t_s;
    long t_ns;
    double t_s;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    t_s = (double)ts.tv_sec;
    t_ns = ts.tv_nsec;
    //return( (t_s *1000.0 * 1000.0) + (double)(t_ns / 1000.0) );
    return ((double)t_ns / 1000.0);
}

double getMillis()
{
    timespec ts;
    double t_ns, t_s;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    t_s = (double)ts.tv_sec;
    t_ns = (double)ts.tv_nsec;
    return( (t_s * 1000.0) + (t_ns / 1000000.0) );
}

I have already seen the post Cuda zero-copy performance, but I feel it is not relevant here for the following reason: the GPU and CPU have a physically unified memory architecture.

Thanks

When you are using zero-copy, a read from memory goes through a path where it queries the memory unit to fetch data from system memory. This operation has some latency.

When using direct access to memory, the memory unit gathers data from global memory, and has a different access pattern and latency.

Actually seeing this difference would require some form of profiling.
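
Short of a full profiler run, CUDA events are one lightweight way to isolate the GPU-side time of the launches; a minimal sketch, reusing the variable names from the question:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < numTests; i++)
{
    cudaCalcXYZ<<<blocks, 1>>>(d_pdstMat, d_psrcMat, d_pmMat,
                               src_rows, src_cols, scaleFactor, minDistance);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                     //wait for the recorded work to finish

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);  //milliseconds between the two events
cudaEventDestroy(start);
cudaEventDestroy(stop);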

Nonetheless, your call to the __global__ function launches only a single thread per block:

cudaCalcXYZ<<< blocks,1 >>> (...

In this case, the GPU has little way to hide latency when memory is gathered from system memory (or global memory). I would recommend you use more threads (some multiple of 64, at least 128 in total), and run the profiler on it to get the cost of memory access. Your algorithm seems separable, so modifying the code from

for(int i= 0; i < width; i++)

to

for (int i = threadIdx.x ; i < width ; i += blockDim.x)

will probably increase performance overall. The image width is 640, which turns into 5 iterations of 128 threads:

cudaCalcXYZ<<< blocks,128 >>> (...

I believe it would result in some performance increase. A complete sketch of the modified kernel is given below.
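
A sketch of the whole kernel and launch with that change, one block per row and 128 threads per block (untested, adapted from the cudaCalcXYZ in the question):

__global__ void cudaCalcXYZ_threaded(float *dst, float *src, float *M, int height, int width,
                                     float scaleFactor, int minDistance)
{
    int heightCenter = height / 2;
    int widthCenter  = width / 2;
    int rowStart = blockIdx.x * width;                          //one block per row, as before
    float jFactor = (blockIdx.x - heightCenter) * scaleFactor;

    //The threads of a block cover the row cooperatively: thread t handles
    //columns t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int i = threadIdx.x; i < width; i += blockDim.x)
    {
        float nz = src[rowStart + i];
        float nzpminD = nz + minDistance;
        float nx = (i - widthCenter) * nzpminD * scaleFactor;
        float ny = jFactor * nzpminD;
        dst[rowStart + i] = nx*M[4] + ny*M[5] + nz*M[6];
    }
}

//Launched with one block per row and 128 threads per block:
//cudaCalcXYZ_threaded<<<src_rows, 128>>>(d_pdstMat, d_psrcMat, d_pmMat,
//                                        src_rows, src_cols, scaleFactor, minDistance);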

The zero-copy feature lets the device work on data without manually copying it to device memory with a function like cudaMemcpy. Zero-copy memory only passes a host address to the device, which the kernel then reads and writes. So the more thread blocks you launch for the kernel, the more data is read and written in the kernel at once, and the more host addresses are being serviced concurrently. As a result, you get a better performance gain than when you launch only a few thread blocks.
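
For example, a per-element launch (a rough sketch adapted from the question's kernel, untested) gives the GPU many more blocks and threads to schedule over the same mapped buffers:

__global__ void cudaCalcXYZ_perPixel(float *dst, float *src, float *M, int height, int width,
                                     float scaleFactor, int minDistance)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;    //one thread per output element
    if (idx >= width * height) return;

    int row = idx / width;
    int col = idx % width;
    float jFactor = (row - height / 2) * scaleFactor;

    float nz = src[idx];
    float nzpminD = nz + minDistance;
    float nx = (col - width / 2) * nzpminD * scaleFactor;
    float ny = jFactor * nzpminD;
    dst[idx] = nx*M[4] + ny*M[5] + nz*M[6];
}

//int threads = 256;
//int blocks  = (src_rows * src_cols + threads - 1) / threads;   //1200 blocks for 640x480
//cudaCalcXYZ_perPixel<<<blocks, threads>>>(d_pdstMat, d_psrcMat, d_pmMat,
//                                          src_rows, src_cols, scaleFactor, minDistance);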
