"Zero-copy" is slower than non-zero-copy in my OpenCL / Cloo (C#) program

Question

This may simply be because the memory objects allocated by the .NET Framework are not properly page-aligned, but I cannot see why zero-copy is slower for me than non-zero-copy.

I'll include code inline in this question, but the complete source can be seen here: https://github.com/kwende/ClooMatrixMultiply/blob/master/GiantMatrixOnGPU/GPUMatrixMultiplier.cs

Since this is my first attempt at getting zero-copy working, I wrote a simple matrix multiplication example. I first initialize my OpenCL objects:

    private void Initialize()
    {
        // get the intel integrated GPU
        _integratedIntelGPUPlatform = ComputePlatform.Platforms.Where(n => n.Name.Contains("Intel")).First();

        // create the compute context. 
        _context = new ComputeContext(
            ComputeDeviceTypes.Gpu, // use the gpu
            new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
            null,
            IntPtr.Zero);

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        _commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        string kernelSource = null;
        using (StreamReader sr = new StreamReader("kernel.cl"))
        {
            kernelSource = sr.ReadToEnd();
        }

        // create the "program"
        _program = new ComputeProgram(_context, new string[] { kernelSource });

        // compile. 
        _program.Build(null, null, null, IntPtr.Zero);
        _kernel = _program.CreateKernel("ComputeMatrix");
    }

...this is executed only once, the first time one of the multiply methods is called. Then I get into the main body. For the non-zero-copy version, I do the following:

    public float[] MultiplyMatrices(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

        unsafe
        {
            fixed (float* retPtr = ret)
            {
                _commandQueue.Read(retBuffer,
                    false, 0,
                    ret.Length,
                    new IntPtr(retPtr),
                    null);

                _commandQueue.Finish();
            }
        }

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();

        return ret;
    }

You can see that I explicitly set CopyHostPointer for all of my ComputeBuffer allocations. This executes fine.

I then make the following adjustment (which includes setting "UseHostPointer" on the result buffer and calling Map/Unmap instead of Read):

    public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

        IntPtr retPtr = _commandQueue.Map(
            retBuffer,
            false,
            ComputeMemoryMappingFlags.Read,
            0,
            ret.Length, null);

        _commandQueue.Unmap(retBuffer, ref retPtr, null);
        _commandQueue.Finish();

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();

        return ret;
    }

The timing says it all, however. My program prints this:

CPU Matrix multiplication: 1178.5ms

GPU Matrix multiplication (copy): 115.1ms

GPU Matrix multiplication (zero copy): 174.1ms

GPU (w/ copy) is 10.23892x faster.

GPU (zero copy) is 6.769098x faster.

...so zero-copy is slower.
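One caveat when reading numbers like these: both methods above call Initialize() on first use, so whichever one runs first also pays for context creation and kernel compilation. The following is a minimal sketch of a timing helper that discards one warm-up call before measuring. The names TimingSketch and AverageMilliseconds are hypothetical and are not part of the linked repository, and the Thread.Sleep workload is only a stand-in.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class TimingSketch
{
    // Hypothetical helper: calls the action once untimed (warm-up),
    // then returns the average wall-clock time of `runs` further calls in ms.
    public static double AverageMilliseconds(Action action, int runs)
    {
        action(); // warm-up call absorbs one-time costs such as Initialize()

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / runs;
    }

    static void Main()
    {
        // Stand-in workload; in the real program this would be a call to
        // MultiplyMatrices or MultiplyMatricesZeroCopy.
        double ms = AverageMilliseconds(() => Thread.Sleep(5), 3);
        Console.WriteLine("average: " + ms + " ms");
    }
}
```

Measuring each variant this way keeps the one-time setup cost out of the copy versus zero-copy comparison.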

Answer 1

Thanks to huseyin tugrul buyukisik, I was able to figure out what was going on.

I needed to update my Intel drivers. Once I did that, zero-copy was much, much faster.

For the sake of posterity, here is the final version of the zero-copy code:

    public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

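        float[] ret = new float[matrix1Height * matrix2Width];
        // Pin the result array so the garbage collector cannot move it
        // while the OpenCL runtime is using its memory directly.
        GCHandle handle = GCHandle.Alloc(ret, GCHandleType.Pinned);
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.UseHostPointer,
            ret);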
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

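        // Blocking map (true): returns only once the kernel queued before it
        // has finished and the results are visible to the host.
        IntPtr retPtr = _commandQueue.Map(
            retBuffer,
            true,
            ComputeMemoryMappingFlags.Read,
            0,
            ret.Length, null);

        _commandQueue.Unmap(retBuffer, ref retPtr, null);
        //_commandQueue.Finish();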

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();
        handle.Free(); 

        return ret;
    }
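The page-alignment concern raised in the question can be checked directly on a pinned array. Below is a minimal sketch, assuming the commonly cited Intel guidance for zero-copy buffers on its integrated GPUs (host pointer aligned to 4096 bytes, total size a multiple of 64 bytes); those two thresholds are assumptions to verify against your driver's documentation, and the names ZeroCopyAlignment and IsZeroCopyFriendly are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

static class ZeroCopyAlignment
{
    // Assumed thresholds (see lead-in); not taken from the original post.
    const long PageSize = 4096;
    const long CacheLineSize = 64;

    // True when the address is page-aligned and the byte length
    // is a whole number of cache lines.
    public static bool IsZeroCopyFriendly(long address, long byteLength)
    {
        return address % PageSize == 0 && byteLength % CacheLineSize == 0;
    }

    static void Main()
    {
        float[] ret = new float[1024 * 1024];

        // Pin the array so its address is stable while we inspect it.
        GCHandle handle = GCHandle.Alloc(ret, GCHandleType.Pinned);
        try
        {
            long address = handle.AddrOfPinnedObject().ToInt64();
            long byteLength = (long)ret.Length * sizeof(float);

            Console.WriteLine("address % 4096 = " + (address % PageSize));
            Console.WriteLine("zero-copy friendly: "
                + IsZeroCopyFriendly(address, byteLength));
        }
        finally
        {
            // Always release the pin, even if something above throws.
            handle.Free();
        }
    }
}
```

If the check fails for a managed array, the runtime may still accept UseHostPointer but fall back to copying internally, which would be one possible explanation for zero-copy not being faster. The try/finally around the pinned handle is also a pattern worth applying to the method above, so the handle is freed if the kernel launch throws.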
