"Zero copy" is slower in my OpenCL/Cloo (C#) program than non-zero copy
This may simply be a case of the memory objects allocated by the .NET framework not being properly page-aligned, but I cannot see why zero-copy is slower for me than non-zero copy.
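On the page-alignment concern, here is a minimal sketch of how a managed array could be checked. It assumes the rule I recall from Intel's zero-copy guidance for integrated graphics (base address on a 4096-byte boundary, size a multiple of 64 bytes); those two constants and the `HostPointerAlignment` class are assumptions for illustration, not something taken from Cloo or from the code in this question.

```csharp
using System;
using System.Runtime.InteropServices;

public static class HostPointerAlignment
{
    // Assumed requirements (from Intel's guidance as I recall it, not from this post):
    // base address on a 4096-byte boundary, byte size a multiple of 64.
    public const long AssumedPageSize = 4096;
    public const long AssumedCacheLineSize = 64;

    public static bool IsAligned(long value, long alignment)
    {
        return value % alignment == 0;
    }

    // Pins the array so its address is stable, then tests the address and the
    // byte size against the assumed rules. The address is only meaningful
    // while the handle is held, so the check happens inside the try block.
    public static bool MeetsAssumedZeroCopyRules(float[] data)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            long address = handle.AddrOfPinnedObject().ToInt64();
            long sizeInBytes = (long)data.Length * sizeof(float);
            return IsAligned(address, AssumedPageSize)
                && IsAligned(sizeInBytes, AssumedCacheLineSize);
        }
        finally
        {
            handle.Free();
        }
    }
}
```

Calling `HostPointerAlignment.MeetsAssumedZeroCopyRules(ret)` on the result array would show whether a plain `new float[n]` happens to satisfy the rule. The .NET allocator makes no promise of page alignment, so a `false` result would not be surprising.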
I'll include code inline in this question, but the complete source can be seen here: https://github.com/kwende/ClooMatrixMultiply/blob/master/GiantMatrixOnGPU/GPUMatrixMultiplier.cs .
Since this is my first attempt at getting zero-copy working, I wrote up a simple matrix multiplication example. I first initialize my OpenCL objects:
// Requires: using System; using System.IO; using System.Linq; using Cloo;
private void Initialize()
{
    // get the Intel integrated GPU platform
    _integratedIntelGPUPlatform = ComputePlatform.Platforms.First(n => n.Name.Contains("Intel"));

    // create the compute context.
    _context = new ComputeContext(
        ComputeDeviceTypes.Gpu, // use the gpu
        new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
        null,
        IntPtr.Zero);

    // the command queue is the, well, queue of commands sent to the "device" (GPU)
    _commandQueue = new ComputeCommandQueue(
        _context, // the compute context
        _context.Devices[0], // first device matching the context specifications
        ComputeCommandQueueFlags.None); // no special flags

    // read the kernel source from disk
    string kernelSource = null;
    using (StreamReader sr = new StreamReader("kernel.cl"))
    {
        kernelSource = sr.ReadToEnd();
    }

    // create the "program"
    _program = new ComputeProgram(_context, new string[] { kernelSource });

    // compile.
    _program.Build(null, null, null, IntPtr.Zero);
    _kernel = _program.CreateKernel("ComputeMatrix");
}
...this is executed only once, the first time one of the multiply methods is called. Then I get into the main body. For non-zero copy, I do the following:
public float[] MultiplyMatrices(float[] matrix1, float[] matrix2,
    int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
{
    if (!_initialized)
    {
        Initialize();
        _initialized = true;
    }

    // inputs: CopyHostPointer asks the runtime to copy the host arrays
    ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix1);
    _kernel.SetMemoryArgument(0, matrix1Buffer);

    ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix2);
    _kernel.SetMemoryArgument(1, matrix2Buffer);

    // output: also copied in this version
    float[] ret = new float[matrix1Height * matrix2Width];
    ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer,
        ret);
    _kernel.SetMemoryArgument(2, retBuffer);

    _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
    _kernel.SetValueArgument<int>(4, matrix2Width);

    // 2D launch; the global size matches the output matrix dimensions
    _commandQueue.Execute(_kernel,
        new long[] { 0, 0 }, // one offset per work dimension
        new long[] { matrix2Width, matrix1Height },
        null, null);

    unsafe
    {
        fixed (float* retPtr = ret)
        {
            // non-blocking read, completed by Finish() while ret is still pinned
            _commandQueue.Read(retBuffer,
                false, 0,
                ret.Length,
                new IntPtr(retPtr),
                null);
            _commandQueue.Finish();
        }
    }

    matrix1Buffer.Dispose();
    matrix2Buffer.Dispose();
    retBuffer.Dispose();
    return ret;
}
You can see how I'm explicitly setting CopyHostPointer for all of my ComputeBuffer allocations. This executes fine.
I then make the following adjustment (which includes setting "UseHostPointer" on the result buffer and calling Map/Unmap instead of Read):
public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
    int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
{
    if (!_initialized)
    {
        Initialize();
        _initialized = true;
    }

    // inputs are still created with CopyHostPointer
    ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix1);
    _kernel.SetMemoryArgument(0, matrix1Buffer);

    ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix2);
    _kernel.SetMemoryArgument(1, matrix2Buffer);

    // output: UseHostPointer asks the runtime to use ret itself as storage
    float[] ret = new float[matrix1Height * matrix2Width];
    ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
        ret);
    _kernel.SetMemoryArgument(2, retBuffer);

    _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
    _kernel.SetValueArgument<int>(4, matrix2Width);

    _commandQueue.Execute(_kernel,
        new long[] { 0, 0 }, // one offset per work dimension
        new long[] { matrix2Width, matrix1Height },
        null, null);

    // non-blocking map followed directly by unmap; both complete at Finish()
    IntPtr retPtr = _commandQueue.Map(
        retBuffer,
        false,
        ComputeMemoryMappingFlags.Read,
        0,
        ret.Length, null);
    _commandQueue.Unmap(retBuffer, ref retPtr, null);
    _commandQueue.Finish();

    matrix1Buffer.Dispose();
    matrix2Buffer.Dispose();
    retBuffer.Dispose();
    return ret;
}
The timing says it all, however. My program prints this:
CPU Matrix multiplication: 1178.5ms
GPU Matrix multiplication (copy): 115.1ms
GPU Matrix multiplication (zero copy): 174.1ms
GPU (w/ copy) is 10.23892x faster.
GPU (zero copy) is 6.769098x faster.
...so zero copy is slower.
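For anyone reproducing the numbers, the speedup figures are just the CPU time divided by each GPU time. The sketch below shows that arithmetic plus a small timing helper. The timing harness is not shown in this post, so `Speedup`, `Ratio`, `TimeMs`, and the warm-up parameter are all hypothetical names. The warm-up is there because both multiply methods call `Initialize()` on first use, which means whichever GPU path runs first would otherwise also pay for context creation and the kernel build.

```csharp
using System;
using System.Diagnostics;

public static class Speedup
{
    // How many times faster the candidate is than the baseline.
    public static double Ratio(double baselineMs, double candidateMs)
    {
        return baselineMs / candidateMs;
    }

    // Runs the work warmupRuns times untimed, then times one more run.
    public static double TimeMs(Action work, int warmupRuns = 1)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            work();
        }

        Stopwatch sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

With the times printed above, `Speedup.Ratio(1178.5, 115.1)` is about 10.24 and `Speedup.Ratio(1178.5, 174.1)` is about 6.77, which matches the program output.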
Thanks to huseyin tugrul buyukisik I was able to figure out what was going on.
I needed to update my Intel drivers. Once I did, zero-copy was much, much faster.
For the sake of posterity, here is the final version of the zero-copy code:
// Requires: using System.Runtime.InteropServices; (for GCHandle)
public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
    int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
{
    if (!_initialized)
    {
        Initialize();
        _initialized = true;
    }

    ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix1);
    _kernel.SetMemoryArgument(0, matrix1Buffer);

    ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
        matrix2);
    _kernel.SetMemoryArgument(1, matrix2Buffer);

    // pin ret so the GC cannot move it while it backs the OpenCL buffer
    float[] ret = new float[matrix1Height * matrix2Width];
    GCHandle handle = GCHandle.Alloc(ret, GCHandleType.Pinned);
    ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
        ComputeMemoryFlags.UseHostPointer,
        ret);
    _kernel.SetMemoryArgument(2, retBuffer);

    _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
    _kernel.SetValueArgument<int>(4, matrix2Width);

    _commandQueue.Execute(_kernel,
        new long[] { 0, 0 }, // one offset per work dimension
        new long[] { matrix2Width, matrix1Height },
        null, null);

    // blocking map: returns once the kernel has finished and ret is readable
    IntPtr retPtr = _commandQueue.Map(
        retBuffer,
        true,
        ComputeMemoryMappingFlags.Read,
        0,
        ret.Length, null);
    _commandQueue.Unmap(retBuffer, ref retPtr, null);
    // left disabled: the blocking Map above already waits for the kernel
    //_commandQueue.Finish();

    matrix1Buffer.Dispose();
    matrix2Buffer.Dispose();
    retBuffer.Dispose();
    handle.Free();
    return ret;
}
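One thing the final version leaves open: if anything throws between `GCHandle.Alloc` and `handle.Free()`, the array stays pinned. A minimal sketch of one way to tie the pin to a `using` block follows. `PinnedArray<T>` is a hypothetical helper written for this post, not part of Cloo or the .NET framework, and it assumes `T` is a blittable type such as `float`.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: keeps an array pinned until Dispose() is called.
public sealed class PinnedArray<T> : IDisposable where T : struct
{
    private readonly T[] _data;
    private GCHandle _handle;

    public PinnedArray(T[] data)
    {
        _data = data;
        // Throws ArgumentException if T is not blittable.
        _handle = GCHandle.Alloc(data, GCHandleType.Pinned);
    }

    public T[] Data
    {
        get { return _data; }
    }

    // Stable address of the first element; valid only until Dispose().
    public IntPtr Address
    {
        get { return _handle.AddrOfPinnedObject(); }
    }

    public void Dispose()
    {
        // Safe to call more than once.
        if (_handle.IsAllocated)
        {
            _handle.Free();
        }
    }
}
```

In the method above, the buffer creation, kernel launch, Map/Unmap, and buffer disposal would then sit inside `using (var pinned = new PinnedArray<float>(ret)) { ... }`, so the handle is released even if an OpenCL call throws.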