How can I reduce the overhead of buffer creation in OpenCL / Cloo (C#)?

Question

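I am using OpenCL through the C# Cloo interface, and I am running into some very frustrating issues while trying to get it to work well in our product.

Without giving too much away, our product is a computer vision product that receives a 512x424 grid of pixel values from our camera thirty times per second. We want to run computations on those pixels to generate a point cloud relative to certain objects in the scene.

To compute on these pixels, I do the following every time we get a new frame:

1) Create a CommandQueue, 2) create a read-only buffer for the input pixel values, 3) create a zero-copy, write-only buffer for the output point values, 4) pass in the matrices used for the computation on the GPU, 5) execute the kernel and wait for the response.

An example of the per-frame work is: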

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        ComputeCommandQueue commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        Point3D[] realWorldPoints = points.Get(Perspective.RealWorld).Points;
        ComputeBuffer<Point3D> realPointsBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
            realWorldPoints);
        _kernel.SetMemoryArgument(0, realPointsBuffer);

        Point3D[] toPopulate = new Point3D[realWorldPoints.Length];
        PointSet pointSet = points.Get(perspective);

        ComputeBuffer<Point3D> resultBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.UseHostPointer,
            toPopulate);
        _kernel.SetMemoryArgument(1, resultBuffer);
            float[] M = new float[3 * 3];
            ReferenceFrame referenceFrame =
                perspectives.ReferenceFrames[(int)Perspective.Floor];
            AffineTransformation transform = referenceFrame.ToReferenceFrame;
            M[0] = transform.M00;
            M[1] = transform.M01;
            M[2] = transform.M02;
            M[3] = transform.M10;
            M[4] = transform.M11;
            M[5] = transform.M12;
            M[6] = transform.M20;
            M[7] = transform.M21;
            M[8] = transform.M22;

            ComputeBuffer<float> Mbuffer = new ComputeBuffer<float>(_context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
                M);
            _kernel.SetMemoryArgument(2, Mbuffer);

            float[] b = new float[3];
            b[0] = transform.b0;
            b[1] = transform.b1;
            b[2] = transform.b2;

            ComputeBuffer<float> Bbuffer = new ComputeBuffer<float>(_context,
                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer,
                b);
            _kernel.SetMemoryArgument(3, Bbuffer);

            _kernel.SetValueArgument<int>(4, (int)Perspective.Floor);

            //sw.Start();

            commandQueue.Execute(_kernel,
                new long[] { 0 }, new long[] { toPopulate.Length }, null, null);
            IntPtr retPtr = commandQueue.Map(
                resultBuffer,
                true,
                ComputeMemoryMappingFlags.Read,
                0,
                toPopulate.Length, null);

            commandQueue.Unmap(resultBuffer, ref retPtr, null);

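When profiled, this takes WAY too long, and 90% of the time is spent creating all the ComputeBuffer objects and so on. The actual compute time on the GPU is fast.

My question is: how do I fix this? The incoming pixel array is different for every frame, so I have to create a new ComputeBuffer for it. Our matrices also change periodically as we update the scene (again, I can't go into all the details). Is there a way to update these buffers on the GPU? I am using an Intel GPGPU, so I have shared memory and could theoretically do that.

It is frustrating because, time and again, the speed gains I find on the GPU are swamped by the overhead of setting everything up for each frame.

Edit 1:

I don't think my original code example showed well enough what I am really doing, so I created a real-world working example and posted it on GitHub here.

For legacy and time reasons, I cannot change the overarching architecture of our current product. I am trying to "drop in" GPU code in certain slow parts to speed them up. Given the constraints I am seeing, this may not be possible. However, let me explain better what I am doing.

I will give the code here, but I will be referring to the "ComputePoints" function in the "GPUComputePoints" class.

As you can see in my ComputePoints function, a CameraFrame is passed in each time, along with the transformation matrices M and b.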

public static Point3D[] ComputePoints(CameraFrame frame, float[] M, float[] b)

These are new arrays generated by our pipeline, not arrays that I can keep around. So I create a new ComputeBuffer for each:

       ComputeBuffer<ushort> inputBuffer = new ComputeBuffer<ushort>(_context,
          ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
          frame.RawData);
        _kernel.SetMemoryArgument(0, inputBuffer);

        Point3D[] ret = new Point3D[frame.Width * frame.Height]; 
        ComputeBuffer<Point3D> outputBuffer = new ComputeBuffer<Point3D>(_context,
            ComputeMemoryFlags.WriteOnly | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(1, outputBuffer);

        ComputeBuffer<float> mBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            M);
        _kernel.SetMemoryArgument(2, mBuffer);

        ComputeBuffer<float> bBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            b);
         _kernel.SetMemoryArgument(3, bBuffer);

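...and therein, I believe, lies the performance hit. It was mentioned that I could use the map/unmap functionality to get around this. But I don't see how that would help, because I would still need to create the buffers each time to wrap the new arrays coming in, right?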

Answer 1

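"The incoming pixel array is different for every frame, so I have to create a new ComputeBuffer for it."

You can create one large buffer and then use ranges of it for multiple different frames. Then you don't have to re-create (or re-release) buffers on every frame.

"Our matrices also change periodically as we update the scene (again, I can't go into all the details)."

You can release any buffer that has gone unused for N iterations/frames. Whenever an existing buffer is too small, release it and create one twice as large, so it can be reused many times before it has to be released again.

If the number and order of the kernel arguments stay the same (and the same buffer objects are reused), you don't need to set them on every frame.

"Is there a way to update these buffers on the GPU?"

For OpenCL versions <= 1.2 (no shared virtual memory?), using device-side pointers on the host side, or host-side pointers on the device side, is not recommended.

However, it may work if it does not conflict with the video adapter or whatever produces the video frames (possibly with use_host_ptr).

There is no need to re-create the CommandQueue. Create it once and use it for all in-order work.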
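The advice so far (one queue, long-lived buffers, kernel arguments set once) could be sketched roughly as below. This is a minimal sketch, not the asker's actual code: the class name `PersistentPointComputer` and the `Point3D` layout are hypothetical, the kernel argument order is taken from the question's second example, and it assumes Cloo's array-based `WriteToBuffer` and `ReadFromBuffer` overloads on `ComputeCommandQueue`. Whether explicit writes beat zero-copy host pointers on a given Intel GPU would need to be measured.

```csharp
using System;
using Cloo;

// Hypothetical stand-in for the question's Point3D; the real layout must match the kernel's struct.
public struct Point3D { public float X, Y, Z; }

public sealed class PersistentPointComputer : IDisposable
{
    private readonly ComputeCommandQueue _queue;
    private readonly ComputeKernel _kernel;
    private readonly ComputeBuffer<ushort> _input;
    private readonly ComputeBuffer<Point3D> _output;
    private readonly ComputeBuffer<float> _m;
    private readonly ComputeBuffer<float> _b;
    private Point3D[] _result;

    public PersistentPointComputer(ComputeContext context, ComputeKernel kernel, int pixelCount)
    {
        _kernel = kernel;

        // Created once and reused for every frame.
        _queue = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
        _input = new ComputeBuffer<ushort>(context, ComputeMemoryFlags.ReadOnly, pixelCount);
        _output = new ComputeBuffer<Point3D>(context, ComputeMemoryFlags.WriteOnly, pixelCount);
        _m = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, 9);
        _b = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly, 3);
        _result = new Point3D[pixelCount];

        // The buffer objects never change, so the arguments are set only here.
        _kernel.SetMemoryArgument(0, _input);
        _kernel.SetMemoryArgument(1, _output);
        _kernel.SetMemoryArgument(2, _m);
        _kernel.SetMemoryArgument(3, _b);
    }

    // Per frame: copy the new data into the existing buffers, run, read back.
    // The returned array is reused on the next call; copy it if it must outlive the frame.
    public Point3D[] ComputePoints(ushort[] rawData, float[] M, float[] b)
    {
        _queue.WriteToBuffer(rawData, _input, true, null);
        _queue.WriteToBuffer(M, _m, true, null);
        _queue.WriteToBuffer(b, _b, true, null);
        _queue.Execute(_kernel, new long[] { 0 }, new long[] { _result.Length }, null, null);
        _queue.ReadFromBuffer(_output, ref _result, true, null);
        return _result;
    }

    public void Dispose()
    {
        _b.Dispose();
        _m.Dispose();
        _output.Dispose();
        _input.Dispose();
        _queue.Dispose();
    }
}
```

With this shape, the incoming arrays can still be brand new on every frame, as in the question; only their contents are copied into buffers that already exist.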

If you are re-creating all of these because of a software design similar to the following:

 float [] results = test(videoFeedData);

then you could try something like this instead:

float [] results = new float[n];
test(videoFeedData,results);

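That way it does not need to create everything. Instead, it takes the size of the result or input data, creates the OpenCL buffer once, caches it somewhere (such as a map/dictionary), and reuses it whenever an array of similar size comes in.

In practice it works like this: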

new frame feed-0: 1kB data ---> allocate 1kB
feed-1: 10 MB data ---> allocate 10 MB, delete 1kB one
feed-2: 3 MB data ---> re-use 10MB one
feed-3: 2 kB data ---> re-use 10MB 
feed-4: 100 MB data ---> delete 10MB, allocate 100MB
feed-5: 110 MB data ----> delete 100MB, allocate 200MB
feed-6: 120 MB data  ---> re-use 200 MB
feed-7: 150 MB data  ---> re-use 200 MB 
feed-8: 90 MB data  ---> re-use 200 MB

This applies to both input and output data.
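The allocation trace above is consistent with a simple rule: reuse the buffer while it is large enough, otherwise release it and allocate the larger of the requested size and twice the old capacity. A minimal sketch of that rule follows. The class name `GrowOnlyBuffer` and its delegate-based design are assumptions made so the policy can run without an OpenCL device; this is not part of Cloo.

```csharp
using System;

// Hypothetical helper: keeps one buffer alive and only re-creates it when it is too small.
public sealed class GrowOnlyBuffer<T>
{
    private readonly Func<long, T> _allocate; // creates a buffer of the given capacity
    private readonly Action<T> _release;      // frees a buffer that is being replaced

    public long Capacity { get; private set; }
    public int Allocations { get; private set; }
    public T Current { get; private set; }

    public GrowOnlyBuffer(Func<long, T> allocate, Action<T> release)
    {
        _allocate = allocate ?? throw new ArgumentNullException(nameof(allocate));
        _release = release ?? throw new ArgumentNullException(nameof(release));
    }

    // Returns a buffer with at least 'required' elements.
    public T Acquire(long required)
    {
        if (required <= Capacity)
            return Current;                    // big enough: reuse as-is

        if (Allocations > 0)
            _release(Current);                 // too small: drop the old buffer

        // Grow to the request or to double the old size, whichever is larger.
        Capacity = Math.Max(required, Capacity * 2);
        Current = _allocate(Capacity);
        Allocations++;
        return Current;
    }
}
```

With Cloo, the delegates might be something like `n => new ComputeBuffer<ushort>(context, ComputeMemoryFlags.ReadOnly, n)` and `buf => buf.Dispose()`. Note that after a re-allocation the kernel argument would have to be set again, because the buffer object has changed.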

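Apart from the actual cost of re-creation, re-creating many objects also hinders driver optimizations and resets them.

Perhaps something like this: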

 CoresGpu gpu = new CoresGpu(kernelString,options,"gpu");

 for (int i = 0; i < 100; i++)
 {
   float [] results = new float[n];

   // allocates a new buffer only if the existing one is too small; releases an old one only if it is no longer used
   gpu.compute(new object[]{getVideoFeedBuffer(),brush21x21array,results},
             new string[]{"input","input","output"},
             kernelName,numberOfThreads);

   toCloudDb(results.toList());
 }

 gpu.release(); // everything is released here

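If re-creation is a must and there is no way around it, you can even pipeline the work to hide the re-creation latency (though this is still slower than the ideal).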

push data
thread-0:get video feed

push data
thread-0:get next video feed
thread-1:send old video feed to gpu

push data
thread-0:get third video feed
thread-1:send second video feed to gpu
thread-2:compute on gpu

push data
thread-0:get fourth video feed
thread-1:send third video feed to gpu
thread-2:compute second frame on gpu
thread-3:get result of first frame from gpu to RAM

push data
thread-0:get fifth video feed
thread-1:send fourth video feed to gpu
thread-2:compute third frame on gpu
thread-3:get result of second frame from gpu to RAM
pop first data

...
...
pop second data

Continue like this, using something like:

var result=gpu.pipeline.push(videoFeed);
if(result!=null)
{ result has been popped! }

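Part of the re-creation latency is hidden by the compute, copy, video-feed, and pop operations. If re-creation is 90% of the total time, pipelining will hide only 10% of it. If it is 50%, the other 50% hides it.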
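The push/pop behaviour described above could be sketched with plain .NET tasks as follows. This is only an illustration of the idea: `FramePipeline` is a hypothetical name, the per-frame work is passed in as a delegate, and it assumes each in-flight frame uses its own buffers and kernel object, since sharing one kernel across concurrent frames is not safe.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: keeps up to 'depth' frames in flight and pops the oldest result on push.
public sealed class FramePipeline<TIn, TOut>
{
    private readonly Func<TIn, TOut> _stage;
    private readonly int _depth;
    private readonly Queue<Task<TOut>> _inFlight = new Queue<Task<TOut>>();

    public FramePipeline(Func<TIn, TOut> stage, int depth)
    {
        if (depth < 1) throw new ArgumentOutOfRangeException(nameof(depth));
        _stage = stage ?? throw new ArgumentNullException(nameof(stage));
        _depth = depth;
    }

    // Starts work on 'frame'. Returns true and the oldest frame's result once the
    // pipeline is full; returns false while it is still filling up.
    public bool Push(TIn frame, out TOut result)
    {
        _inFlight.Enqueue(Task.Run(() => _stage(frame)));

        if (_inFlight.Count <= _depth)
        {
            result = default(TOut);
            return false;
        }

        // Blocks only if the oldest frame has not finished yet.
        result = _inFlight.Dequeue().Result;
        return true;
    }
}
```

The result for a frame arrives `depth` pushes later, so this trades latency for throughput, which matches the caveat that pipelining hides only part of the setup cost.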

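"5) execute the kernel and wait for the response."

Why wait? Do the frames depend on each other? If not, you can also use multiple pipelines. You can then re-create multiple buffers concurrently, one set per pipeline, which gets more work done but wastes too many cycles. Using one large buffer is likely to be the fastest option.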
