How to speed up Metal code for iOS / Mac OS

Question

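I am trying to implement code in Metal that performs a 1D convolution between two vectors of given lengths. I have implemented the following, which works correctly: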

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}

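My problem is that processing the same data with Metal (computation plus data transfer to and from the GPU) takes about 10 times longer than doing it in Swift on the CPU. My question is: how can I replace the inner loop with a single vector operation, or is there another way to speed up the code above?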

Answer 1

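The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we invoke it once for each element of the data vector. The kernel function simplifies to this: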

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}

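To dispatch this work, we choose a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure the input and output buffers have enough padding that we can slightly overrun the actual size of the data. This does waste a small amount of memory and computation, but it saves us the complexity of a separate dispatch just to compute the convolution for the elements at the end of the buffer.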

// We should ensure here that the data buffer and output buffer are padded so that every launched thread
// stays in bounds: the output buffer needs at least as many elements as the total thread count (iterationCount
// rounded up to a multiple of the threadgroup width), and the data buffer needs (filterCount - 1) more than that.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

let commandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, at: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, at: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, at: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, at: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, at: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
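
The padding mentioned above can be worked out ahead of allocation. The following is only a sketch: `roundedUp` and `paddedCounts` are hypothetical helper names, and `executionWidth` stands in for `computePipeline.threadExecutionWidth`. It assumes the dispatch launches whole threadgroups, so the last thread reads up to index `threadCount - 1 + filterCount - 1` of the data buffer.

```swift
/// Rounds `count` up to the nearest multiple of `multiple` (assumed > 0).
func roundedUp(_ count: Int, toMultipleOf multiple: Int) -> Int {
    return ((count + multiple - 1) / multiple) * multiple
}

/// Element counts to allocate so that every launched thread stays in bounds.
/// Hypothetical helper; `executionWidth` plays the role of threadExecutionWidth.
func paddedCounts(dataCount: Int, filterCount: Int, executionWidth: Int) -> (data: Int, output: Int) {
    precondition(filterCount > 0 && dataCount >= filterCount && executionWidth > 0)
    let iterationCount = dataCount - filterCount + 1
    // Same threadgroup width as in the dispatch code above.
    let width = min(iterationCount, executionWidth)
    // Whole threadgroups are launched, so the thread count is rounded up.
    let threadCount = roundedUp(iterationCount, toMultipleOf: width)
    // Each thread reads filterCount consecutive elements starting at its id.
    return (data: threadCount + filterCount - 1, output: threadCount)
}

let counts = paddedCounts(dataCount: 1000, filterCount: 5, executionWidth: 32)
// 996 outputs round up to 1024 threads: counts.data == 1028, counts.output == 1024
```

Multiplying these counts by `MemoryLayout<Float>.stride` gives the byte lengths to request when creating the buffers.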

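In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.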
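
For that kind of comparison, a plain CPU reference is handy both as a timing baseline and as a check on the GPU output. This is a minimal sketch, not the asker's actual Swift code; `convolveCPU` is a hypothetical name, and it computes the same sliding dot product as the kernels above.

```swift
/// Valid-mode 1D convolution as written in the kernels above
/// (a sliding dot product; the filter is not reversed).
func convolveCPU(_ data: [Float], _ filter: [Float]) -> [Float] {
    guard !filter.isEmpty, data.count >= filter.count else { return [] }
    let outputCount = data.count - filter.count + 1
    var output = [Float](repeating: 0, count: outputCount)
    for i in 0..<outputCount {
        var sum: Float = 0
        for j in 0..<filter.count {
            sum += data[i + j] * filter[j]
        }
        output[i] = sum
    }
    return output
}

let reference = convolveCPU([1, 2, 3, 4, 5], [1, 0, -1])
print(reference)  // [-2.0, -2.0, -2.0]
```

When comparing against the GPU result, only the first `dataCount - filterCount + 1` elements of the output buffer are meaningful, and a small tolerance is advisable because the two sides may round differently.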

Answer 2

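The following code shows how to encode rendering commands in parallel on the GPU using the Objective-C Metal API (the threading code above merely divides the rendering of the output into grid sections for parallel processing; the calculations are still not executed in parallel). This is what you alluded to in your question, even though it is not quite what you wanted. I have provided this answer to help anyone who might stumble on this question thinking it will provide an answer related to parallel rendering (when, in fact, it does not):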

    - (void)drawInMTKView:(MTKView *)view
    {
        dispatch_async(((AppDelegate *)UIApplication.sharedApplication.delegate).cameraViewQueue, ^{
                    id <CAMetalDrawable> drawable = [view currentDrawable]; //[(CAMetalLayer *)view.layer nextDrawable];
                    MTLRenderPassDescriptor *renderPassDesc = [view currentRenderPassDescriptor];
                    renderPassDesc.colorAttachments[0].loadAction = MTLLoadActionClear;
                    renderPassDesc.colorAttachments[0].clearColor = MTLClearColorMake(0.0,0.0,0.0,1.0);
                    renderPassDesc.renderTargetWidth = self.texture.width;
                    renderPassDesc.renderTargetHeight = self.texture.height;
                    renderPassDesc.colorAttachments[0].texture = drawable.texture;
                    if (renderPassDesc != nil)
                    {
                        dispatch_semaphore_wait(self._inflight_semaphore, DISPATCH_TIME_FOREVER);
                        id <MTLCommandBuffer> commandBuffer = [self.metalContext.commandQueue commandBuffer];
                        [commandBuffer enqueue];
            // START PARALLEL RENDERING OPERATIONS HERE
                        id <MTLParallelRenderCommandEncoder> parallelRCE = [commandBuffer parallelRenderCommandEncoderWithDescriptor:renderPassDesc];
// FIRST PARALLEL RENDERING OPERATION
                        id <MTLRenderCommandEncoder> renderEncoder = [parallelRCE renderCommandEncoder];

                        [renderEncoder setRenderPipelineState:self.metalContext.renderPipelineState];

                        [renderEncoder setVertexBuffer:self.metalContext.vertexBuffer offset:0 atIndex:0];
                        [renderEncoder setVertexBuffer:self.metalContext.uniformBuffer offset:0 atIndex:1];

                        [renderEncoder setFragmentBuffer:self.metalContext.uniformBuffer offset:0 atIndex:0];

                        [renderEncoder setFragmentTexture:self.texture
                                                  atIndex:0];

                        [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip
                                          vertexStart:0
                                          vertexCount:4
                                        instanceCount:1];

                        [renderEncoder endEncoding];
            // ADD SECOND, THIRD, ETC. PARALLEL RENDERING OPERATION HERE
.
.
.
// SUBMIT ALL RENDERING OPERATIONS IN PARALLEL HERE
                        [parallelRCE endEncoding];

                        __block dispatch_semaphore_t block_sema = self._inflight_semaphore;
                        [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
                            dispatch_semaphore_signal(block_sema);

                        }];

                        if (drawable)
                            [commandBuffer presentDrawable:drawable];
                        [commandBuffer commit];
                        [commandBuffer waitUntilScheduled];
                    }
        });
    }

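In the example above, you would duplicate the renderEncoder-related code for each calculation you want to perform in parallel. I do not see how this would benefit your code sample, since one operation appears to depend on another. So the best you can probably hope for is the code that warrenm provided, even though it does not really qualify as parallel rendering.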

2 answers

Answer 1: score 11, accepted, 2016-08-24 23:05:55

Answer 2: score -1, 2018-06-17 22:35:37