Is there a compiler bug for my iOS metal compute kernel or am I missing something?

I need an implementation of upper_bound, as described in the STL, for my Metal compute kernel. Since the Metal standard library has nothing like it, I essentially copied it from <algorithm> into my shader file like so:

static device float* upper_bound( device float* first, device float* last, float val)
{
    ptrdiff_t count = last - first;
    while( count > 0){
        device float* it = first;
        ptrdiff_t step = count/2;
        it += step;
        if( !(val < *it)){
            first = ++it;
            count -= step + 1;
        }else count = step;
    }
    return first;
}

I created a simple kernel to test it like so:

kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    device float* where = upper_bound( input, input + 5, 3.1);
    output[0] = where - input;
}

For this test, the input size and search value are hardcoded. I also hardcoded a 5-element input buffer on the framework side, as you'll see below. I expect this kernel to return the index of the first input greater than 3.1.

It doesn't work. In fact, output[0] is never written: I preloaded the buffer with a magic number to see whether it gets overwritten, and it doesn't. After waitUntilCompleted, commandBuffer.error looks like this:

Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"

What does error code 3 mean? Did my kernel get killed before it had a chance to finish?

Further, I tried just a linear search version of upper_bound like so:

static device float* upper_bound2( device float* first, device float* last, float val)
{
    while( first < last && *first <= val)
        ++first;
    return first;
}

This one works (sort-of). I have the same problem with a binary-search lower_bound from <algorithm>, yet a naive linear version works (sort-of). BTW, I tested my STL-copied versions as plain C code (with device removed, obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a Metal compiler bug.
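The host-side check mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact test I ran: it assumes plain C++ with the Metal `device` qualifier dropped, and the names `upper_bound_ptr` and `index_for` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Pointer-based upper_bound as in the question, with the Metal `device`
// address-space qualifier removed so it compiles as plain C++.
// (Hypothetical name; the shader version is called upper_bound.)
static const float* upper_bound_ptr(const float* first, const float* last, float val)
{
    std::ptrdiff_t count = last - first;
    while (count > 0) {
        const float* it = first;
        std::ptrdiff_t step = count / 2;
        it += step;
        if (!(val < *it)) {
            // *it <= val: the answer lies strictly to the right of it.
            first = ++it;
            count -= step + 1;
        } else {
            // val < *it: the answer is it or somewhere to its left.
            count = step;
        }
    }
    return first;
}

// Returns the index found for the question's 5-element test data.
// (Hypothetical helper, only here to make the check easy to call.)
static std::ptrdiff_t index_for(float val)
{
    static const float sorted[] = {1.f, 2.f, 3.f, 4.f, 5.f};
    const float* hit = upper_bound_ptr(sorted, sorted + 5, val);
    // Cross-check against the standard library on the same range.
    assert(hit == std::upper_bound(sorted, sorted + 5, val));
    return hit - sorted;
}
```

On the CPU, `index_for(3.1f)` gives 3, which is the index I expected the kernel to write to output[0].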

Now about that "sort-of" above: the linear search versions work on a 5s and a mini-2 (both A7), returning index 3 in the example above, but on a 6+ (A8) they give the right answer + 2^31. What the heck! Same exact code. Note that on the framework side I use uint32_t and on the shader side I use uint, which are the same thing. Note also that every pointer subtraction (ptrdiff_t is a signed 8-byte type) yields a small non-negative value. Why is the 6+ setting that high-order bit? And of course, why don't my real binary search versions work?

Here is the framework side stuff:

id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id <MTLComputePipelineState> upperBoundTestPipelineState = [_device
    newComputePipelineStateWithFunction: upperBoundTestKernel
    error: &err];


float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
    newBufferWithBytes:(const void *)sortedNumbers
    length: sizeof(sortedNumbers)
    options: MTLResourceCPUCacheModeDefaultCache];

id<MTLBuffer> testOutputBuffer = [_device
    newBufferWithLength: sizeof(uint32_t)
    options: MTLResourceCPUCacheModeDefaultCache];

*(uint32_t*)testOutputBuffer.contents = 42;//magic number better get clobbered

id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
    dispatchThreadgroups: MTLSizeMake( 1, 1, 1)
    threadsPerThreadgroup: MTLSizeMake( 1, 1, 1)];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];

uint32_t answer = *(uint32_t*)testOutputBuffer.contents;

Well, I've found a solution/workaround. I guessed it was a pointer-aliasing problem, since first and last pointed into the same buffer, so I changed them to offsets from a single pointer variable. Here's a rewritten upper_bound2:

static uint upper_bound2( device float* input, uint first, uint last, float val)
{
    while( first < last && input[first] <= val)
        ++first;
    return first;
}

And a rewritten test kernel:

kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    output[0] = upper_bound2( input, 0, 5, 3.1);
}

This worked completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly rewritten binary search worked too. I don't want to believe this, though. The Metal shading language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
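For reference, the index-based binary search looks roughly like the sketch below. It is written as plain C++ so it can be checked on the CPU; the assumption is that in the shader the pointer parameter becomes `device float*` and `std::uint32_t` becomes `uint`, with the body otherwise unchanged. The names `upper_bound_idx` and `kSample` are illustrative only.

```cpp
#include <cstdint>

// The question's 5-element test data, repeated here for convenience.
static const float kSample[] = {1.f, 2.f, 3.f, 4.f, 5.f};

// Index-based binary-search upper_bound: returns the index of the first
// element in input[first, last) that is greater than val, or last if
// there is none. Requires first <= last and a sorted range.
// Only one pointer is used; the range is carried as two offsets, so no
// two pointers into the same buffer are ever compared or subtracted.
static std::uint32_t upper_bound_idx(const float* input, std::uint32_t first,
                                     std::uint32_t last, float val)
{
    std::uint32_t count = last - first;
    while (count > 0) {
        std::uint32_t step = count / 2;
        std::uint32_t mid = first + step;
        if (!(val < input[mid])) {
            // input[mid] <= val: continue in the right half.
            first = mid + 1;
            count -= step + 1;
        } else {
            // val < input[mid]: continue in the left half.
            count = step;
        }
    }
    return first;
}
```

Called as `upper_bound_idx(kSample, 0, 5, 3.1f)`, this returns 3 on the CPU, matching the linear upper_bound2 above.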

Anyway, I don't recall seeing any docs saying that pointer aliasing is not allowed, or what declaration incantation would help me here. Any more help?

[UPDATE]

For the record, as pointed out by "slime" on Apple's dev forum: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3

"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."

But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what they mean, something like: "No pointer aliasing to device and constant memory is allowed in any function, including rvalues." I don't actually know if this is too strong.
