Is there a compiler bug in my iOS Metal compute kernel, or am I missing something?

I need an implementation of upper_bound, as described in the STL, for my Metal compute kernel. Since the Metal standard library has nothing like it, I essentially copied it from <algorithm> into my shader file like so:

static device float* upper_bound( device float* first, device float* last, float val)
{
    ptrdiff_t count = last - first;
    while( count > 0){
        device float* it = first;
        ptrdiff_t step = count/2;
        it += step;
        if( !(val < *it)){
            first = ++it;
            count -= step + 1;
        }else count = step;
    }
    return first;
}

I created a simple kernel to test it like so:

kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    device float* where = upper_bound( input, input + 5, 3.1);
    output[0] = where - input;
}

For this test the kernel has a hardcoded input size and search value. I also hardcoded a 5-element input buffer on the framework side, as you'll see below. I expect this kernel to return the index of the first input greater than 3.1.

It doesn't work. In fact output[0] is never written: I preloaded the buffer with a magic number to see whether it gets overwritten, and it doesn't. After waitUntilCompleted, commandBuffer.error looks like this:

Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"

What does error code 3 mean? Did my kernel get killed before it had a chance to finish?

Further, I tried just a linear search version of upper_bound like so:

static device float* upper_bound2( device float* first, device float* last, float val)
{
    while( first < last && *first <= val)
        ++first;
    return first;
}

This one works (sort-of). I have the same problem with a binary search lower_bound from <algorithm>, yet a naive linear version works (sort-of). BTW, I tested my STL-copied versions as straight C code (with device removed, obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a Metal compiler bug.
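For reference, the host-side check mentioned above can be sketched roughly as follows. This is only an illustration in plain C++, assuming the shader function is copied verbatim with the `device` qualifier dropped; the names `upper_bound_ptr` and `matches_std` are hypothetical and not part of the original project.

```cpp
#include <algorithm>
#include <cstddef>

// Host-side copy of the pointer-based binary search from the question,
// with the Metal `device` address-space qualifier removed.
// (The name upper_bound_ptr is hypothetical.)
static const float* upper_bound_ptr(const float* first, const float* last, float val)
{
    std::ptrdiff_t count = last - first;
    while (count > 0) {
        const float* it = first;
        std::ptrdiff_t step = count / 2;
        it += step;
        if (!(val < *it)) {
            // *it <= val, so the answer lies strictly to the right of it.
            first = ++it;
            count -= step + 1;
        } else {
            // val < *it, so the answer is it or something to its left.
            count = step;
        }
    }
    return first;
}

// Hypothetical helper: true when the copy agrees with std::upper_bound
// for the given sorted array and search value.
static bool matches_std(const float* data, std::size_t n, float val)
{
    return upper_bound_ptr(data, data + n, val) == std::upper_bound(data, data + n, val);
}
```

On the CPU, calling `upper_bound_ptr` on the sorted array {1, 2, 3, 4, 5} with 3.1 gives an offset of 3 from the start of the array, which is the value the test kernel is expected to write.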

Now about that "sort-of" above: the linear search versions work on a 5s and a mini 2 (both A7), returning index 3 in the example above, but on a 6+ (A8) they give the right answer + 2^31. What the heck! Same exact code. Note that on the framework side I use uint32_t and on the shader side I use uint, which are the same thing. Note also that every pointer subtraction (ptrdiff_t is a signed 8-byte type) yields a small non-negative value. Why is the 6+ setting that high-order bit? And of course, why don't my real binary search versions work?
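To make the "right answer + 2^31" observation concrete, here is a small host-side sketch for inspecting the value read back from the output buffer. It is a diagnostic aid only, not a fix, and the names `kHighBit`, `high_bit_set` and `without_high_bit` are hypothetical.

```cpp
#include <cstdint>

// Bit 31 of a 32-bit unsigned value, i.e. 2^31.
constexpr std::uint32_t kHighBit = std::uint32_t{1} << 31;

// True when the value read back has the high-order bit set.
static bool high_bit_set(std::uint32_t v)
{
    return (v & kHighBit) != 0;
}

// The same value with the high-order bit cleared.
static std::uint32_t without_high_bit(std::uint32_t v)
{
    return v & ~kHighBit;
}
```

With the expected index 3, the corrupted result would be 3 + 2^31 = 0x80000003, and clearing the high bit recovers 3.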

Here is the framework side stuff:

id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id <MTLComputePipelineState> upperBoundTestPipelineState = [_device
    newComputePipelineStateWithFunction: upperBoundTestKernel
    error: &err];

float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
    newBufferWithBytes:(const void *)sortedNumbers
    length: sizeof(sortedNumbers)
    options: MTLResourceCPUCacheModeDefaultCache];

id<MTLBuffer> testOutputBuffer = [_device
    newBufferWithLength: sizeof(uint32_t)
    options: MTLResourceCPUCacheModeDefaultCache];

*(uint32_t*)testOutputBuffer.contents = 42;//magic number better get clobbered

id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
    dispatchThreadgroups: MTLSizeMake( 1, 1, 1)
    threadsPerThreadgroup: MTLSizeMake( 1, 1, 1)];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];

uint32_t answer = *(uint32_t*)testOutputBuffer.contents;

Well, I've found a solution/work-around. I guessed it was a pointer-aliasing problem since first and last pointed into the same buffer. So I changed them to offsets from a single pointer variable. Here's a re-written upper_bound2:

static uint upper_bound2( device float* input, uint first, uint last, float val)
{
    while( first < last && input[first] <= val)
        ++first;
    return first;
}

And a re-written test kernel:

kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    output[0] = upper_bound2( input, 0, 5, 3.1);
}

This worked completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly re-written binary search worked too. I don't want to believe this, though. The Metal shading language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
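The re-written binary search is not shown above, so here is a minimal sketch of what an offset-based version could look like. It is written as plain C++ so it can be checked on the CPU; in a shader the `input` parameter would presumably carry the `device` qualifier and `std::uint32_t` would be `uint`. The name `upper_bound_index` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Offset-based binary search: returns the index of the first element in
// input[first, last) that is greater than val, or last if there is none.
// Only one pointer (input) is used; positions are plain unsigned offsets.
static std::uint32_t upper_bound_index(const float* input,
                                       std::uint32_t first,
                                       std::uint32_t last,
                                       float val)
{
    assert(first <= last);
    std::uint32_t count = last - first;
    while (count > 0) {
        std::uint32_t step = count / 2;
        std::uint32_t it = first + step;
        if (!(val < input[it])) {
            // input[it] <= val, so continue in the right half.
            first = it + 1;
            count -= step + 1;
        } else {
            // val < input[it], so continue in the left half.
            count = step;
        }
    }
    return first;
}
```

Because `step + 1` never exceeds `count` while `count > 0`, the unsigned subtraction cannot wrap around, so no signed type is needed for the bookkeeping.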

Anyway, I don't recall seeing any docs saying there can be no pointer aliasing or what declaration incantation would help me here. Any more help?


For the record, as pointed out by "slime" on Apple's dev forum: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3

"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."

But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what it means, something like: "No pointer aliasing to device and constant memory in any function is allowed including rvalues." I don't actually know if this is too strong.

