I need an implementation of upper_bound
as described in the STL for my metal compute kernel. Not having anything in the metal standard library, I essentially copied it from <algorithm>
into my shader file like so:
static device float* upper_bound( device float* first, device float* last, float val)
{
ptrdiff_t count = last - first;
while( count > 0){
device float* it = first;
ptrdiff_t step = count/2;
it += step;
if( !(val < *it)){
first = ++it;
count -= step + 1;
}else count = step;
}
return first;
}
I created a simple kernel to test it like so:
kernel void upper_bound_test(
device float* input [[buffer(0)]],
device uint* output [[buffer(1)]]
)
{
device float* where = upper_bound( input, input + 5, 3.1);
output[0] = where - input;
}
Which for this test has a hardcoded input size and search value. I also hardcoded a 5 element input buffer on the framework side as you'll see below. This kernel I expect to return the index of the first input greater than 3.1
It doesn't work. In fact output[0]
is never written--as I preloaded the buffer with a magic number to see if it gets over-written. It doesn't. In fact after waitUntilCompleted
, commandBuffer.error
looks like this:
Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"
What does error code 3 mean? Did my kernel get killed before it had a chance to finish?
Further, I tried just a linear search version of upper_bound
like so:
static device float* upper_bound2( device float* first, device float* last, float val)
{
while( first < last && *first <= val)
++first;
return first;
}
This one works (sort-of). I have the same problem with a binary search lower_bound from <algorithm>
--yet a naive linear version works (sort-of). BTW, I tested my STL copied versions from straight C-code (with device
removed obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a metal compiler bug.
Now about that "sort-of" above: the linear search versions work on a 5s and mini-2 (A7s) (returns index 3 in the example above), but on a 6+ (A8) it gives the right answer + 2^31. What the heck! Same exact code. Note on the framework side I use uint32_t
and on the shader side I use uint
--which are the same thing. Note also that every pointer subtraction ( ptrdiff_t
are signed 8-byte things) are small non-negative values. Why is the 6+ setting that high order bit? And of course, why don't my real binary search versions work?
Here is the framework side stuff:
id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id <MTLComputePipelineState> upperBoundTestPipelineState = [_device
newComputePipelineStateWithFunction: upperBoundTestKernel
error: &err];
float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
newBufferWithBytes:(const void *)sortedNumbers
length: sizeof(sortedNumbers)
options: MTLResourceCPUCacheModeDefaultCache];
id<MTLBuffer> testOutputBuffer = [_device
newBufferWithLength: sizeof(uint32_t)
options: MTLResourceCPUCacheModeDefaultCache];
*(uint32_t*)testOutputBuffer.contents = 42;//magic number better get clobbered
id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
dispatchThreadgroups: MTLSizeMake( 1, 1, 1)
threadsPerThreadgroup: MTLSizeMake( 1, 1, 1)];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
uint32_t answer = *(uint32_t*)testOutputBuffer.contents;
Well, I've found a solution/work-around. I guessed it was a pointer-aliasing problem since first
and last
pointed into the same buffer. So I changed them to offsets from a single pointer variable. Here's a re-written upper_bound2:
static uint upper_bound2( device float* input, uint first, uint last, float val)
{
while( first < last && input[first] <= val)
++first;
return first;
}
And a re-written test kernel:
kernel void upper_bound_test(
device float* input [[buffer(0)]],
device uint* output [[buffer(1)]]
)
{
output[0] = upper_bound2( input, 0, 5, 3.1);
}
This worked--completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly re-written binary search worked too. I don't want to believe this though. The metal shader language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
Anyway, I don't recall seeing any docs saying there can be no pointer aliasing or what declaration incantation would help me here. Any more help?
[UPDATE]
For the record, as pointed out by "slime" on Apple's dev forum: https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3
"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."
But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what it means, something like: "No pointer aliasing to device and constant memory in any function is allowed including rvalues." I don't actually know if this is too strong.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.