Clamping short to unsigned char

I have a simple C function as follows:

unsigned char clamp(short value){
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return value;
}

Is it possible to rewrite it without any if/else branching while keeping it efficient?

EDIT:

I basically wish to see whether some implementation of clamping based on bitwise arithmetic is possible. The objective is to process images on a GPU (Graphics Processing Unit). This type of code will run on each pixel. I guess that if branches can be avoided, overall throughput on the GPU would be higher.

A solution like (value < 0 ? 0 : ((value > 255) ? 255 : value)) is simply a rehash of if/else branching with syntactic sugar, so I am not looking for that.

EDIT 2:

I can cut it down to a single if as follows, but I cannot think of anything better:

unsigned char clamp(short value){
    int more = value >> 8;
    if(more){
        int sign = !(more >> 7);
        return sign * 0xff;
    }
    return value;
}

EDIT 3:

Just saw a very nice implementation of this in the FFmpeg code:

/**
 * Clip a signed integer value into the 0-255 range.
 * @param a value to clip
 * @return clipped value
 */
static av_always_inline av_const uint8_t av_clip_uint8_c(int a)
{
    if (a&(~0xFF)) return (-a)>>31;
    else           return a;
}

This certainly works and nicely reduces it to a single if.

You write that you want to avoid branching on the GPU. It is true that branching can be very costly in a parallel environment, because either both branches have to be evaluated or synchronization has to be applied. But if the branches are small enough, the code will be faster than most arithmetic. The CUDA C Best Practices Guide describes why:

Sometimes, the compiler may [..] optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. [..]

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.

Branch predication is fast. Bloody fast. If you look at the intermediate PTX code generated by the optimizing compiler, you will see that it is superior to even modest arithmetic. So code like that in davmac's answer is probably as fast as it can get.

I know you did not ask specifically about CUDA, but most of the best practices guide also applies to OpenCL and probably to large parts of AMD's GPU programming.

BTW: in virtually every case of GPU code I have ever seen, most of the time is spent on memory access, not on arithmetic. Make sure to profile! http://en.wikipedia.org/wiki/Program_optimization

If you just want to avoid the actual if/else, use the ?: operator:

return value < 0 ? 0 : (value > 0xff ? 0xff : value);

However, in terms of efficiency this shouldn't be any different.

In practice, you shouldn't worry about efficiency with something as trivial as this. Let the compiler do the optimization.

You can do it without an explicit if by using ?:, as shown by another poster, or by using an interesting property of abs() that lets you compute the maximum or minimum of two values.

For example, the expression (a + abs(a))/2 returns a for positive numbers and 0 otherwise (the maximum of a and 0).

This gives:

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}
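
The two lines of clip() rest on the standard absolute-value identities for maximum and minimum. As a short worked derivation (the symbols a and b are generic placeholders, not names from the original post):

```latex
\begin{align*}
\max(a, b) &= \frac{a + b + |a - b|}{2}, &
\min(a, b) &= \frac{a + b - |a - b|}{2} \\[4pt]
\max(\mathit{value}, 0) &= \frac{\mathit{value} + |\mathit{value}|}{2}, &
\min(a, 255) &= \frac{a + 255 - |a - 255|}{2}
\end{align*}
```

The numerators equal twice the maximum or twice the minimum, so they are always even and the integer division by 2 is exact. For value = 265, the first step gives a = (265 + 265)/2 = 265 and the second gives (265 + 255 - 10)/2 = 255.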

To convince yourself that this works, here is a test program:

#include <stdio.h>
#include <stdlib.h>  /* abs() */

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}

void test(short value)
{
  printf("clip(%d) = %d\n", value, clip(value));
}

int main()
{
  test(0);
  test(10);
  test(-10);
  test(255);
  test(265);
  return 0;
}

When run, this prints:

clip(0) = 0
clip(10) = 10
clip(-10) = 0
clip(255) = 255
clip(265) = 255

Of course, one may argue that there is probably a test inside abs(), but gcc -O3, for example, compiles it to straight-line code:

clip:
    movswl  %di, %edi
    movl    %edi, %edx
    sarl    $31, %edx
    movl    %edx, %eax
    xorl    %edi, %eax
    subl    %edx, %eax
    addl    %edi, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %eax, %edx
    sarl    %edx
    movswl  %dx, %edx
    leal    255(%rdx), %eax
    subl    $255, %edx
    movl    %edx, %ecx
    sarl    $31, %ecx
    xorl    %ecx, %edx
    subl    %ecx, %edx
    subl    %edx, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %edx, %eax
    sarl    %eax
    ret

But note that this is much less efficient than your original function, which compiles to:

clip:
    xorl    %eax, %eax
    testw   %di, %di
    js      .L1
    movl    $-1, %eax
    cmpw    $255, %di
    cmovle  %edi, %eax
.L1:
    rep
    ret

But at least it answers your question :)

You could use a 2D lookup table:

unsigned char clamp(short value)
{
  static const unsigned char table[256][256] = { ... };

  const unsigned char x = value & 0xff;
  const unsigned char y = (value >> 8) & 0xff;
  return table[y][x];
}

Sure, this looks bizarre (a 64 KB table for this trivial computation). However, since you mentioned that you want to do this on a GPU, I'm thinking the above could be a texture look-up, which I believe is pretty quick on GPUs.
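
The table contents are elided in the snippet above. As a minimal sketch, one way to fill such a table at start-up is shown below; it assumes a 16-bit two's-complement short, and the names init_clamp_table, clamp_lut and table_ready are hypothetical, introduced only for this illustration.

```c
#include <assert.h>

/* 256 x 256 lookup table, indexed as table[high byte][low byte]. */
static unsigned char table[256][256];
static int table_ready = 0;   /* hypothetical guard flag */

/* Fill the table once. Assumes a 16-bit two's-complement short. */
void init_clamp_table(void)
{
    for (int y = 0; y < 256; ++y) {
        for (int x = 0; x < 256; ++x) {
            int v = (y << 8) | x;            /* raw 16-bit pattern, 0..65535 */
            if (v >= 0x8000) v -= 0x10000;   /* reinterpret as signed 16-bit */
            table[y][x] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }
    table_ready = 1;
}

/* Same indexing as clamp() above, but reading the run-time table. */
unsigned char clamp_lut(short value)
{
    assert(table_ready);   /* init_clamp_table() must have run first */
    const unsigned char x = value & 0xff;
    const unsigned char y = (value >> 8) & 0xff;
    return table[y][x];
}
```

Call init_clamp_table() once before the first clamp_lut() call. On a GPU the same 64 KB of data would be uploaded as a texture rather than kept in a static array.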

Further, if your GPU uses OpenGL, you could of course just use the clamp builtin directly:

clamp(value, 0, 255);

This won't type-convert (there is no 8-bit integer type in GLSL, it seems), but still.

How about:

unsigned char clamp (short value) {
    unsigned char r = (value >> 15);          /* uses arithmetic right-shift */
    unsigned char s = !!(value & 0x7f00) * 0xff;
    unsigned char v = (value & 0xff);
    return (v | s) & ~r;
}

But I seriously doubt that it executes any faster than your original version involving branches.

Assuming a two-byte short, and at the cost of code readability:
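
Since short has only 65536 possible values, a branch-free candidate like this can be checked exhaustively against the original function. The sketch below does that; it assumes a 16-bit short and an arithmetic right shift on negative values (implementation-defined in C), and the names clamp_ref, clamp_bits, count_mismatches and check_clamp_bits are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Reference: the branching clamp from the question. */
unsigned char clamp_ref(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (unsigned char)value;
}

/* Branch-free candidate from the answer above, with explicit casts.
   Relies on >> being an arithmetic shift for negative values. */
unsigned char clamp_bits(short value)
{
    unsigned char r = (unsigned char)(value >> 15);            /* 0xff if negative */
    unsigned char s = (unsigned char)(!!(value & 0x7f00) * 0xff); /* 0xff if any of bits 8..14 set */
    unsigned char v = (unsigned char)(value & 0xff);           /* low byte */
    return (unsigned char)((v | s) & ~r);
}

/* Number of short values on which the two functions disagree. */
long count_mismatches(void)
{
    long mismatches = 0;
    for (int i = SHRT_MIN; i <= SHRT_MAX; ++i) {
        if (clamp_ref((short)i) != clamp_bits((short)i))
            ++mismatches;
    }
    return mismatches;
}

/* Aborts if the candidate ever differs from the reference. */
void check_clamp_bits(void)
{
    assert(count_mismatches() == 0);
}
```

The same harness works for any other candidate with the signature unsigned char f(short); only the call inside count_mismatches() needs to change.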

clipped_x =  (x & 0x8000) ? 0 : ((x >> 8) ? 0xFF : x);

You should time this ugly but arithmetic-only version:

unsigned char clamp(short value){
  short pmask = ((value & 0x4000) >> 7) | ((value & 0x2000) >> 6) |
    ((value & 0x1000) >> 5) | ((value & 0x0800) >> 4) |
    ((value & 0x0400) >> 3) | ((value & 0x0200) >> 2) |
    ((value & 0x0100) >> 1);
  pmask |= (pmask >> 1) | (pmask >> 2) | (pmask >> 3) | (pmask >> 4) |
    (pmask >> 5) | (pmask >> 6) | (pmask >> 7);
  value |= pmask;
  short nmask = (value & 0x8000) >> 8;
  nmask |= (nmask >> 1) | (nmask >> 2) | (nmask >> 3) | (nmask >> 4) |
    (nmask >> 5) | (nmask >> 6) | (nmask >> 7);
  value &= ~nmask;
  return value;
}
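
As a rough sketch of how such a timing could be done on a CPU, the helper below runs a clamp function over every short value and reports elapsed CPU time using the standard clock() function. The names clamp_fn, clamp_branchy and time_clamp are hypothetical, and a 16-bit short is assumed. Calling through a function pointer adds overhead and prevents inlining, so the numbers are only a coarse comparison and say nothing about GPU behaviour.

```c
#include <assert.h>
#include <limits.h>
#include <time.h>

typedef unsigned char (*clamp_fn)(short);

/* Baseline: the branching clamp from the question. */
unsigned char clamp_branchy(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (unsigned char)value;
}

/* Runs fn over every short value `passes` times and returns the elapsed
   CPU time in seconds. The checksum is written out so that the compiler
   cannot discard the loop. */
double time_clamp(clamp_fn fn, int passes, unsigned long *checksum)
{
    assert(fn != NULL && passes > 0 && checksum != NULL);

    unsigned long sum = 0;
    clock_t start = clock();
    for (int p = 0; p < passes; ++p)
        for (int i = SHRT_MIN; i <= SHRT_MAX; ++i)
            sum += fn((short)i);
    clock_t end = clock();

    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call time_clamp(clamp_branchy, 1000, &sum) and then time_clamp(clamp, 1000, &sum) with the arithmetic-only version, and compare both the timings and the checksums.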

One way to make it efficient is to declare this function inline to avoid function-call overhead. You could also turn it into a macro using the ternary operator, but that removes the compiler's return-type checking.
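
A minimal sketch of both forms follows. The names clamp_inline, CLAMP_U8 and check_clamp_forms are hypothetical, static inline assumes C99 or later, and the range in the check assumes a 16-bit short.

```c
#include <assert.h>

/* Inline function: argument and return types are still checked. */
static inline unsigned char clamp_inline(short value)
{
    return (unsigned char)(value < 0 ? 0 : (value > 0xff ? 0xff : value));
}

/* Macro form: no type checking, and the argument is evaluated more than
   once, so avoid arguments with side effects such as CLAMP_U8(i++). */
#define CLAMP_U8(v) ((v) < 0 ? 0 : ((v) > 0xff ? 0xff : (v)))

/* Checks that both forms agree over the whole 16-bit range. */
static void check_clamp_forms(void)
{
    for (int i = -32768; i <= 32767; ++i)
        assert(clamp_inline((short)i) == CLAMP_U8(i));
}
```

With optimization enabled, most compilers will inline a function this small on their own, so the keyword mostly documents intent.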
