Clamping short to unsigned char

I have a simple C function as follows:

unsigned char clamp(short value){
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return value;
}

Is it possible to rewrite it without using any if / else branching while being efficient?

EDIT:

I basically want to see whether a clamp based on bitwise arithmetic is possible. The objective is to process images on a GPU (Graphics Processing Unit), where this kind of code runs on every pixel. I guess that if the branches can be avoided, overall throughput on the GPU would be higher.

A solution like (value <0? 0: ((value > 255)? 255: value) ) is simply a rehash of if/else branching with syntactic sugar. So I am not looking for it.

EDIT 2:

I can cut it down to a single if as follows, but I cannot think of anything better:

unsigned char clamp(short value){
    int more = value >> 8;
    if(more){
        int sign = !(more >> 7);
        return sign * 0xff;
    }
    return value;
}

EDIT 3:

Just saw a very nice implementation of this in FFmpeg code:

/**
 * Clip a signed integer value into the 0-255 range.
 * @param a value to clip
 * @return clipped value
 */
static av_always_inline av_const uint8_t av_clip_uint8_c(int a)
{
    if (a&(~0xFF)) return (-a)>>31;
    else           return a;
}

This certainly works and reduces it to one if nicely.
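For anyone who wants to try the snippet outside FFmpeg, here is a minimal standalone sketch with the attribute macros dropped. It is not FFmpeg's actual source: the name clip_uint8_sketch is hypothetical, and it assumes a 32-bit int and an arithmetic right shift for negative values (implementation-defined in C, though mainstream compilers behave this way).

```c
#include <stdint.h>

/* Hypothetical standalone variant of the FFmpeg snippet.
 * Assumes 32-bit int and arithmetic right shift of negative values. */
static uint8_t clip_uint8_sketch(int a)
{
    /* Any bit set above the low 8 bits means a is outside 0..255. */
    if (a & ~0xFF) {
        /* a > 255: -a is negative, so (-a) >> 31 is -1, which truncates to 0xFF.
         * a < 0:   -a is positive, so (-a) >> 31 is 0. */
        return (uint8_t)((-a) >> 31);
    }
    return (uint8_t)a;
}
```

The only remaining branch is the range test; both out-of-range cases are resolved by the sign of -a.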

You write that you want to avoid branching on the GPU. It is true that branching can be very costly in a parallel environment, because either both branches have to be evaluated or synchronization has to be applied. But if the branches are small enough, the code will be faster than most arithmetic alternatives. The CUDA C Best Practices Guide describes why:

Sometimes, the compiler may [..] optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. [..]

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.

Branch predication is fast. Bloody fast. If you look at the intermediate PTX code generated by the optimizing compiler, you will see that it is superior to even modest arithmetic. So code like that in davmac's answer is probably as fast as it can get.
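As a rough illustration of the kind of code this applies to, here is a sketch in plain C of a comparison-based clamp applied per pixel. The names clamp_simple and clamp_pixels are hypothetical, and the loop merely stands in for what each GPU thread would do. Whether a particular compiler turns the comparisons into predicated instructions or conditional moves is not guaranteed; inspecting the generated PTX or assembly is the only way to know.

```c
#include <stddef.h>

/* Plain comparison-based clamp. The branches are tiny, which is the
 * situation where compilers may use predication (not guaranteed). */
static unsigned char clamp_simple(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (unsigned char)value;
}

/* Hypothetical per-pixel loop standing in for a GPU kernel body:
 * each iteration corresponds to the work of one thread. */
static void clamp_pixels(const short *in, unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = clamp_simple(in[i]);
}
```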

I know you did not ask specifically about CUDA, but most of the best practices guide also applies to OpenCL and probably to large parts of AMD's GPU programming.

BTW: in virtually every case of GPU code I have ever seen, most of the time is spent on memory access, not on arithmetic. Make sure to profile! http://en.wikipedia.org/wiki/Program_optimization

If you just want to avoid the actual if/else, you can use the ?: operator:

return value < 0 ? 0 : (value > 0xff ? 0xff : value);

However, in terms of efficiency this shouldn't be any different.

In practice, you shouldn't worry about efficiency with something so trivial as this. Let the compiler do the optimization.

You can do it without an explicit if by using ?: as shown by another poster, or by using interesting properties of abs(), which let you compute the maximum or minimum of two values.

For example, the expression (a + abs(a))/2 returns a for positive numbers and 0 otherwise (the maximum of a and 0).

This gives

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}

To convince yourself that this works, here is a test program:

#include <stdio.h>
#include <stdlib.h>  /* abs() */

unsigned char clip(short value)
{
  short a = (value + abs(value)) / 2;
  return (a + 255 - abs(a - 255)) / 2;
}

void test(short value)
{
  printf("clip(%d) = %d\n", value, clip(value));
}

int main()
{
  test(0);
  test(10);
  test(-10);
  test(255);
  test(265);
  return 0;
}

When run, this prints

clip(0) = 0
clip(10) = 10
clip(-10) = 0
clip(255) = 255
clip(265) = 255
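Five spot checks are not a proof, so here is a sketch of an exhaustive comparison against a plain comparison-based clamp. The helper names clamp_ref, clip_abs and count_mismatches are hypothetical, and the sketch assumes int is wider than short so that value + abs(value) cannot overflow.

```c
#include <limits.h>  /* SHRT_MIN, SHRT_MAX */
#include <stdlib.h>  /* abs */

/* Reference clamp using plain comparisons. */
static unsigned char clamp_ref(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (unsigned char)value;
}

/* abs()-based clip: max(value, 0) followed by min(a, 255). */
static unsigned char clip_abs(short value)
{
    short a = (short)((value + abs(value)) / 2);
    return (unsigned char)((a + 255 - abs(a - 255)) / 2);
}

/* Returns how many short values make the two functions disagree. */
static long count_mismatches(void)
{
    long mismatches = 0;
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v)
        if (clip_abs((short)v) != clamp_ref((short)v))
            ++mismatches;
    return mismatches;
}
```

A return value of 0 from count_mismatches() means the two agree on every short.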

Of course, one may argue that there is probably a test inside abs(), but gcc -O3, for example, compiles it to straight-line code:

clip:
    movswl  %di, %edi
    movl    %edi, %edx
    sarl    $31, %edx
    movl    %edx, %eax
    xorl    %edi, %eax
    subl    %edx, %eax
    addl    %edi, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %eax, %edx
    sarl    %edx
    movswl  %dx, %edx
    leal    255(%rdx), %eax
    subl    $255, %edx
    movl    %edx, %ecx
    sarl    $31, %ecx
    xorl    %ecx, %edx
    subl    %ecx, %edx
    subl    %edx, %eax
    movl    %eax, %edx
    shrl    $31, %edx
    addl    %edx, %eax
    sarl    %eax
    ret

But note that this will be much less efficient than your original function, which compiles as:

clip:
    xorl    %eax, %eax
    testw   %di, %di
    js      .L1
    movl    $-1, %eax
    cmpw    $255, %di
    cmovle  %edi, %eax
.L1:
    rep
    ret

But at least it answers your question:)

You could do a 2D lookup-table:

unsigned char clamp(short value)
{
  static const unsigned char table[256][256] = { ... };

  const unsigned char x = value & 0xff;
  const unsigned char y = (value >> 8) & 0xff;
  return table[y][x];
}

Sure, this looks bizarre (a 64 KB table for this trivial computation). However, since you mentioned wanting to do this on a GPU, I'm thinking the above could be a texture lookup, which I believe is pretty quick on GPUs.
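The table contents are elided above. As one possible way to generate them, here is a sketch that fills a table at startup instead of using a static initializer. The names clamp_table, init_clamp_table and clamp_lut are hypothetical, and the sketch assumes a 16-bit two's-complement short, so that the top bit of the high byte is the sign bit.

```c
/* Hypothetical runtime-filled table, indexed as [high byte][low byte]. */
static unsigned char clamp_table[256][256];

static void init_clamp_table(void)
{
    for (int y = 0; y < 256; ++y) {        /* y = high byte of the short */
        for (int x = 0; x < 256; ++x) {    /* x = low byte of the short  */
            if (y & 0x80)
                clamp_table[y][x] = 0;                 /* sign bit set: negative */
            else if (y != 0)
                clamp_table[y][x] = 0xff;              /* 256..32767 */
            else
                clamp_table[y][x] = (unsigned char)x;  /* 0..255 */
        }
    }
}

/* Same lookup as in the answer, using the runtime-filled table. */
static unsigned char clamp_lut(short value)
{
    const unsigned char x = value & 0xff;
    const unsigned char y = (value >> 8) & 0xff;
    return clamp_table[y][x];
}
```

Call init_clamp_table() once before the first lookup. Only the row for y == 0 actually depends on x; every other row is constant.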

Further, if your GPU uses OpenGL, you could of course just use the clamp builtin directly:

clamp(value, 0, 255);

This won't type-convert (there is no 8-bit integer type in GLSL, it seems), but still.

How about:

unsigned char clamp (short value) {
    unsigned char r = (value >> 15);          /* uses arithmetic right-shift */
    unsigned char s = !!(value & 0x7f00) * 0xff;
    unsigned char v = (value & 0xff);
    return (v | s) & ~r;
}

But I seriously doubt that it executes any faster than your original version involving branches.
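To make the masking idea easier to follow, here is a variation (not the code above) that builds the two masks from comparison results instead of shifts. The name clamp_masks is hypothetical. It relies only on comparisons yielding 0 or 1 and on two's-complement int, so that -1 has all bits set; whether the compiler emits branch-free code for the comparisons is still up to the compiler.

```c
/* Hypothetical mask-based clamp: a variation on the idea above. */
static unsigned char clamp_masks(short value)
{
    int v = value;
    int neg  = -(v < 0);     /* all ones if v is negative, else 0 */
    int over = -(v > 0xff);  /* all ones if v exceeds 255, else 0 */

    /* OR-ing with `over` forces the low byte to 0xff when v is too large;
     * AND-ing with ~neg forces the result to 0 when v is negative. */
    return (unsigned char)((v | over) & ~neg);
}
```

For example, clamp_masks(300) gives over = -1, so the low byte becomes 0xff, while clamp_masks(-5) gives ~neg = 0 and the result is 0.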

Assuming a two byte short, and at the cost of readability of the code:

clipped_x =  (x & 0x8000) ? 0 : ((x >> 8) ? 0xFF : x);

You should time this ugly but arithmetic-only version.

unsigned char clamp(short value){
  short pmask = ((value & 0x4000) >> 7) | ((value & 0x2000) >> 6) |
    ((value & 0x1000) >> 5) | ((value & 0x0800) >> 4) |
    ((value & 0x0400) >> 3) | ((value & 0x0200) >> 2) |
    ((value & 0x0100) >> 1);
  pmask |= (pmask >> 1) | (pmask >> 2) | (pmask >> 3) | (pmask >> 4) |
    (pmask >> 5) | (pmask >> 6) | (pmask >> 7);
  value |= pmask;
  short nmask = (value & 0x8000) >> 8;
  nmask |= (nmask >> 1) | (nmask >> 2) | (nmask >> 3) | (nmask >> 4) |
    (nmask >> 5) | (nmask >> 6) | (nmask >> 7);
  value &= ~nmask;
  return value;
}
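For the timing itself, a minimal CPU-side harness might look like the sketch below. The names clamp_baseline and time_clamp are hypothetical, the input range assumes a 16-bit short, and clock() measures CPU time on the host, so the numbers say little about GPU behavior.

```c
#include <time.h>

/* Baseline to compare against: the comparison-based clamp. */
static unsigned char clamp_baseline(short value)
{
    if (value < 0) return 0;
    if (value > 0xff) return 0xff;
    return (unsigned char)value;
}

/* Hypothetical timing helper: runs fn over every 16-bit input `rounds`
 * times and returns the elapsed CPU time in seconds. */
static double time_clamp(unsigned char (*fn)(short), int rounds)
{
    volatile unsigned sink = 0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < rounds; ++r)
        for (int v = -32768; v <= 32767; ++v)
            sink += fn((short)v);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_clamp(clamp_baseline, 1000) and time_clamp(clamp, 1000), with clamp being the arithmetic-only function, gives two numbers to compare. Calling through a function pointer prevents inlining, so treat the result as a rough indication only.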

One way to make it efficient is to declare the function inline to avoid function-call overhead. You could also turn it into a macro using the ternary operator, but that removes the compiler's return type checking.
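A minimal sketch of the two options, with hypothetical names clamp_inline and CLAMP_U8:

```c
/* Inline function: the argument and the return value are type-checked. */
static inline unsigned char clamp_inline(short value)
{
    return (unsigned char)(value < 0 ? 0 : (value > 0xff ? 0xff : value));
}

/* Hypothetical macro form: no type checking, and the argument is
 * evaluated more than once, so avoid side effects such as CLAMP_U8(x++). */
#define CLAMP_U8(v) \
    ((unsigned char)((v) < 0 ? 0 : ((v) > 0xff ? 0xff : (v))))
```

With optimization enabled, a compiler will usually inline such a small static function anyway, so the macro rarely buys anything.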
