
n is negative, positive or zero? Return 1, 2, or 4

I'm building a PowerPC interpreter, and it works quite well. In the Power architecture, the condition register CR0 (the counterpart of EFLAGS on x86) is updated by almost any instruction. It is set like this: the value of CR0 is 1 if the last result was negative, 2 if the last result was positive, and 4 otherwise.

My first naive method to interpret this is:

if (n < 0)
    cr0 = 1;
else if (n > 0)
    cr0 = 2;
else
    cr0 = 4;

However, I understand that all those branches won't be optimal when they run millions of times per second. I've seen some bit hacking on SO, but none seemed adequate. For example, I found many examples that convert a number to -1, 0, or 1 according to its sign. But how do I map -1 to 1, 1 to 2, and 0 to 4? I'm asking for the help of the Bit Hackers...

Thanks in advance

Update: First of all, thanks guys, you've been great. I'll test all of your code carefully for speed, and you'll be the first to know who's the winner.

@jalf: About your first piece of advice, I wasn't actually calculating CR0 on every instruction. I was instead keeping a lastResult variable and doing the comparison when (and if) a following instruction asked for a flag. The main motivations that took me back to the "every time" update:

  1. On PPC you're not forced to update CR0 like on x86 (where ADD always changes EFLAGS, even if not needed); you have two flavours of ADD, one of which updates CR0. If the compiler chooses the updating one, it means that it's going to use CR0 at some point, so there's no point in delaying...
  2. There's a particularly painful instruction called mtcrf that lets you change CR0 arbitrarily. You can even set it to 7, which has no arithmetic meaning... This just destroys the possibility of keeping a "lastResult" variable.

First, if this variable is to be updated after (nearly) every instruction, the obvious piece of advice is this:

don't

Only update it when the subsequent instructions need its value. At any other time, there's no point in updating it.
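As a rough illustration of that lazy approach, here is a minimal sketch. The names `LazyCr0`, `record`, `write` and `read` are made up for this example and do not come from the question. It assumes that arithmetic instructions only need to remember their result, and that an instruction writing CR0 directly can be handled with an override flag.

```cpp
#include <cstdint>

// Hypothetical lazy holder for CR0; all names are illustrative only.
struct LazyCr0 {
    std::int32_t  last_result = 0;   // result of the most recent CR0-updating instruction
    std::uint32_t forced = 0;        // value written directly (e.g. by an mtcrf-style instruction)
    bool          is_forced = false; // true while `forced` overrides `last_result`

    // Arithmetic instructions that update CR0 only store their result.
    void record(std::int32_t result) {
        last_result = result;
        is_forced = false;
    }

    // Instructions that write CR0 directly store an arbitrary value.
    void write(std::uint32_t value) {
        forced = value;
        is_forced = true;
    }

    // The comparison happens only when a later instruction reads CR0.
    std::uint32_t read() const {
        if (is_forced) return forced;
        if (last_result < 0) return 1;
        if (last_result > 0) return 2;
        return 4;
    }
};
```

Whether the extra flag check on reads costs less than computing CR0 eagerly depends on how often the guest code actually reads the register, so this is something to measure rather than assume.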

But anyway, when we update it, what we want is this behavior:

R < 0  => CR0 == 0b001 
R > 0  => CR0 == 0b010
R == 0 => CR0 == 0b100

Ideally, we won't need to branch at all. Here's one possible approach:

  1. Set CR0 to the value 1. (If you really want speed, investigate whether this can be done without fetching the constant from memory. Even if you have to spend a couple of instructions on it, it may well be worth it.)
  2. If R >= 0, left shift by one bit.
  3. If R == 0, left shift by one bit.

Steps 2 and 3 can be transformed to eliminate the "if" part:

CR0 <<= (R >= 0);
CR0 <<= (R == 0);

Is this faster? I don't know. As always, when you are concerned about performance, you need to measure, measure, measure.

However, I can see a few advantages of this approach:

  1. We avoid branches completely.
  2. We avoid memory loads/stores.
  3. The instructions we rely on (bit shifting and comparison) should have low latency, which isn't always the case for multiplication, for example.

The downside is that we have a dependency chain between all three lines: each modifies CR0, which is then used in the next line. This limits instruction-level parallelism somewhat.

To minimize this dependency chain, we could do something like this instead:

CR0 <<= ((R >= 0) + (R == 0));

so we only have to modify CR0 once, after its initialization.

Or, doing everything in a single line:

CR0 = 1 << ((R >= 0) + (R == 0));

Of course, there are a lot of possible variations on this theme, so go ahead and experiment.
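For experimenting, the single-line form can be wrapped in a small function. This is only a sketch: the name `cr0_shift` and the fixed-width types are assumptions made for the example, and "branch-free" describes the source, not necessarily the machine code a given compiler emits.

```cpp
#include <cstdint>

// Hypothetical wrapper around the one-line shift expression.
inline std::uint32_t cr0_shift(std::int32_t r) {
    // (r >= 0) and (r == 0) each evaluate to 0 or 1, so the shift count is
    // 0 for a negative value, 1 for a positive value and 2 for zero.
    return 1u << ((r >= 0) + (r == 0));
}
```

Checking the generated assembly is the only way to confirm that the comparisons became flag-setting instructions rather than jumps.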

Lots of answers that are approximately "don't" already, as usual :) You want the bit hack? You will get it. Then feel free to use it or not as you see fit.

You could use that mapping to -1, 0 and 1 (sign), and then do this:

return 7 & (0x241 >> ((sign(x) + 1) * 4));

This is essentially a tiny lookup table packed into a constant.
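To make the table layout visible, here is a sketch with the helper spelled out. The names `sign_of` and `cr0_lookup` are assumptions for this example; the answer only refers to a generic `sign` mapping.

```cpp
#include <cstdint>

// Assumed helper: returns -1, 0 or +1 depending on the sign of x.
inline int sign_of(std::int32_t x) {
    return (x > 0) - (x < 0);
}

inline std::uint32_t cr0_lookup(std::int32_t x) {
    // 0x241 packs three 4-bit entries:
    //   nibble 0 = 1 (negative), nibble 1 = 4 (zero), nibble 2 = 2 (positive).
    // (sign_of(x) + 1) * 4 selects the nibble; the mask keeps its low 3 bits.
    return 7u & (0x241u >> ((sign_of(x) + 1) * 4));
}
```

The cost of this variant is dominated by however the compiler implements `sign_of`, so it is no better than the sign trick it builds on.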

Or the "naive bithack":

int y = ((x >> 31) & 1) | ((-x >> 31) & 2);
return (~(-y >> 31) & 4) | y;

The first line maps x < 0 to 1, x > 0 to 2 and x == 0 to 0. The second line then maps y == 0 to 4 and y != 0 to y.

And of course it has a sneaky edge case for x = 0x80000000, which is mapped to 3. Oops. Well, let's fix that:

int y = ((x >> 31) & 1) | ((-x >> 31) & 2);
y &= 1 | ~(y << 1);  // remove the 2 if odd
return (~(-y >> 31) & 4) | y;
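The snippets above rely on arithmetic right shift of negative ints and on negating the most negative value, and the latter is signed overflow. A sketch of the same idea using only unsigned arithmetic, which wraps by definition, might look like this. The name `cr0_bithack` is made up, and a 32-bit value is assumed.

```cpp
#include <cstdint>

// Same mapping as the fixed bithack, restated with unsigned operations.
inline std::uint32_t cr0_bithack(std::int32_t x) {
    std::uint32_t ux = static_cast<std::uint32_t>(x);
    std::uint32_t neg = ux >> 31;                // 1 if x < 0
    std::uint32_t pos = ((0u - ux) >> 31) << 1;  // 2 if -x has its sign bit set
    std::uint32_t y = neg | pos;                 // 1, 2, 0, or 3 for 0x80000000
    y &= 1u | ~(y << 1);                         // remove the 2 if y is odd
    std::uint32_t nonzero = (0u - y) >> 31;      // 1 if y != 0, else 0
    return ((nonzero ^ 1u) << 2) | y;            // 4 when y == 0, otherwise y
}
```

It is a few more operations than the signed version, so treat it as a correctness reference rather than a speed candidate.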

The following expression is a little cryptic, but not excessively so, and it looks to be something the compiler can optimize pretty easily:

cr0 = 4 >> ((2 * (n < 0)) + (n > 0));

Here's what GCC 4.6.1 for an x86 target compiles it to with -O2:

xor     ecx, ecx
mov     eax, edx
sar     eax, 31
and     eax, 2
test    edx, edx
setg    cl
add     ecx, eax
mov     eax, 4
sar     eax, cl

And VC 2010 with /Ox looks pretty similar:

xor ecx, ecx
test eax, eax
sets cl
xor edx, edx
test eax, eax
setg dl
mov eax, 4
lea ecx, DWORD PTR [edx+ecx*2]
sar eax, cl

The version using if tests compiles to assembly that uses jumps with either of these compilers. Of course, you'll never really be sure what any particular compiler is going to do with whatever particular bit of code you choose unless you actually examine the output. My expression is cryptic enough that unless it was really a performance-critical bit of code, I might still go with the if statement version. Since you need to set the CR0 register frequently, I think it might be worth measuring whether this expression helps at all.
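If you do want to measure, a rough timing sketch could look like the following. Everything here is an assumption made for illustration: the names `cr0_branchy`, `cr0_shifted` and `time_ns`, and the choice of `std::chrono::steady_clock`. A loop like this measures throughput on whatever input pattern you feed it, which may not resemble real interpreter traffic.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

inline std::uint32_t cr0_branchy(std::int32_t n) {
    if (n < 0) return 1;
    if (n > 0) return 2;
    return 4;
}

inline std::uint32_t cr0_shifted(std::int32_t n) {
    return 4u >> ((2 * (n < 0)) + (n > 0));
}

// Runs f over every input and returns the elapsed nanoseconds. The sum of
// the results is written to `sink`, which gives the loop an observable
// result and makes it harder for the optimizer to drop the work.
template <typename F>
long long time_ns(F f, const std::vector<std::int32_t>& inputs, std::uint64_t& sink) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (std::int32_t n : inputs) sum += f(n);
    auto stop = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Comparing the two sinks also doubles as a quick sanity check that both variants agree on the data set used.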

GCC with no optimization:

        movl    %eax, 24(%esp)  ; eax has result of reading n
        cmpl    $0, 24(%esp)
        jns     .L2
        movl    $1, 28(%esp)
        jmp     .L3
.L2:
        cmpl    $0, 24(%esp)
        jle     .L4
        movl    $2, 28(%esp)
        jmp     .L3
.L4:
        movl    $4, 28(%esp)
.L3:

With -O2:

        movl    $1, %edx       ; edx = 1
        cmpl    $0, %eax
        jl      .L2            ; n < 0
        cmpl    $1, %eax       ; n < 1
        sbbl    %edx, %edx     ; edx = 0 or -1
        andl    $2, %edx       ; now 0 or 2
        addl    $2, %edx       ; now 2 or 4
.L2:
        movl    %edx, 4(%esp)

I don't think you are likely to do much better.

I was working on this one when my computer crashed.

int cr0 = (-(n | n-1) >> 31) & 6;
cr0 |= (n >> 31) & 5;
cr0 ^= 4;

Here's the resulting assembly (for Intel x86):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 18
        mov     ecx, DWORD PTR _n$[esp-4]
        lea     eax, DWORD PTR [ecx-1]
        or      eax, ecx
        neg     eax
        sar     eax, 31                                 ; 0000001fH
; Line 19
        sar     ecx, 31                                 ; 0000001fH
        and     eax, 6
        and     ecx, 5
        or      eax, ecx
; Line 20
        xor     eax, 4
; Line 22
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky

And a complete, exhaustive test, which is also reasonably suitable for benchmarking:

#include <limits.h>

int direct(int n)
{
    int cr0;
    if (n < 0)
        cr0 = 1;
    else if (n > 0)
        cr0 = 2;
    else
        cr0 = 4;
    return cr0;
}

const int shift_count = sizeof(int) * CHAR_BIT - 1;
int tricky(int n)
{
    int cr0 = (-(n | n-1) >> shift_count) & 6;
    cr0 |= (n >> shift_count) & 5;
    cr0 ^= 4;
    return cr0;
}

#include <iostream>
#include <iomanip>
int main(void)
{
    int i = 0;
    do {
        if (direct(i) != tricky(i)) {
            std::cerr << std::hex << i << std::endl;
            return i;
        }
    } while (++i);
    return 0;
}
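One caveat about the harness above: `while (++i)` relies on `int` wrapping around from INT_MAX, which is signed overflow and formally undefined behavior, even though it usually works in practice. A sketch of a comparison loop that avoids this by counting in a wider type might look like this. The name `first_mismatch` and the sentinel return value are assumptions made for the example.

```cpp
#include <cstdint>

// Compares f and g on every value in [lo, hi] and returns the first input
// where they differ, or hi + 1 if they agree everywhere. The 64-bit counter
// never overflows while walking the full 32-bit range.
template <typename F, typename G>
std::int64_t first_mismatch(F f, G g,
                            std::int64_t lo = INT32_MIN,
                            std::int64_t hi = INT32_MAX) {
    for (std::int64_t i = lo; i <= hi; ++i) {
        std::int32_t n = static_cast<std::int32_t>(i);  // always in range
        if (f(n) != g(n)) return i;
    }
    return hi + 1;  // sentinel: no mismatch found
}
```

Called with the default arguments it covers all 2^32 inputs, for example `first_mismatch(direct, tricky)` with the two functions from the listing above.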

If there is a faster method, the compiler is probably already using it.

Keep your code short and simple; that makes the optimizer most effective.

The simple, straightforward solution does surprisingly well speed-wise:

cr0 = n? (n < 0)? 1: 2: 4;

x86 assembly (produced by VC++ 2010, flags /Ox):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 26
        mov     eax, DWORD PTR _n$[esp-4]
        test    eax, eax
        je      SHORT $LN3@tricky
        xor     ecx, ecx
        test    eax, eax
        setns   cl
        lea     eax, DWORD PTR [ecx+1]
; Line 31
        ret     0
$LN3@tricky:
; Line 26
        mov     eax, 4
; Line 31
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky

For a completely unportable approach, I wonder if this might have any speed benefit:

void func(signed n, signed& cr0) {
    cr0 = 1 << (!(unsigned(n)>>31)+(n==0));
}

mov         ecx,eax  ;with MSVC10, all optimizations except inlining on.
shr         ecx,1Fh  
not         ecx  
and         ecx,1  
xor         edx,edx  
test        eax,eax  
sete        dl  
mov         eax,1  
add         ecx,edx  
shl         eax,cl  
mov         ecx,dword ptr [cr0]  
mov         dword ptr [ecx],eax  

Compared to your code on my machine:

test        eax,eax            ; if (n < 0)
jns         func+0Bh (401B1Bh)  
mov         dword ptr [ecx],1  ; cr0 = 1;
ret                            ; cr0 = 2; else cr0 = 4; }
xor         edx,edx            ; else if (n > 0)
test        eax,eax  
setle       dl  
lea         edx,[edx+edx+2]  
mov         dword ptr [ecx],edx ; cr0 = 2; else cr0 = 4; }
ret  

I don't know much at all about assembly, so I can't say for sure whether this would have any benefit (or even whether mine has any jumps; I see no instructions beginning with j, anyway). As always (and as everyone else has said a million times): PROFILE.

I doubt this is faster than, say, jalf's or Ben's, but I didn't see any answer that took advantage of the fact that on x86 all negative numbers have a certain bit set, so I figured I'd throw one out.

[EDIT] BenVoigt suggests cr0 = 4 >> ((n != 0) + (unsigned(n) >> 31)); to remove the logical negation, and my tests show that is a vast improvement.
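Written out as a function, that suggestion might look like the sketch below. The name `cr0_signbit` is made up, and `std::int32_t` is used here so that the 32-bit two's complement layout the trick depends on is explicit.

```cpp
#include <cstdint>

inline std::uint32_t cr0_signbit(std::int32_t n) {
    // Converting to unsigned first makes >> a logical shift, so this
    // extracts exactly the sign bit: 1 for negative n, 0 otherwise.
    std::uint32_t negative = static_cast<std::uint32_t>(n) >> 31;
    // Shift count: 0 for zero, 1 for positive, 2 for negative.
    return 4u >> ((n != 0) + negative);
}
```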

Here is my attempt.

int cro = 4 >> (((n > 0) - (n < 0)) % 3 + (n < 0)*3);
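Spelled out step by step, the expression seems to work as in the sketch below. The name `cr0_modulo` is an assumption. It relies on `%` truncating toward zero, which C++11 and later guarantee, so `-1 % 3` is `-1` and the `% 3` leaves -1, 0 and 1 unchanged.

```cpp
#include <cstdint>

inline std::uint32_t cr0_modulo(std::int32_t n) {
    int sign = (n > 0) - (n < 0);        // -1, 0 or +1
    int shift = sign % 3 + (n < 0) * 3;  // negative: -1 + 3 = 2, zero: 0, positive: 1
    return 4u >> shift;
}
```

Since the modulo is an identity on these three values, dropping it should give the same results; whether a compiler removes it on its own is worth checking in the generated code.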
