High Order Bits - Take them and make a uint64_t into a uint8_t
Let's say you have a uint64_t and care only about the high order bit of each byte in your uint64_t. Like so:
uint32_t: 0000 ... 1000 0000 1000 0000 1000 0000 1000 0000 ---> 0000 1111
Is there a faster way than:
return
(
((x >> 56) & 128)+
((x >> 49) & 64)+
((x >> 42) & 32)+
((x >> 35) & 16)+
((x >> 28) & 8)+
((x >> 21) & 4)+
((x >> 14) & 2)+
((x >> 7) & 1)
);
In other words, shifting x, masking, and adding the correct bit for each byte? This will compile to a lot of assembly and I'm looking for a quicker way... The machine I'm using only has up to SSE2 instructions and I failed to find helpful SIMD ops.
Thanks for the help.
As I mentioned in a comment, pmovmskb does what you want. Here's how you could use it:
MMX + SSE1:
movq mm0, input ; input can be r/m
pmovmskb output, mm0 ; output must be r
SSE2:
movq xmm0, input
pmovmskb output, xmm0
And I looked up the newer way:
BMI2:
mov rax, 0x8080808080808080
pext output, input, rax ; input must be r
return ((x & 0x8080808080808080) * 0x2040810204081) >> 56;
works. The & selects the bits you want to keep. The multiplication moves all the bits into the most significant byte, and the shift moves them to the least significant byte. Since multiplication is fast on most modern CPUs, this shouldn't be much slower than using assembly.
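To see why the multiplication is safe: the masked bit of byte i sits at position 8i+7, and the multiplier has bits at positions 0, 7, 14, ..., 49, so the partial product shifted by 49-7i lands on position 56+i. No two partial products fall on the same position, so nothing carries. A minimal sketch that checks this against a plain loop for all 256 high-bit patterns (the function names and the spread_bits helper are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Reference: same mapping as the shift-and-mask code in the question
   (high bit of byte i becomes bit i of the result). */
uint8_t high_bits_ref(uint64_t x)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= (uint8_t)(((x >> (8 * i + 7)) & 1) << i);
    return r;
}

/* Multiplication trick from the answer above. */
uint8_t high_bits_mul(uint64_t x)
{
    return (uint8_t)(((x & 0x8080808080808080ULL) * 0x0002040810204081ULL) >> 56);
}

/* Hypothetical helper: put bit i of m into the high bit of byte i. */
uint64_t spread_bits(unsigned m)
{
    uint64_t x = 0;
    for (int i = 0; i < 8; i++)
        if ((m >> i) & 1)
            x |= 0x80ULL << (8 * i);
    return x;
}

/* Try every high-bit pattern, with the low 7 bits of each byte set as noise. */
void check_mul(void)
{
    for (unsigned m = 0; m < 256; m++) {
        uint64_t x = spread_bits(m) | 0x7F7F7F7F7F7F7F7FULL;
        assert(high_bits_ref(x) == m);
        assert(high_bits_mul(x) == m);
    }
}
```

Calling check_mul() from main should pass silently if the reasoning above holds.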
And here's how to do it using SSE intrinsics:
#include <xmmintrin.h>
#include <inttypes.h>
#include <stdio.h>
int main (void)
{
uint64_t x
= 0b0000000010000000000000001000000000000000100000000000000010000000;
printf ("%x\n", _mm_movemask_pi8 ((__m64) x));
return 0;
}
Works fine with:
gcc -msse
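The program above goes through an MMX register (_mm_movemask_pi8). Since the question mentions SSE2, here is a rough sketch of the same idea on an XMM register, using _mm_loadl_epi64 and _mm_movemask_epi8 from <emmintrin.h>. It assumes a GCC or Clang style compiler that defines __SSE2__ when SSE2 is enabled; the function names and the portable fallback are only there for illustration.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

uint8_t high_bits_sse2(uint64_t x)
{
#if defined(__SSE2__)
    /* Like movq xmm0, [x]: load 64 bits into the low half, zero the upper half. */
    __m128i v = _mm_loadl_epi64((const __m128i *)&x);
    /* pmovmskb: one result bit per byte; bits 8..15 are 0 because the upper half is 0. */
    return (uint8_t)_mm_movemask_epi8(v);
#else
    /* Fallback for builds without SSE2: plain shifts and ORs. */
    x &= 0x8080808080808080ULL;
    return (uint8_t)((x >> 7) | (x >> 14) | (x >> 21) | (x >> 28) |
                     (x >> 35) | (x >> 42) | (x >> 49) | (x >> 56));
#endif
}

/* A few spot checks, including the example from the question. */
void check_sse2(void)
{
    assert(high_bits_sse2(0x0000000080808080ULL) == 0x0F);
    assert(high_bits_sse2(0x8000000000000000ULL) == 0x80);
    assert(high_bits_sse2(0x7F7F7F7F7F7F7F7FULL) == 0x00);
}
```

Unlike the MMX form, this one does not touch the x87/MMX register state.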
You don't need all the separate bitwise ANDs; you can simplify it to:
x &= 0x8080808080808080;
return (x >> 7) | (x >> 14) | (x >> 21) | (x >> 28) |
(x >> 35) | (x >> 42) | (x >> 49) | (x >> 56);
(assuming that the function return type is uint8_t).
You can also convert that to an unrolled loop:
uint8_t r = 0;
x &= 0x8080808080808080;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
return r;
I'm not sure which will perform better in practice, though I'd tend to bet on the first - the second might produce shorter code but with a long dependency chain.
First, you don't really need so many operations. You can act on more than one bit at a time:
x = (x >> 7) & 0x0101010101010101; // 0x0101010101010101
x |= x >> 28; // 0x????????11111111
x |= x >> 14; // 0x????????????5555
x |= x >> 7; // 0x??????????????FF
return x & 0xFF;
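A quick way to convince yourself that the three folding steps put every bit in the right place is to run all 256 high-bit patterns through it. This is only a sketch; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The folding version from above, wrapped in a function. */
uint8_t high_bits_fold(uint64_t x)
{
    x = (x >> 7) & 0x0101010101010101ULL; /* flag of byte i now at bit 8*i      */
    x |= x >> 28;                         /* bytes 4..7 land on bits 4,12,20,28 */
    x |= x >> 14;                         /* low 16 bits: flags on even bits    */
    x |= x >> 7;                          /* low 8 bits: flag of byte i at bit i */
    return (uint8_t)(x & 0xFF);
}

/* Try every high-bit pattern, with the low 7 bits of each byte set as noise. */
void check_fold(void)
{
    for (unsigned m = 0; m < 256; m++) {
        uint64_t x = 0x7F7F7F7F7F7F7F7FULL;
        for (int i = 0; i < 8; i++)
            if ((m >> i) & 1)
                x |= 0x80ULL << (8 * i);
        assert(high_bits_fold(x) == m);
    }
}
```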
An alternative is to use modulo to do sideways additions. The first thing to note is that x % n is the sum of the digits in base n+1 (reduced modulo n), so if n+1 is 2^k, you are adding groups of k bits. If you start with t = (x >> 7) & 0x0101010101010101 like above, you want to sum groups of 7 bits, thus t % 127 would be the solution. But t % 127 works only for results up to 126. 0x8080808080808080 and anything above will give an incorrect result. I've tried some corrections, none were easy.
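To make the limitation concrete: t is congruent modulo 127 to the wanted result, so every result from 0 to 126 comes out right, while 127 collapses to 0 and 128 to 1. A small sketch showing this (function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* The t % 127 idea exactly as described above. */
unsigned high_bits_mod127(uint64_t x)
{
    uint64_t t = (x >> 7) & 0x0101010101010101ULL;
    return (unsigned)(t % 127);
}

/* Counts how many of the 256 high-bit patterns give a wrong result. */
unsigned count_mod127_failures(void)
{
    unsigned bad = 0;
    for (unsigned m = 0; m < 256; m++) {
        uint64_t x = 0;
        for (int i = 0; i < 8; i++)
            if ((m >> i) & 1)
                x |= 0x80ULL << (8 * i);
        if (high_bits_mod127(x) != m)
            bad++;
    }
    return bad;
}

void check_mod127(void)
{
    assert(high_bits_mod127(0x0000000080808080ULL) == 0x0F); /* correct        */
    assert(high_bits_mod127(0x0080808080808080ULL) == 0);    /* wanted 127     */
    assert(high_bits_mod127(0x8000000000000000ULL) == 1);    /* wanted 128     */
}
```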
Using modulo to get to the point where only the last step of the previous algorithm remains is possible, though. What we want is to keep the two least significant bits, and then have the sum of the others, grouped by 14. So
ull t = (x & 0x8080808080808080) >> 7;
ull u = (t & 3) | (((t>>2) % 0x3FFF) << 2);
return (u | (u>>7)) & 0xFF;
But t>>2 is t/4 and << 2 is multiplying by 4. And we have (a % b)*c == (a*c) % (b*c), thus (((t>>2) % 0x3FFF) << 2) is (t & ~3) % 0xFFFC. But we also have the fact that a + b%c == (a+b)%c if it is less than c. So we have simply u = t % 0xFFFC. Giving:
ull t = ((x & 0x8080808080808080) >> 7) % 0xFFFC;
return (t | (t>>7)) & 0xFF;
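As a sanity check on the derivation, the two-step form and the single modulo can be compared directly, and the final function tested on all 256 high-bit patterns. A minimal sketch, with made-up function names and uint64_t standing in for ull:

```c
#include <assert.h>
#include <stdint.h>

/* Intermediate form: keep the low two bits, sum the rest in groups of 14. */
uint64_t u_two_step(uint64_t x)
{
    uint64_t t = (x & 0x8080808080808080ULL) >> 7;
    return (t & 3) | (((t >> 2) % 0x3FFF) << 2);
}

/* Simplified form: a single modulo by 0xFFFC. */
uint64_t u_single_mod(uint64_t x)
{
    return ((x & 0x8080808080808080ULL) >> 7) % 0xFFFC;
}

/* Final function from the answer above. */
uint8_t high_bits_mod(uint64_t x)
{
    uint64_t t = u_single_mod(x);
    return (uint8_t)((t | (t >> 7)) & 0xFF);
}

/* Try every high-bit pattern, with the low 7 bits of each byte set as noise. */
void check_mod(void)
{
    for (unsigned m = 0; m < 256; m++) {
        uint64_t x = 0x7F7F7F7F7F7F7F7FULL;
        for (int i = 0; i < 8; i++)
            if ((m >> i) & 1)
                x |= 0x80ULL << (8 * i);
        assert(u_two_step(x) == u_single_mod(x));
        assert(high_bits_mod(x) == m);
    }
}
```

Keep in mind that a 64-bit modulo by a constant is usually compiled to a multiplication plus a few shifts, so this is unlikely to beat the multiplication trick; measure before relying on it.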
This seems to work, at least as long as the result is at most 126:
return (x & 0x8080808080808080) % 127;