
High Order Bits - Take them and make a uint64_t into a uint8_t

Let's say you have a uint64_t and care only about the high order bit of each byte in it. Like so:

uint32_t: 0000 ... 1000 0000 1000 0000 1000 0000 1000 0000 ---> 0000 1111

Is there a faster way than:

   return
   (
     ((x >> 56) & 128)+
     ((x >> 49) &  64)+
     ((x >> 42) &  32)+
     ((x >> 35) &  16)+
     ((x >> 28) &   8)+
     ((x >> 21) &   4)+
     ((x >> 14) &   2)+
     ((x >>  7) &   1)
   )

Aka shifting x, masking, and adding the correct bit for each byte? This compiles to a lot of assembly and I'm looking for a quicker way. The machine I'm using only supports instructions up to SSE2, and I failed to find helpful SIMD ops.

Thanks for the help.

As I mentioned in a comment, pmovmskb does what you want. Here's how you could use it:

MMX + SSE1:

movq mm0, input ; input can be r/m
pmovmskb output, mm0 ; output must be r

SSE2:

movq xmm0, input
pmovmskb output, xmm0
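From C, the same instruction can be reached through intrinsics instead of hand-written assembly. The following is a minimal sketch, assuming an x86-64 target where SSE2 is available. The function names `high_bits_ref`, `high_bits_sse2` and `check_sse2` are made up for illustration, and on other targets the sketch simply falls back to a portable loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Portable reference: collect bit 7 of each byte, lowest byte first. */
static uint8_t high_bits_ref(uint64_t x)
{
    uint8_t r = 0;
    for (int k = 0; k < 8; k++)
        r |= (uint8_t)(((x >> (8 * k + 7)) & 1u) << k);
    return r;
}

/* Hypothetical wrapper: reaches pmovmskb through _mm_movemask_epi8. */
static uint8_t high_bits_sse2(uint64_t x)
{
#if defined(__SSE2__) && defined(__x86_64__)
    /* Move x into the low half of an XMM register; the upper half is zeroed. */
    __m128i v = _mm_cvtsi64_si128((long long)x);
    /* One result bit per byte; the upper 8 of the 16 bits are zero here. */
    return (uint8_t)_mm_movemask_epi8(v);
#else
    return high_bits_ref(x);   /* fallback when SSE2 is not available */
#endif
}

/* Small self-check on a few hand-picked inputs. */
static void check_sse2(void)
{
    assert(high_bits_sse2(0) == 0x00);
    assert(high_bits_sse2(0x0000000080808080ULL) == 0x0F);
    assert(high_bits_sse2(0x8080808080808080ULL) == 0xFF);
    assert(high_bits_sse2(0x7F7F7F7F7F7F7F7FULL) == 0x00);
    assert(high_bits_sse2(0x8000000000000001ULL) == 0x80);
    assert(high_bits_sse2(0x123456789ABCDEF0ULL) ==
           high_bits_ref(0x123456789ABCDEF0ULL));
}
```

Calling `check_sse2()` once at startup is enough to confirm that the intrinsic path and the portable loop agree on these inputs.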

And I looked up the new way:

BMI2:

mov rax, 0x8080808080808080
pext output, input, rax ; input must be r

return ((x & 0x8080808080808080) * 0x2040810204081) >> 56;

works. The & selects the bits you want to keep. The multiplication moves all the bits into the most significant byte, and the shift moves them to the least significant byte. Since multiplication is fast on most modern CPUs this shouldn't be much slower than using assembly.
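To see why that constant gathers the bits, note that 0x2040810204081 has bits set at positions 0, 7, 14, ..., 49. The masked bit of byte k sits at position 8k+7, and multiplying it by 2^(7(7-k)) moves it to position 56+k. The sketch below wraps the expression in a function and checks it against all 256 possible results; the names `high_bits_mul` and `check_mul` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around the mask-multiply-shift expression. */
static uint8_t high_bits_mul(uint64_t x)
{
    const uint64_t mask  = 0x8080808080808080ULL; /* bit 7 of every byte */
    const uint64_t magic = 0x0002040810204081ULL; /* bits 0, 7, 14, ..., 49 */
    /* Bit 8k+7 times 2^(7*(7-k)) lands on bit 56+k. The remaining partial
       products fall below bit 56 or overflow past bit 63, and no two of
       them share a position, so nothing carries into the top byte. */
    return (uint8_t)(((x & mask) * magic) >> 56);
}

/* Build an input for every 8-bit pattern p and compare. */
static void check_mul(void)
{
    for (unsigned p = 0; p < 256; p++) {
        uint64_t x = 0;
        for (int k = 0; k < 8; k++) {
            /* 0xD5 has its high bit set, 0x55 does not; the low seven
               bits are noise that the mask must discard. */
            uint64_t byte = ((p >> k) & 1u) ? 0xD5 : 0x55;
            x |= byte << (8 * k);
        }
        assert(high_bits_mul(x) == p);
    }
}
```

The noise in the low seven bits of each byte is deliberate: it shows that the mask has to be applied before the multiplication.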

And here's how to do it using SSE intrinsics:

#include <xmmintrin.h>
#include <inttypes.h>
#include <stdio.h>

int main (void)
{
  uint64_t x
  = 0b0000000010000000000000001000000000000000100000000000000010000000;

  printf ("%x\n", _mm_movemask_pi8 ((__m64) x));
  return 0;
}

Works fine with:

gcc -msse

You don't need all the separate bitwise ANDs; you can simplify it to:

x &= 0x8080808080808080;
return (x >>  7) | (x >> 14) | (x >> 21) | (x >> 28) |
       (x >> 35) | (x >> 42) | (x >> 49) | (x >> 56);

(assuming that the function return type is uint8_t).

You can also convert that to an unrolled loop:

uint8_t r = 0;

x &= 0x8080808080808080;

x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
x >>= 7; r |= x;
return r;

I'm not sure which will perform better in practice, though I'd tend to bet on the first - the second might produce shorter code but with a long dependency chain.

First, you don't really need so many operations. You can act on more than one bit at a time:

x = (x >> 7) & 0x0101010101010101; // 0x0101010101010101
x |= x >> 28;                      // 0x????????11111111
x |= x >> 14;                      // 0x????????????5555
x |= x >>  7;                      // 0x??????????????FF
return x & 0xFF;

An alternative is to use modulo to do sideways additions. The first thing to note is that x % n is the sum of the digits of x in base n+1 (taken modulo n), so if n+1 is 2^k, you are adding groups of k bits. If you start with t = (x >> 7) & 0x0101010101010101 like above, you want to sum groups of 7 bits, thus t % 127 would be the solution. But t % 127 works only for results up to 126. 0x8080808080808080 and anything above will give an incorrect result. I've tried some corrections, none were easy.
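The limit is easy to reproduce. Since 2^7 is congruent to 1 modulo 127, the bit of byte k (at position 8k in t) contributes 2^k for k = 0..6, but the bit of byte 7 contributes 1 again and collides with byte 0. A small sketch of that behaviour follows; the names `high_bits_mod127` and `check_mod127` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around the t % 127 idea. */
static uint8_t high_bits_mod127(uint64_t x)
{
    /* Bit 8k of t holds the high bit of byte k. */
    uint64_t t = (x >> 7) & 0x0101010101010101ULL;
    /* 2^(8k) == 2^k (mod 127) for k = 0..6, but 2^56 == 1 (mod 127). */
    return (uint8_t)(t % 127);
}

static void check_mod127(void)
{
    /* Correct while the expected result is at most 126. */
    assert(high_bits_mod127(0x0000000080808080ULL) == 0x0F);
    assert(high_bits_mod127(0x0080808080808000ULL) == 0x7E);

    /* Expected 0x7F, but 127 % 127 wraps around to 0. */
    assert(high_bits_mod127(0x0080808080808080ULL) == 0x00);

    /* Expected 0x80, but byte 7 contributes 1, just like byte 0. */
    assert(high_bits_mod127(0x8000000000000000ULL) == 0x01);
}
```

The last two assertions pin down the wrong answers on purpose, to show exactly where the plain % 127 approach stops working.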

What did turn out to be possible is using modulo to get to the point where only the last step of the previous algorithm remains. What we want is to keep the two least significant bits, and then have the sum of the other ones, grouped by 14 bits. So:

ull t = (x & 0x8080808080808080) >> 7;
ull u = (t & 3) | (((t>>2) % 0x3FFF) << 2);
return (u | (u>>7)) & 0xFF;

But t>>2 is t/4 and << 2 is multiplying by 4. And we have (a % b)*c == (a*c) % (b*c), thus (((t>>2) % 0x3FFF) << 2) is (t & ~3) % 0xFFFC. We also have the fact that a + b%c == (a+b)%c if a + b%c is less than c. So we simply have u = t % 0xFFFC. Giving:

ull t = ((x & 0x8080808080808080) >> 7) % 0xFFFC;
return (t | (t>>7)) & 0xFF;
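As a sanity check on the derivation, the two-step form and the collapsed form can be compared over every possible result. This is only a sketch: the function names are made up, and `ull` from the snippets above is assumed to be a 64-bit unsigned type, written here as `uint64_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Two-step form: keep the low two bits, fold the rest in 14-bit groups. */
static uint8_t high_bits_mod_two_step(uint64_t x)
{
    uint64_t t = (x & 0x8080808080808080ULL) >> 7;
    uint64_t u = (t & 3) | (((t >> 2) % 0x3FFF) << 2);
    return (uint8_t)((u | (u >> 7)) & 0xFF);
}

/* Collapsed form: a single modulo by 0xFFFC. */
static uint8_t high_bits_mod_fffc(uint64_t x)
{
    uint64_t t = ((x & 0x8080808080808080ULL) >> 7) % 0xFFFC;
    return (uint8_t)((t | (t >> 7)) & 0xFF);
}

/* Build an input for every 8-bit pattern p and compare both forms. */
static void check_mod_fffc(void)
{
    for (unsigned p = 0; p < 256; p++) {
        uint64_t x = 0;
        for (int k = 0; k < 8; k++) {
            /* High bit set (0xD5) or clear (0x55), plus noise below it. */
            uint64_t byte = ((p >> k) & 1u) ? 0xD5 : 0x55;
            x |= byte << (8 * k);
        }
        assert(high_bits_mod_two_step(x) == p);
        assert(high_bits_mod_fffc(x) == p);
    }
}
```

For the all-ones case, t is 0x0101010101010101 and t % 0xFFFC comes out as 0x5555, which the final OR with the shifted copy turns into 0xFF.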

This seems to work:

return (x & 0x8080808080808080) % 127;
