Is there a better way to detect bits set in a 16-byte array of flags?

Question

    ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };

    // Code that detects noise and sets noise_frame_flags omitted

    __m128i xmm0            = _mm_load_si128((__m128i*)noise_frame_flags);
    bool    isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);

    if (isNoiseToCancel)
        cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);

This is a code snippet from my AV capture tool on Linux. noise_frame_flags is an array of flags for 16-channel audio. For each channel, the corresponding byte is either 0 or 1, where 1 indicates that the channel has noise to cancel. For example, noise_frame_flags[0] == 1 means that the first channel's noise flag has been set (by the omitted code).

If even a single flag is set, I need to call cancelNoises, and this code seems to work fine for that. As you can see, I use _mm_load_si128 to load the whole, correctly aligned array of flags and then two _mm_extract_epi64 calls to extract the flags. My question: is there a better way to do this (using popcount, maybe)?

Note: ALIGNTO(16) is a macro that expands to the correct GCC equivalent but looks nicer.

Answer 1

Yes, you ultimately want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values by doing a 128-bit load and then extracting.

In asm you just want a mov load followed by an or (or add) with a memory source operand, which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.

In C++, use memcpy to do strict-aliasing-safe loads into uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).

add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not AMD). or can't fuse with branch instructions on any CPU. Since your values are 0 or 1, two non-zero values can never add up to zero; that wrap-around risk is the reason you'd normally use or in the general case.

(Some addressing modes may defeat micro- or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). I'm assuming it's about the same as cmp on my Skylake, except that it does write the destination, so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)

    uint64_t a, b;
    memcpy(&a, noise_frame_flags+0, sizeof(a));   // strict-aliasing-safe loads
    memcpy(&b, noise_frame_flags+8, sizeof(b));   // which optimize to MOV qword
    bool  isNoiseToCancel = a + b;   // equivalent to a | b  for bool inputs

This should compile to 3 asm instructions which will decode to 2 uops total, or 3 on AMD CPUs where JCC can only fuse with cmp or test.

union { alignas(16) uint8_t flags[16]; uint64_t chunks[2];}; would be safe in C99, but not in ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)

In C++11, you don't need a custom macro for ALIGNTO(16), just use alignas(16). It's also supported in C11 if you #include <stdalign.h>.
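Putting the memcpy loads and the add trick together, a minimal self-contained sketch could look like the following. The helper name any_noise_flag is hypothetical (it is not from the question), and the code assumes every flag byte is 0 or 1 so the sum of the two halves cannot wrap around to zero.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (name not from the original post): returns true if any
// of the 16 per-channel flag bytes is non-zero.
// Assumption: each byte is 0 or 1, so lo + hi is zero only when both halves are zero.
inline bool any_noise_flag(const std::uint8_t (&flags)[16]) {
    std::uint64_t lo, hi;
    std::memcpy(&lo, flags + 0, sizeof(lo));  // strict-aliasing-safe 8-byte load
    std::memcpy(&hi, flags + 8, sizeof(hi));  // second half of the array
    return (lo + hi) != 0;                    // same result as (lo | hi) != 0 for 0/1 bytes
}
```

A caller would declare the array as alignas(16) std::uint8_t noise_frame_flags[16] = {}; and call cancelNoises only when any_noise_flag(noise_frame_flags) returns true. If the bytes could hold arbitrary values, lo | hi would be the safe choice.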

Alternatives:

movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz: 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
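For reference, the ptest alternative can be written with the SSE4.1 intrinsic _mm_testz_si128. This is only a sketch: the helper name any_noise_flag_ptest is hypothetical, and the target attribute is a GCC/Clang extension used here so that the function builds without passing -msse4.1 for the whole file. It uses an unaligned load so the sketch does not depend on the caller's alignment.

```cpp
#include <cstdint>
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128

// Hypothetical helper (name not from the original post).
// Assumption: GCC or Clang on x86, and a CPU that supports SSE4.1.
__attribute__((target("sse4.1")))
static bool any_noise_flag_ptest(const std::uint8_t* flags) {
    // Load all 16 flag bytes at once.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(flags));
    // _mm_testz_si128(v, v) returns 1 when (v & v) is all zero, i.e. no flag is set.
    return !_mm_testz_si128(v, v);
}
```

As the uop counts above suggest, this is mainly of interest for comparison; for a 16-byte array the two scalar 64-bit loads are expected to be cheaper.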

PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1.) Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM, depending on how good their horizontal OR support is.

popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily. That is probably more efficient than using _mm_sad_epu8 against a zeroed register to sum the flags into two 8-byte halves.

   __m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
   vflags = _mm_slli_epi32(vflags, 7);
   unsigned flagmask = _mm_movemask_epi8(vflags);
   if (flagmask) {
       unsigned flagcount = __builtin_popcount(flagmask);  // popcnt with -march=nehalem or higher
       unsigned first_setflag = __builtin_ctz(flagmask);   // tzcnt if available, else BSF
       flagmask &= flagmask - 1;   // clear lowest set bit.  blsr if compiled with -march=haswell or bdver2 or newer.
      ...
   }

(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but it's generally good to enable all the ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
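To make the bitmap idea concrete, here is a small sketch that returns the indices of all channels whose flag is set. The helper name noisy_channels and the use of std::vector are illustrative assumptions, not part of the original code; it needs only SSE2 and relies on the GCC/Clang builtin __builtin_ctz. It also assumes each flag byte is 0 or 1, so shifting left by 7 moves the flag into the top bit of its own byte.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_slli_epi32, _mm_movemask_epi8
#include <vector>

// Hypothetical helper (name not from the original post): collects the indices
// of channels whose flag byte is 1.
// Assumptions: flag bytes are 0 or 1; compiler is GCC or Clang (for __builtin_ctz).
inline std::vector<int> noisy_channels(const std::uint8_t* flags) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(flags));
    v = _mm_slli_epi32(v, 7);  // move bit 0 of each byte up to bit 7
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(v));  // one bit per byte
    std::vector<int> out;
    while (mask != 0) {
        out.push_back(__builtin_ctz(mask));  // index of the lowest set flag
        mask &= mask - 1;                    // clear the lowest set bit
    }
    return out;
}
```

The loop runs once per set flag, so per-channel work can be dispatched without scanning all 16 bytes individually.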

Answer 2

Here's what I came up with for doing this:

#define VLEN 8
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));

// Constants for 128 or 256 bit registers
#if VLEN == 8
#define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
#else
#define V(a,b,c,d,e,f,g,h) a,b,c,d
#endif
#define SWAP128 V(4,5,6,7, 0,1,2,3)
#define SWAP64 V(2,3, 0,1,  6,7, 4,5)
#define SWAP32 V(1, 0,  3, 2,  5, 4,  7, 6)

static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x,x, SWAP128);
    x |= __builtin_shufflevector(x,x, SWAP64);
    x |= __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}

With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers.

This should compile to a single vptest instruction.

Answer 1: score 8, accepted, 2022-06-08 08:11:48
Answer 2: score 2, 2022-06-08 08:16:33