
Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.

Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm.)


I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++;
    }
}

The problem is that I need to do the above loop on tens of millions of masks and I need to finish in less than a second. Wonder if there is any way to speed it up, like using some sort of special assembly instruction that represents the above loop.

Currently I use gcc 4.8.4 on Ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code, and it takes about 2 seconds. Would love to make it run under 200ms.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}

Thanks in advance!

On my system, a 4-year-old MacBook (2.7 GHz Intel Core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.
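
For reference, the whole timed loop with that branchless inner body looks like this (a minimal sketch, reusing the pLong / target variables from the program above):

for (j = 0; j < 10000000; j++) {
    unsigned long v = pLong[j];
    for (i = 0; i < 64; i++) {
        target[i] += (v >> i) & 1;   // branchless: add the bit value itself, 0 or 1
    }
}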

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

Here is an improved version using this method. It runs in 45ms on my system.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}

The technique for inflating a byte to a 64-bit long is investigated and explained in this answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.
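
As a quick illustration of what the inflate table does (a standalone sketch using the same shift-and-mask trick as the loop above): bit k of the input byte becomes the value 0 or 1 in byte k of the result, so adding inflated values accumulates eight per-bit-position byte counters inside one uint64_t.

#include <stdint.h>
#include <stdio.h>

static uint64_t inflate_byte(unsigned i) {
    uint64_t x = i;
    x = x | (x << 28);
    x = x | (x << 14);
    return (x | (x << 7)) & 0x0101010101010101ULL;
}

int main(void) {
    /* 0xB1 = 0b10110001: bits 0, 4, 5 and 7 are set,
       so bytes 0, 4, 5 and 7 of the result are 1. */
    printf("%016llx\n", (unsigned long long)inflate_byte(0xB1));  /* 0100010100000001 */
    return 0;
}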

Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.

A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Related:

Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.


Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.

And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.


See Is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: unpacking bits to something narrower than the final count array gives you more elements per instruction. Bytes are efficient with SIMD, and you can then do up to 255 vertical paddb adds without overflow before unpacking to accumulate into the 32-bit counter array.

It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.

The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.

The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).
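
A minimal sketch of that pshufb-based unpack for a single 16-bit chunk of a mask (an assumption: SSSE3 is available, which any AVX CPU has; a real unrolled inner loop would load a whole vector of masks and shuffle it several different ways rather than broadcasting a scalar):

#include <immintrin.h>
#include <stdint.h>

// Turn the 16 bits of m into 16 bytes, each 0x00 or 0x01, ready for _mm_add_epi8.
static inline __m128i unpack16_to_bytes(uint16_t m)
{
    __m128i v = _mm_set1_epi16((short)m);                 // broadcast the 16-bit chunk
    const __m128i byte_sel = _mm_set_epi8(1,1,1,1, 1,1,1,1, 0,0,0,0, 0,0,0,0);
    __m128i rep = _mm_shuffle_epi8(v, byte_sel);          // each byte gets the source byte holding its bit
    const __m128i bit_sel = _mm_set1_epi64x(0x8040201008040201ULL);
    __m128i isolated = _mm_and_si128(rep, bit_sel);       // keep only this lane's bit
    return _mm_min_epu8(isolated, _mm_set1_epi8(1));      // non-zero -> 1, zero -> 0
}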


An even better strategy is to widen gradually

Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.

Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out of the top of an element anyway. With proper SIMD we'd lose those counts; with SWAR we'd corrupt the next element.

uint64_t x = *(input++);        // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;  // 0b...01010101;

uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits;            // or use ANDN before shifting to avoid a MOV copy

accum2_lo += lo;   // can do up to 3 iterations of this without overflow
accum2_hi += hi;   // because a 2-bit integer overflows at 4

Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32-bit and accumulate into the array in memory, because you'll run out of registers anyway, and this outer-outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize.)

Biggest downside: this doesn't auto-vectorize, unlike @njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from @njuffa's code.

But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only operations including shifts and branches.)

With good manual vectorization, this should be a factor of almost 2 or 4 faster.

But if you have to choose between this scalar version or @njuffa's with AVX2 autovectorization, @njuffa's is faster on Skylake with -march=native.

If building on a 32-bit target is possible/required, this suffers a lot (without vectorization, because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector registers of the same width).

// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i

void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
    const uint64_t *endp = input + length - 3*4*21;     // 252 masks per outer iteration
    while(input <= endp) {
        uint64_t accum8[8] = {0};     // 8-bit accumulators
        for (int k=0 ; k<21 ; k++) {
            uint64_t accum4[4] = {0};  // 4-bit accumulators can hold counts up to 15.  We use 4*3=12
            for(int j=0 ; j<4 ; j++){
                uint64_t accum2_lo=0, accum2_hi=0;
                for(int i=0 ; i<3 ; i++) {  // the compiler should fully unroll this
                    uint64_t x = *input++;    // load a new bitmask
                    const uint64_t even_1bits = 0x5555555555555555;
                    uint64_t lo = x & even_1bits; // 0b...01010101;
                    uint64_t hi = (x>>1) & even_1bits;  // or use ANDN before shifting to avoid a MOV copy
                    accum2_lo += lo;
                    accum2_hi += hi;   // can do up to 3 iterations of this without overflow
                }

                const uint64_t even_2bits = 0x3333333333333333;
                accum4[0] +=  accum2_lo       & even_2bits;  // 0b...001100110011;   // same constant 4 times, because we shift *first*
                accum4[1] += (accum2_lo >> 2) & even_2bits;
                accum4[2] +=  accum2_hi       & even_2bits;
                accum4[3] += (accum2_hi >> 2) & even_2bits;
            }
            for (int i = 0 ; i<4 ; i++) {
                accum8[i*2 + 0] +=   accum4[i] & 0x0f0f0f0f0f0f0f0f;
                accum8[i*2 + 1] +=  (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
            }
        }

        // char* can safely alias anything.
        unsigned char *narrow = (uint8_t*) accum8;
        for (int i=0 ; i<64 ; i++){
            target[i] += narrow[i];
        }
    }
    /* target[0] = bit 0
     * target[1] = bit 8
     * ...
     * target[8] = bit 1
     * target[9] = bit 9
     * ...
     */
    // TODO: 8x8 transpose
}

We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1 (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.

We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum).
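
For example, a byte-granularity right shift can be emulated with a 16-bit shift plus a mask (a small sketch with a fixed shift count of 2):

#include <immintrin.h>

// Emulate the missing _mm_srli_epi8(v, 2): shift 16-bit lanes,
// then clear the bits that leaked in from the neighbouring byte.
static inline __m128i srli_epi8_by2(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8((char)(0xFF >> 2)));
}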

Benchmark results on Arch Linux i7-6700k from @njuffa's test harness, with this added (Godbolt). N = (10000000 / (3*4*21)) * (3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252-iteration "unroll" factor), so my simplistic implementation is doing the same amount of work, not counting re-ordering target[], which it doesn't do, so it does print mismatch results. But the printed counts match another position of the reference array.

I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).

ref: the best bit-loop (next section)
fast: @njuffa's code (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop). gcc and clang fully unroll the inner 12 iterations.

  • gcc8.2 -O3 -march=sandybridge -fpie -no-pie
    ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
  • gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
    ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
  • clang7.0 -O3 -march=sandybridge -fpie -no-pie
    ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
  • clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
    ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs

-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but @njuffa's most, because more of it vectorizes (including its inner-most loop):

  • gcc8.2 -O3 -march=skylake -fpie -no-pie
    ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
  • gcc8.2 -O3 -march=skylake -fno-pie -no-pie
    ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs

  • clang7.0 -O3 -march=skylake -fpie -no-pie
    ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)

  • clang7.0 -O3 -march=skylake -fno-pie -no-pie
    ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs

Without AVX, just SSE4.2 (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.

  • gcc8.2 -O3 -march=skylake -fno-pie -no-pie
    ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
  • clang7.0 -O3 -march=skylake -fno-pie -no-pie
    ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs

-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.

I highlighted the best, but often they're within measurement-noise margin of each other. It's unsurprising that -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever un-laminate on Sandybridge-family CPUs.

But with gcc sometimes -fpie was faster; I didn't check, but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.


For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)
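
A sketch of just that pairwise-widening idiom (this is an assumption about how the final step could look, not code from this answer; the in-lane shuffle that puts bytes belonging to the same bit position next to each other is left out):

#include <immintrin.h>

// Widen byte counts to 32-bit by summing adjacent pairs twice:
// bytes -> 16-bit via pmaddubsw * 1, then 16-bit -> 32-bit via pmaddwd * 1.
static inline __m256i sum_pairs_to_dwords(__m256i byte_counts)
{
    __m256i w16 = _mm256_maddubs_epi16(byte_counts, _mm256_set1_epi8(1));
    return _mm256_madd_epi16(w16, _mm256_set1_epi16(1));
}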

TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.


Obsolete code: improvements to the bit-loop

Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35 sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.

(vs. 0.01 sec for @njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)

If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.

Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.

Anyway, one way to write this is target[i] += (pLong[j] & m) != 0;, using bool->int conversion to get a 0 / 1 integer.

But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left-shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.

An inner loop like this compiles slightly more efficiently (but still much, much worse than we can do with SSE2 or AVX, or even scalar using @chqrlie's lookup table, which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):

    for (int j = 0; j < 10000000; j++) {
#if 1  // extract low bit directly
        unsigned long long tmp = pLong[j];
        for (int i=0 ; i<64 ; i++) {   // while(tmp) could mispredict, but good for sparse data
            target[i] += tmp&1;
            tmp >>= 1;
        }
#else // bool -> int shifting a mask
        unsigned long m = 1;
        for (i = 0; i < 64; i++) {
            target[i]+= (pLong[j] & m) != 0;
            m = (m << 1);
        }
#endif
    }

Note that unsigned long is not guaranteed to be a 64-bit type: it isn't in the x86-64 System V x32 ABI (ILP32 in 64-bit mode) or on Windows x64, or in 32-bit ABIs like i386 System V.

Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uop in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.

# clang7.0 -O3 -march=sandybridge
.LBB1_2:                            # =>This Loop Header: Depth=1
   # outer loop loads a uint64 from the src
    mov     rdx, qword ptr [r14 + 8*rbx]
    mov     rsi, -256
.LBB1_3:                            #   Parent Loop BB1_2 Depth=1
                                    # do {
    mov     edi, edx
    and     edi, 1                              # isolate the low bit
    add     dword ptr [rsi + target+256], edi   # and += into target

    mov     edi, edx
    shr     edi
    and     edi, 1                              # isolate the 2nd bit
    add     dword ptr [rsi + target+260], edi

    shr     rdx, 2                              # tmp >>= 2;

    add     rsi, 8
    jne     .LBB1_3                       # } while(offset += 8 != 0);

This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.

If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.
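
A sketch of that sparse-friendly variant (an assumption about the shape, not something benchmarked here):

for (int j = 0; j < 10000000; j++) {
    uint64_t tmp = pLong[j];
    int i = 0;
    while (tmp) {              // stop as soon as no set bits remain; good if high bits are usually clear
        target[i] += tmp & 1;
        tmp >>= 1;
        i++;
    }
}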

On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU and the store-address+store-data; everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later.)

vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.

This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or, if you're just timing the count loop, then use anything you want, like rand().
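
For example, a scalar xorshift64* generator is plenty for this purpose (a sketch; this is not the SSE2 xorshift+ mentioned above, just any cheap 64-bit PRNG):

#include <stdint.h>

static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;   // any non-zero seed

static uint64_t xorshift64star(void)
{
    uint64_t x = rng_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    rng_state = x;
    return x * 0x2545F4914F6CDD1DULL;
}

/* fill: for (size_t j = 0; j < N; j++) pLong[j] = xorshift64star(); */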

One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, the second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfer the counts from the byte-wise accumulators into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:

/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms, is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:

p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs

where ref refers to the asker's original solution. The speed-up here is about a factor of 74. Different speed-ups will be observed with other (and especially newer) compilers.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

/*
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#define N          (10000000)
#define BITS       (64)
#define BLOCK_SIZE (255)

/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

int main (void) 
{
    double start_ref, stop_ref, start, stop;
    uint64_t *pLong;
    unsigned int target_ref [BITS] = {0};
    unsigned int target [BITS] = {0};
    int i, j;

    pLong = malloc (sizeof(pLong[0]) * N);
    if (!pLong) {
        printf("failed to allocate\n");
        return EXIT_FAILURE;
    }
    printf("p=%p\n", pLong);

    /* init data */
    for (j = 0; j < N; j++) {
        pLong[j] = KISS64;
    }

    /* count bits slowly */
    start_ref = second();
    for (j = 0; j < N; j++) {
        uint64_t m = 1;
        for (i = 0; i < BITS; i++) {
            if ((pLong[j] & m) == m) {
                target_ref[i]++;
            }
            m = (m << 1);
        }
    }
    stop_ref = second();

    /* count bits fast */
    start = second();
    for (j = 0; j < N / BLOCK_SIZE; j++) {
        sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
    }
    sum_block (pLong, target, j * BLOCK_SIZE, N);
    stop = second();

    /* check whether result is correct */
    for (i = 0; i < BITS; i++) {
        if (target[i] != target_ref[i]) {
            printf ("error @ %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
        }
    }

    /* print benchmark results */
    printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
    return EXIT_SUCCESS;
}

For starters, there is the problem of unpacking the bits, because seriously, you do not want to test each bit individually.

So just follow the following strategy for unpacking the bits into bytes of a vector: https://stackoverflow.com/a/24242696/2879325

Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.

After each block of 255, unpack again to 32-bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of the byte accumulators.)

At 8 instructions per bitmask (4 for each of the lower and upper 32-bit halves with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second per core on a recent CPU.


typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;

static inline __m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);    // we only use the low 32bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    const __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);

    const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: byte == 0 ? 0 : -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}

void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
    {
        __m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
        for (size_t inner = 0; inner < block_size; ++inner) {
            for (size_t j = 0; j < bits / 32; j++)
            {
                const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
                temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
            }
        }
        for (size_t j = 0; j < bits; j++)
        {
            accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
        }
    }
    for (size_t outer = count - (count % block_size); outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

Depending on the chosen compiler, the vectorized form achieved roughly a factor-of-25 speedup over the naive one.

On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.

Surprisingly, this is actually still 50% slower than the solution proposed by @njuffa.

See

Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus DR Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)

Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).

Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation can be repeated until registers become exhausted. Then the results in the registers are accumulated (as seen in most of the other answers).

Positional popcnt for 16-bit subwords is implemented here: https://github.com/mklarqvist/positional-popcount

// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry

Note: the accumulate step for positional popcnt is more expensive than for normal SIMD popcnt, which I believe makes it feasible to add a couple of half-adders to the end of the CSU; it might pay to go all the way up to 256 words before accumulating.
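
As a sketch of how that 3:2 compressor looks with intrinsics (an assumption: 256-bit AVX2 vectors, with the caller managing the "ones" / "twos" / ... partial sums as in the Harley-Seal approach from the papers above):

#include <immintrin.h>

// Carry-save full adder (3:2 compressor): five logic operations per 256-bit word.
static inline void csa256(__m256i *sum, __m256i *carry, __m256i a, __m256i b, __m256i c)
{
    __m256i u = _mm256_xor_si256(a, b);
    *sum   = _mm256_xor_si256(u, c);                  // per-bit sum of the three inputs
    *carry = _mm256_or_si256(_mm256_and_si256(a, b),  // per-bit carry out
                             _mm256_and_si256(u, c));
}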
