Fast integer division and modulo with a const runtime divisor

Question

int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<char> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.emplace_back(x, y);
    if (selectedPairs.size() == n_attrs / 2) break;
}

I have a function that looks like this. The bottleneck is in

    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;

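n_attrs is constant during the loop, so I hope to find a way to speed this loop up. corr_indexes[i], n_attrs > 0, < max_int32. Edit: please note that n_attrs is not a compile-time constant.

How can I optimize this loop? No extra libraries are allowed. Also, is there any way to parallelize this loop (either CPU or GPU is fine; everything is already in GPU memory before this loop)?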

Answer 1

I will restrict my optimization to the part after // optimize the code below:

Take n_attrs
Generate a function string like this:

void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
    const int npairs = @@ * @@;
    selectedPairs.clear();
    for (int i = 0; i < npairs; i++) {
        const int x = corr_indexes[i] / @@;
        const int y = corr_indexes[i] % @@;
        if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
        // below lines are called max 2500 times, so they're insignificant
        selected[x] = true;
        selected[y] = true;
        selectedPairs.emplace_back(x, y);
        if (selectedPairs.size() == @@ / 2) 
            break;
    }
}

Replace all @@ with the value of n_attrs
Compile it, generate a DLL
Link and call the function
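
The substitution step in the list above could be sketched roughly as follows. This is only an illustration: instantiate_source is a hypothetical helper name, and compiling and loading the resulting DLL is platform-specific and deliberately left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answer): replaces every "@@" placeholder
// in the source template with the decimal value of n_attrs.
std::string instantiate_source(std::string source, int n_attrs) {
    const std::string placeholder = "@@";
    const std::string value = std::to_string(n_attrs);
    std::size_t pos = 0;
    while ((pos = source.find(placeholder, pos)) != std::string::npos) {
        source.replace(pos, placeholder.size(), value);
        pos += value.size();  // continue searching after the inserted digits
    }
    return source;
}
```

The returned string is what would be handed to the compiler in the next step.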

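So n_attrs is a compile-time constant for the DLL, and the compiler can automatically apply most of its optimizations to the value, such as:

doing n&(x-1) instead of n%x when x is a power of 2
shifts and multiplications instead of division
maybe other optimizations too, like unrolling the loop with precomputed x and y indices (since x is known)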
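
As a small illustration of the first item, here is what the power-of-two case amounts to for unsigned operands. The function names are made up for this sketch; a compiler does the equivalent internally when the divisor is a known constant.

```cpp
#include <cstdint>

// True when x is a power of two (nonzero with exactly one bit set).
bool is_power_of_two(std::uint32_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

// n % x computed with a mask instead of a division.
// Only valid when is_power_of_two(x) holds.
std::uint32_t mod_pow2(std::uint32_t n, std::uint32_t x) {
    return n & (x - 1);
}
```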

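Some integer math operations in tight loops are easier for the compiler to SIMDify/vectorize when more of the parts are known at compile time.

If your CPU is AMD, you can even try magic floating-point operations in place of the unknown/unknown division to get vectorization.

By caching all (or most) values of n_attrs, you can get rid of the latencies of:

string generation
compilation
file (DLL) reading (assuming some object-oriented wrapping of DLLs)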
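
One possible shape for such a cache is sketched below. KernelCache, Kernel and the builder callback are all hypothetical names; the builder stands in for the generate/compile/load steps, and the int(int) signature is just a placeholder for whatever the generated function really takes.

```cpp
#include <functional>
#include <unordered_map>
#include <utility>

// Placeholder signature for a generated function (an assumption of this sketch).
using Kernel = std::function<int(int)>;

// Hypothetical cache: builds the kernel for a given n_attrs at most once.
class KernelCache {
public:
    explicit KernelCache(std::function<Kernel(int)> builder)
        : builder_(std::move(builder)) {}

    const Kernel& get(int n_attrs) {
        auto it = cache_.find(n_attrs);
        if (it == cache_.end()) {
            // Cache miss: pay the build cost once and remember the result.
            it = cache_.emplace(n_attrs, builder_(n_attrs)).first;
        }
        return it->second;
    }

private:
    std::function<Kernel(int)> builder_;
    std::unordered_map<int, Kernel> cache_;
};
```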

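If the part to be optimized will run on the GPU, there is a high chance that the CUDA/OpenCL implementation already performs the integer division by means of floating point (to keep the SIMD path occupied instead of being serialized on integer division), or is capable of doing it directly as SIMD integer operations, so you may just use it as it is on the GPU, except for std::vector, which is not supported by all C++ CUDA compilers (and not in OpenCL kernels). These host-environment-related parts can be computed after the kernel is executed (with the emplace_back parts excluded or exchanged for a struct that works on the GPU).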

Answer 2

As already suggested, the compiler (GNU 10.2 in my case) seems to have some heuristics for modulo or division by a constant. Here is a simple example:

$ cat t12.cpp
#include <vector>
#include <cstdlib>

void test(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected, int n_attrs){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

template <int n_attrs>
void ttest(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

int main(int argc, char *argv[]){

  int n_attrs = 3;
  if (argc > 1) n_attrs = atoi(argv[1]);
  std::vector<int> corr_indexes(n_attrs*n_attrs, 3);
  std::vector<std::pair<int,int> > selectedPairs;
  std::vector<bool> selected(n_attrs, true);
  // could replace following with a switch-case jump-table
  if (n_attrs == 10000) ttest<10000>(selectedPairs, corr_indexes, selected);
  else test(selectedPairs, corr_indexes, selected, n_attrs);
}
$ g++ -O3 t12.cpp -o t12
$ time ./t12 9999

real    0m0.494s
user    0m0.439s
sys     0m0.055s
$ time ./t12 10000

real    0m0.310s
user    0m0.259s
sys     0m0.051s
$

(CPU: i5-4690K)

If you combine that with the boost preprocessor magic suggested here to build a switch-case jump table, you might get some benefit/speedup.
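
A hand-written version of such a dispatch, without any preprocessor tricks, might look like the sketch below. The function names and the three listed cases are arbitrary choices for illustration; a real table would cover whichever n_attrs values actually occur.

```cpp
#include <utility>
#include <vector>

using Pairs = std::vector<std::pair<int, int>>;

// Generic loop: the divisor is only known at run time.
void split_all(const std::vector<int>& indexes, int n_attrs, Pairs& out) {
    out.clear();
    for (int v : indexes) out.emplace_back(v / n_attrs, v % n_attrs);
}

// Specialized loop: N is a compile-time constant, so the compiler is free
// to replace the division and modulo with multiply/shift sequences.
template <int N>
void split_all_fixed(const std::vector<int>& indexes, Pairs& out) {
    out.clear();
    for (int v : indexes) out.emplace_back(v / N, v % N);
}

// The switch runs once per call, outside the hot loop.
// Values without a dedicated case fall back to the generic loop.
void split_all_dispatch(const std::vector<int>& indexes, int n_attrs, Pairs& out) {
    switch (n_attrs) {
        case 100:  split_all_fixed<100>(indexes, out);  break;
        case 1000: split_all_fixed<1000>(indexes, out); break;
        case 5000: split_all_fixed<5000>(indexes, out); break;
        default:   split_all(indexes, n_attrs, out);    break;
    }
}
```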

Answer 3

I am restricting my comments to integer division, because to first order the modulo operation in C++ can be viewed and implemented as an integer division plus back-multiply and subtraction, although in some cases there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2ⁿ.
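
To make the "division plus back-multiply and subtraction" view concrete, here is a minimal sketch. DivMod and divmod are names invented for this example; it assumes d != 0 and that n / d does not overflow (i.e. not INT32_MIN / -1).

```cpp
#include <cstdint>

struct DivMod {
    std::int32_t quot;
    std::int32_t rem;
};

// Derives the remainder from the quotient with one multiply and one
// subtraction, so only a single division is needed.
DivMod divmod(std::int32_t n, std::int32_t d) {
    const std::int32_t q = n / d;      // truncating division, as in C++
    const std::int32_t r = n - q * d;  // same value as n % d
    return {q, r};
}
```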

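Integer division is slow on most platforms, being based on either software emulation or an iterative hardware implementation. But it was widely reported last year that, based on microbenchmarking, Apple's M1 has a very fast integer division, presumably by using dedicated circuitry.

Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago, it has been widely known how to replace integer divisions with constant divisors by an integer multiply plus possibly a shift and/or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputing some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.

Torbjörn Granlund and Peter L. Montgomery, "Division by Invariant Integers Using Multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).

At present, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constants. The precomputation happens at compile time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a precomputation block before the loop to compute the necessary parameters, and then using those in the division emulation code inside the loop.

If one's toolchain does not optimize divisions with run-time constant divisors, one can use the same approach manually, as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at the C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.

The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that, by investing more time in the design than I was willing to spend on this answer, more efficient implementations of both the parameter precomputation and the emulation can be found.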

#include <cstdio>
#include <cstdlib>
#include <cstdint>

#define PORTABLE  (1)
#define ADD_FLAG  (1)
#define NEG_FLAG  (2)

uint32_t ilog2 (uint32_t i)
{
    uint32_t t = 0;
    i = i >> 1;
    while (i) {
        i = i >> 1;
        t++;
    }
    return (t);
}

/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers 
   using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, page 61.
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &shift, 
                    int32_t &flags)
{
    uint32_t d, i;
    uint64_t m_lower, m_upper, j, k, msb;

    d = (uint32_t)llabs (divisor);
    i = ilog2 (d);
    msb = (((uint64_t)(1)) << (32 + i));
    j = (((uint64_t)(0x80000000)) % ((uint64_t)(d)));
    k = msb / ((uint64_t)(0x80000000 - j));
    m_lower = msb / d;
    m_upper = (msb + k) / d;
    while (((m_lower >> 1) < (m_upper >> 1)) && (i > 0)) {
        m_lower = m_lower >> 1;
        m_upper = m_upper >> 1;
        i--;
    }
    multiplier = (uint32_t)(m_upper);
    shift = i;
    flags = ((m_upper >> 31) ? ADD_FLAG : 0) | ((divisor < 0) ? NEG_FLAG : 0);
}

int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
    uint32_t mask_msb = 0x80000000;
    uint32_t ua = (uint32_t)a;
    ua = ua >> s;
    mask_msb = mask_msb >> s;
    return (int32_t)((ua ^ mask_msb) - mask_msb);
}

int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t shift, 
                        int32_t flags)
{
    int64_t prod = ((int64_t)dividend) * multiplier;
    int32_t quot = (int32_t)(((uint64_t)prod) >> 32);
    if (flags & ADD_FLAG) quot = (uint32_t)quot + (uint32_t)dividend;
#if PORTABLE
    quot = arithmetic_right_shift (quot, shift);
#else // PORTABLE
    quot = quot >> shift;  // must use arithmetic right shift
#endif // PORTABLE
    quot = quot + ((uint32_t)dividend >> 31);
    if (flags & NEG_FLAG) quot = -quot;
    return quot;
}

int main (void)
{
    int32_t multiplier;
    int32_t shift;
    int32_t flags;
    int32_t divisor;
    
    for (divisor = -10; divisor <= 10; divisor++) {
        /* avoid division by zero; the for loop's increment moves on to 1 */
        if (divisor == 0) continue;
        printf ("divisor=%d\n", divisor);
        prepare_magic (divisor, multiplier, shift, flags);
        printf ("multiplier=%d shift=%d flags=%d\n", 
                multiplier, shift, flags);
        printf ("exhaustive test of dividends ... ");
        uint32_t dividendu = 0;
        do {
            int32_t dividend = (int32_t)dividendu;
            /* avoid overflow in signed integer division */
            if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
                dividendu++;
                continue;
            }
            int32_t res = magic_division (dividend, multiplier, shift, flags);
            int32_t ref = dividend / divisor;
            if (res != ref) {
                printf ("\nERR dividend=%d (%08x) divisor=%d  res=%d  ref=%d\n",
                        dividend, (uint32_t)dividend, divisor, res, ref);
                return EXIT_FAILURE;
            }
            dividendu++;
        } while (dividendu);
        printf ("PASSED\n");
    }
    return EXIT_SUCCESS;
}

Answer 4

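How can I optimize this loop?

This is a perfect use case for libdivide. This library is designed to speed up division by constants at run time by using the strategy compilers use at compile time. The library is header-only, so it does not create any run-time dependency. It also supports vectorized division (i.e. using SIMD instructions), which can definitely be used in this case to speed up the computation drastically; compilers cannot do that without significantly changing the loop (and in the end it would not be as efficient because of the run-time-defined divisor). Note that the license of libdivide is very permissive (zlib), so you can easily include it in your project without strong constraints (you basically just need to mark it as modified if you change it).

If header-only libraries are not OK, then you need to reinvent the wheel. The idea is to transform a division by a constant into a sequence of shifts and multiplications. The very good answer by @njuffa specifies how to do that. You can also read the code of libdivide, which is highly optimized.

For small positive divisors and small positive dividends, there is no need for a long sequence of operations. You can cheat with a basic sequence: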

uint64_t dividend = corr_indexes[i]; // Must not be too big
uint64_t divider = n_attrs;
uint64_t magic_factor = 4294967296 / n_attrs + 1; // Must be precomputed once
uint32_t result = (dividend * magic_factor) >> 32;

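This method should be safe for uint16_t dividends/divisors, but it is not for bigger values. In practice, it fails for dividend values above ~800_000. Bigger dividends require a more complex sequence, which is also generally slower.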
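
Wrapped up so that the precomputation happens once, the basic sequence could look like this sketch. SmallDivider, make_small_divider, fast_div and fast_mod are hypothetical names; the code assumes d >= 1 and is only meant for dividends and divisors that fit in 16 bits, as stated above.

```cpp
#include <cstdint>

// Precomputed once per divisor (hypothetical wrapper, not part of any library).
struct SmallDivider {
    std::uint64_t magic;    // floor(2^32 / d) + 1
    std::uint32_t divisor;  // kept for the remainder computation
};

SmallDivider make_small_divider(std::uint32_t d) {  // requires d >= 1
    return {4294967296ULL / d + 1, d};
}

// Quotient via one 64-bit multiply and one shift.
std::uint32_t fast_div(std::uint32_t n, const SmallDivider& s) {
    return static_cast<std::uint32_t>((n * s.magic) >> 32);
}

// Remainder by multiplying the quotient back and subtracting.
std::uint32_t fast_mod(std::uint32_t n, const SmallDivider& s) {
    return n - fast_div(n, s) * s.divisor;
}
```

In the loop from the question, make_small_divider(n_attrs) would be called once before the loop, and fast_div/fast_mod would replace / and % inside it, provided the values stay in the safe range.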

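Is there any way to parallelize this loop?

Only the division/modulo can be safely parallelized. There is a loop-carried dependency in the rest of the loop that prevents any parallelization (unless additional assumptions are made). Thus, the loop can be split into two parts: one computes the divisions and puts the uint16_t results in a temporary array, which is then processed serially. The array must not be too big, since the computation would otherwise be memory-bound and the resulting parallel code could be slower than the current one. Thus, you need to operate on small chunks that at least fit in the L3 cache. If the chunks are too small, then thread synchronization can also be an issue. The best solution is certainly to use a rolling window of chunks. All of this is certainly a bit tedious/tricky to implement.

Note that SIMD instructions can be used for the division part (easily with libdivide). You still need to split the loop and use chunks, but the chunks do not need to be big, since there is no synchronization overhead. Something like 64 integers should be enough.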
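
The loop splitting described above might be sketched as follows. This is a plain single-threaded version with a chunk of 64 elements; select_pairs_chunked and kChunk are names invented here, and it assumes every value in corr_indexes lies in [0, n_attrs * n_attrs) with n_attrs small enough for the results to fit in uint16_t. Phase 1 is the part one would hand to SIMD or to libdivide.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> select_pairs_chunked(
        const std::vector<int>& corr_indexes, int n_attrs) {
    constexpr std::size_t kChunk = 64;
    std::uint16_t xs[kChunk];
    std::uint16_t ys[kChunk];
    std::vector<char> selected(n_attrs, 0);
    std::vector<std::pair<int, int>> selectedPairs;
    const std::size_t wanted = static_cast<std::size_t>(n_attrs) / 2;

    for (std::size_t base = 0; base < corr_indexes.size(); base += kChunk) {
        const std::size_t len = std::min(kChunk, corr_indexes.size() - base);
        // Phase 1: independent divisions, no loop-carried dependency.
        for (std::size_t k = 0; k < len; ++k) {
            xs[k] = static_cast<std::uint16_t>(corr_indexes[base + k] / n_attrs);
            ys[k] = static_cast<std::uint16_t>(corr_indexes[base + k] % n_attrs);
        }
        // Phase 2: serial selection, same logic as the original loop.
        for (std::size_t k = 0; k < len; ++k) {
            if (selected[xs[k]] || selected[ys[k]]) continue;
            selected[xs[k]] = 1;
            selected[ys[k]] = 1;
            selectedPairs.emplace_back(xs[k], ys[k]);
            if (selectedPairs.size() == wanted) return selectedPairs;
        }
    }
    return selectedPairs;
}
```

One trade-off of this layout is that up to one chunk of divisions may be computed past the point where the original loop would have hit its break.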

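Note that recent processors can compute such divisions efficiently, especially for 32-bit integers (64-bit ones tend to be significantly more expensive). This is especially the case for Alder Lake, Zen3 and M1 processors (P-cores). Note that both the modulo and the division are computed by one instruction on x86/x86-64 processors. Also note that while division has a pretty big latency, many processors can pipeline multiple divisions so as to get a reasonable throughput. For example, a 32-bit div instruction has a latency of 23~28 cycles on Skylake but a reciprocal throughput of 4~6. This is apparently not the case on Zen1/Zen2.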

Answer 5

So, in my case, this is the actual best solution.

Instead of representing index = row * n_cols + col, use index = (row << 16) | col for a 32-bit index, or index = (row << 32) | col for a 64-bit index. Then row = index >> 16 and col = index & 0xFFFF (or row = index >> 32 and col = index & 0xFFFFFFFF in the 64-bit case). Or even better, just take uint16_t* pairs = reinterpret_cast<uint16_t*>(index_array); then pairs[i], pairs[i+1] for each i % 2 == 0 form a pair.

This assumes the number of rows/columns is less than 2^16 (or 2^32).
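
For the 32-bit case, the packing and unpacking could be written like this. The helper names are illustrative only, and both row and col are assumed to be below 2^16.

```cpp
#include <cstdint>

// Packs row into the high 16 bits and col into the low 16 bits.
std::uint32_t pack_index(std::uint32_t row, std::uint32_t col) {
    return (row << 16) | col;
}

// Recovering both parts needs only a shift and a mask, no division.
std::uint32_t unpack_row(std::uint32_t index) { return index >> 16; }
std::uint32_t unpack_col(std::uint32_t index) { return index & 0xFFFFu; }
```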

I am still keeping the accepted answer, since it covers the case where division has to be used.

Answer 1
2 2022-09-08 15:57:56

Answer 2
2 2022-09-08 21:07:47

Answer 3
2 accepted 2022-09-08 23:44:55

Answer 4
0 2022-09-09 17:36:11

Answer 5
0 2023-01-18 03:52:16