Is it possible to change this macro to an inline function without a performance hit?

Question

(EDIT: Let's title this, "Lessons in how measurements can go wrong." I still haven't figured out exactly what's causing the discrepancy though.)

I found a very fast integer square root function here by Mark Crowne. At least with GCC on my machine, it's clearly the fastest integer square root function I've tested (including the functions in Hacker's Delight, this page, and floor(sqrt()) from the standard library).

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)

    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);

    #undef INNER_ISQRT

    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

The INNER_ISQRT macro isn't too evil, since it's local and immediately undefined after it's no longer needed. Nevertheless, I'd still like to convert it to an inline function as a matter of principle. I've read assertions in several places (including the GCC documentation) that inline functions are "just as fast" as macros, but I've had trouble converting it without a speed hit.

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}

//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }

    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

No matter what I do, the inline function is always slower than the macro. The macro version commonly times at around 2.92s for (2^28 - 1) iterations with an -O2 build, whereas the inline version commonly times at around 3.25s. EDIT: I said 2^32 - 1 iterations before, but I forgot that I had changed it. They take quite a bit longer for the full gamut.

It's possible that the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version generally preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a program. The optimizer omitted everything when I tried compiling just the functions, of course, and I'm having issues compiling it as a library due to my inexperience with GCC.)
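A likely reason the optimizer omitted everything is that both functions are declared static: with no caller in the translation unit, nothing has to be emitted. Here is a minimal sketch of one workaround. The wrapper name isqrt_entry is hypothetical, and the body of isqrt_under_test is a simple stand-in (not Mark Crowne's code) so that the snippet is self-contained:

```cpp
#include <cstdint>

//  Stand-in for the function being inspected (hypothetical body).
//  In practice this would be mcrowne_isqrt or mcrowne_inline_isqrt.
static uint32_t isqrt_under_test(uint32_t val)
{
    uint32_t root = 0;
    for (uint32_t bit = 0x8000; bit != 0; bit >>= 1)
    {
        const uint32_t trial = root | bit;
        if (uint64_t(trial) * trial <= val)
            root = trial;
    }
    return root;
}

//  A wrapper with external linkage: the compiler cannot prove it is
//  unused, so it has to emit code for it.  extern "C" only keeps the
//  symbol name unmangled, which makes it easier to find in the listing.
extern "C" uint32_t isqrt_entry(uint32_t val)
{
    return isqrt_under_test(val);
}
```

Compiling this with g++ -O2 -S (or g++ -O2 -c followed by objdump -d) should then show a body for isqrt_entry, usually with the static function inlined into it.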

In short, is there a way to write this as an inline without a speed hit? (I haven't profiled, but sqrt is one of those fundamental operations that should always be made fast, since I may be using it in many other programs than just the one I'm currently interested in. Besides, I'm just curious.)

I've even tried using templates to "bake in" the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (and the macro can avoid that, since it uses local variables directly)...well, either that or the compiler is stubbornly refusing to inline.
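For reference, a template variant along those lines might look like the sketch below. The names inner_isqrt_t and mcrowne_template_isqrt are made up for illustration; the arithmetic is the same as in the macro, with the shift amount passed as a template argument instead of a function parameter:

```cpp
#include <cstdint>

//  Hypothetical template form of INNER_ISQRT: S is a compile-time constant.
template <unsigned S>
static inline void inner_isqrt_t(uint32_t& val, uint32_t& root)
{
    //  (root + b)^2 - root^2 with b = 2^(S-1) equals (root << S) + 2^(2S-2).
    const uint32_t temp = (root << S) + (uint32_t(1) << (S * 2 - 2));
    if (val >= temp)
    {
        root += uint32_t(1) << (S - 1);
        val -= temp;
    }
}

static uint32_t mcrowne_template_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    inner_isqrt_t<15>(val, root);
    inner_isqrt_t<14>(val, root);
    inner_isqrt_t<13>(val, root);
    inner_isqrt_t<12>(val, root);
    inner_isqrt_t<11>(val, root);
    inner_isqrt_t<10>(val, root);
    inner_isqrt_t<9>(val, root);
    inner_isqrt_t<8>(val, root);
    inner_isqrt_t<7>(val, root);
    inner_isqrt_t<6>(val, root);
    inner_isqrt_t<5>(val, root);
    inner_isqrt_t<4>(val, root);
    inner_isqrt_t<3>(val, root);
    inner_isqrt_t<2>(val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}
```

Whether this compiles to the same code as the macro depends on the compiler and flags, so it is worth checking the generated assembly rather than assuming either way.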

UPDATE: user1034749 below is getting the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line, and I'm getting the same result as him. For all intents and purposes, this question is solved.

However, I'd still like to know why my measurements are coming out differently. Obviously, my measurement code or original build process was causing things to be different. I'll post the code below. Does anyone know what the deal was? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function in the loop of my main() function, but it's not inlining the entirety of the other version?

UPDATE 2 (squeezed in before testing code): Note that if I swap the order of the tests and make the inline version come first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>

/*  Functions go here */

timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }

    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}


int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;

    timespec start, end;

    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;


    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}

UPDATE 3: I still have no idea how to reliably compare the timing of different functions without the timing depending on the order of the tests. I'd greatly appreciate any tips!
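One common way to make a comparison less sensitive to test order is to run an untimed warm-up pass, alternate which function goes first, repeat several rounds, and keep the fastest time for each. The sketch below assumes C++11 (std::chrono, lambdas). The names candidate_a, candidate_b, time_pass, and compare are hypothetical, and the two candidates are simple stand-ins for mcrowne_isqrt and mcrowne_inline_isqrt. It does not explain what caused the discrepancy above; it only reduces the influence of ordering.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>

//  Stand-ins for the two functions being compared (hypothetical bodies).
static uint32_t candidate_a(uint32_t val)
{
    uint32_t root = 0;
    for (uint32_t bit = 0x8000; bit != 0; bit >>= 1)
    {
        const uint32_t trial = root | bit;
        if (uint64_t(trial) * trial <= val)
            root = trial;
    }
    return root;
}

static uint32_t candidate_b(uint32_t val)
{
    return uint32_t(std::sqrt(double(val)));
}

//  Times one pass over the strided 32-bit input range; returns seconds.
//  step must be nonzero.  The checksum is handed back to the caller so
//  the loop has an observable result.
template <typename F>
static double time_pass(F f, uint32_t step, uint32_t& sum_out)
{
    uint32_t sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (uint64_t num = step - 1; num <= 0xFFFFFFFFull; num += step)
        sum += f(uint32_t(num));
    const auto stop = std::chrono::steady_clock::now();
    sum_out = sum;
    return std::chrono::duration<double>(stop - start).count();
}

struct BenchResult
{
    double best_a, best_b;   //  fastest observed pass for each, in seconds
    uint32_t sum_a, sum_b;   //  checksums of the results
};

static BenchResult compare(uint32_t step, int rounds)
{
    //  Lambdas have distinct types, so each gets its own time_pass
    //  instantiation instead of sharing one that takes a function pointer.
    auto run_a = [](uint32_t v) { return candidate_a(v); };
    auto run_b = [](uint32_t v) { return candidate_b(v); };

    BenchResult r = { 1e300, 1e300, 0, 0 };

    //  Untimed warm-up pass for each candidate.
    time_pass(run_a, step, r.sum_a);
    time_pass(run_b, step, r.sum_b);

    for (int i = 0; i < rounds; ++i)
    {
        const bool a_first = (i % 2 == 0);   //  alternate the order each round
        for (int k = 0; k < 2; ++k)
        {
            if ((k == 0) == a_first)
                r.best_a = std::min(r.best_a, time_pass(run_a, step, r.sum_a));
            else
                r.best_b = std::min(r.best_b, time_pass(run_b, step, r.sum_b));
        }
    }
    return r;
}
```

A call such as compare(16, 5) would mirror the stride used in the test program above; printing sum_a and sum_b afterwards serves the same purpose as printing the sum there.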

However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne's code tests faster than any other pure C/C++ version I've tried by a decent margin (despite reliability issues with testing), but the following SSE code seems like it might be a little bit faster still for a scalar 32-bit integer sqrt. It can't be generalized for full-blown 64-bit unsigned integer inputs without losing precision though (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):

#include <emmintrin.h>  //  SSE2 intrinsics (__m128d, _mm_cvtsi64_sd, _mm_sqrt_sd, _mm_cvttsd_si32)

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  will be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result;
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}
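The precision caveat for 64-bit inputs comes from double's 53-bit significand: large 64-bit integers get rounded when converted. The sketch below is a portable illustration of that effect using std::sqrt rather than the SSE intrinsics; the helper names sqrt_via_double and sqrt_exact are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

//  floor(sqrt()) routed through a double, as the SSE version does internally.
static uint64_t sqrt_via_double(uint64_t num)
{
    return uint64_t(std::sqrt(double(num)));
}

//  Exact bit-by-bit reference.  trial stays below 2^32, so trial * trial
//  cannot overflow 64 bits.
static uint64_t sqrt_exact(uint64_t num)
{
    uint64_t root = 0;
    for (uint64_t bit = uint64_t(1) << 31; bit != 0; bit >>= 1)
    {
        const uint64_t trial = root | bit;
        if (trial * trial <= num)
            root = trial;
    }
    return root;
}
```

For the largest input, 2^64 - 1, the conversion to double rounds up to 2^64, so sqrt_via_double returns 2^32 while the exact answer is 2^32 - 1. For inputs that fit in 32 bits the two agree, which is why the SSE approach is fine for a 32-bit sqrt.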

Answer 1

I tried your code with gcc 4.5.3. I modified your second version of code to match the first one, for example:

(1 << ((s) * 2 - 2))

vs

(1 << ((s << 1) - 1))

Yes, s * 2 == s << 1, but "-2" and "-1"?

I also changed your types, replacing uint32_t with "unsigned long", because on my 64-bit machine "long" is not a 32-bit type.

Then I ran:

g++ -ggdb -O2 -march=native -c -pipe inline.cpp
g++ -ggdb -O2 -march=native -c -pipe macros.cpp
objdump -d inline.o > inline.s
objdump -d macros.o > macros.s

I could have used "-S" instead of "-c" to get assembly output directly, but I wanted to see the assembly without the additional information.

And you know what? The assembly is completely the same in the first and the second version. So I think your time measurements are just wrong.

7 Accepted 2011-11-22 05:29:36
