
Strange performance behaviour for 64 bit modulo operation

The last three of these method calls take approximately double the time of the first four.

The only difference is that their arguments no longer fit in an int. But should that matter? The parameter is declared as long, so the calculation should use long arithmetic either way. Does the modulo operation use a different algorithm for numbers > int.MaxValue?

I am using an AMD Athlon64 3200+, Windows XP SP3 and VS2008.

       Stopwatch sw = new Stopwatch();
       TestLong(sw, int.MaxValue - 3L);
       TestLong(sw, int.MaxValue - 2L);
       TestLong(sw, int.MaxValue - 1L);
       TestLong(sw, int.MaxValue);
       TestLong(sw, int.MaxValue + 1L);
       TestLong(sw, int.MaxValue + 2L);
       TestLong(sw, int.MaxValue + 3L);
       Console.ReadLine();

    static void TestLong(Stopwatch sw, long num)
    {
        long n = 0;
        sw.Reset();
        sw.Start();
        for (long i = 3; i < 20000000; i++)
        {
            n += num % i;
        }
        sw.Stop();
        Console.WriteLine(sw.Elapsed);            
    }

EDIT: I have now tried the same thing in C, and the issue does not occur there: all modulo operations take the same time, in release and in debug mode, with and without optimizations turned on:

#include "stdafx.h"
#include "time.h"
#include "limits.h"

static void TestLong(long long num)
{
    long long n = 0;

    clock_t t = clock();
    for (long long i = 3; i < 20000000LL*100; i++)
    {
        n += num % i;
    }

    printf("%ld - %lld\n", (long)(clock() - t), n);  /* clock_t may not be int, so cast */
}

int main()
{
    /* sizeof yields size_t, so cast for the old VS2008 printf (no %zu) */
    printf("%u %u %u %u\n\n", (unsigned)sizeof(int), (unsigned)sizeof(long), (unsigned)sizeof(long long), (unsigned)sizeof(void*));

    TestLong(3);
    TestLong(10);
    TestLong(131);
    TestLong(INT_MAX - 1L);
    TestLong(UINT_MAX + 1LL);
    TestLong(INT_MAX + 1LL);
    TestLong(LLONG_MAX - 1LL);

    getchar();
    return 0;
}

EDIT2:

Thanks for the great suggestions. I found that neither .NET nor C (in debug as well as in release mode) computes the remainder with a single CPU instruction; both call a helper function that does it.

In the C program I could get its name, which is "_allrem". It also displayed full source comments for this file, so I learned that this algorithm special-cases 32-bit divisors, rather than dividends as appears to be the case in the .NET application.

I also found out that the performance of the C program really is affected only by the value of the divisor, not the dividend. Another test showed that the performance of the remainder function in the .NET program depends on both the dividend and the divisor.

BTW: Even simple additions of long long values are compiled to consecutive add and adc instructions. So even though my processor calls itself 64-bit, it really isn't :(

EDIT3:

I have now run the C app on a Windows 7 x64 edition, compiled with Visual Studio 2010. The funny thing is, the performance behaviour stays the same, even though (I checked the assembly source) true 64-bit instructions are now used.

What a curious observation. Here's something you can do to investigate further: add a "pause" at the beginning of the program, such as a Console.ReadLine, but AFTER the first call to your method. Then build the program in "release" mode. Then start the program, not in the debugger. Then, at the pause, attach the debugger. Step through it and take a look at the code jitted for the method in question. It should be pretty easy to find the loop body.

It would be interesting to know how the generated loop body differs from the one in your C program.

The reason for all those hoops to jump through is that the jitter changes what code it generates when jitting a "debug" assembly, or when jitting a program that already has a debugger attached; in those cases it jits code that is easier to understand in a debugger. It is more interesting to see what the jitter thinks is the "best" code generated for this case, so you have to attach the debugger late, after the jitter has run.

Have you tried performing the same operations in native code on your box?

I wouldn't be surprised if the native 64-bit remainder operation special-cased situations where both arguments are within the 32-bit range, basically delegating to the 32-bit operation. (Or possibly it's the JIT that does that...) It does make a fair amount of sense to optimise that case, doesn't it?
