Performance of bitwise & on longs vs ints on 64 bit

It seems that performing an & operation between two longs takes the same amount of time as the equivalent operation on four 32-bit ints.

For example,

long1 & long2

takes as long as

int1 & int2
int3 & int4

This is running on a 64-bit OS and targeting 64-bit .NET.

In theory, the long version should be twice as fast. Has anyone encountered this previously?

EDIT

As a simplification, imagine I have two lots of 64 bits of data. I put each lot of 64 bits into a long and perform a bitwise & on the two longs.

I also take those two sets of data, put each 64 bits into two 32-bit int values, and perform two & operations. I expect the long & operation to run faster than the two int & operations.
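The two approaches described above can be sketched as follows. This is only an illustration of the equivalence, not a benchmark; the class and helper names (AndWidthDemo, Split, Join) are hypothetical and do not come from the original post.

```csharp
using System;

public static class AndWidthDemo
{
    // Splits a 64-bit value into its high and low 32-bit halves.
    public static (int Hi, int Lo) Split(long value)
    {
        return ((int)(value >> 32), (int)value);
    }

    // Recombines two 32-bit halves into one 64-bit value.
    // The low half is cast to uint so it is not sign-extended.
    public static long Join(int hi, int lo)
    {
        return ((long)hi << 32) | (uint)lo;
    }

    public static void Main()
    {
        long x = 0x1234567812345678;
        long y = 0x5555555555555555;

        // One 64-bit AND.
        long wide = x & y;

        // The same work expressed as two 32-bit ANDs.
        var (xHi, xLo) = Split(x);
        var (yHi, yLo) = Split(y);
        long narrow = Join(xHi & yHi, xLo & yLo);

        // Both paths produce the same 64 bits.
        Console.WriteLine(wide == narrow);
    }
}
```

The question is then whether the single 64-bit & is cheaper than the pair of 32-bit ones.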

I couldn't reproduce the problem.

My test was as follows (int version shown):

// deliberately made hard to optimise without whole program optimisation
public static int[] data = new int[1000000]; // long[] when testing long

// I happened to have a winforms app open, feel free to make this a console app..
private void button1_Click(object sender, EventArgs e)
{
    long best = long.MaxValue;
    for (int j = 0; j < 1000; j++)
    {
        Stopwatch timer = Stopwatch.StartNew();
        int a1 = ~0, b1 = 0x55555555, c1 = 0x12345678; // varies: see below
        int a2 = ~0, b2 = 0x55555555, c2 = 0x12345678;
        int[] d = data; // long[] when testing long
        for (int i = 0; i < d.Length; i++)
        {
            int v = d[i]; // long when testing long, see below
            a1 &= v; a2 &= v;
            b1 &= v; b2 &= v;
            c1 &= v; c2 &= v;
        }
        // don't average times: we want the result with minimal context switching
        best = Math.Min(best, timer.ElapsedTicks); 
        button1.Text = best.ToString() + ":" + (a1 + a2 + b1 + b2 + c1 + c2).ToString("X8");
    }
}

For testing longs, a1 and a2 (and likewise the b and c pairs) are merged, giving:

long a = ~0, b = 0x5555555555555555, c = 0x1234567812345678;
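For reference, a console version of the long variant might look roughly like this. It is a sketch assembled from the snippets above, not the exact program that was timed; the class name LongAndBench and the Measure helper are assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;

public static class LongAndBench
{
    // Same size as the array in the int version; zero-filled by default.
    public static long[] data = new long[1000000];

    // Runs the AND loop `runs` times and returns the lowest tick count
    // seen, together with a checksum of the accumulators from the last run.
    public static (long BestTicks, long Checksum) Measure(int runs)
    {
        long best = long.MaxValue;
        long checksum = 0;
        for (int j = 0; j < runs; j++)
        {
            Stopwatch timer = Stopwatch.StartNew();
            long a = ~0, b = 0x5555555555555555, c = 0x1234567812345678;
            long[] d = data;
            for (int i = 0; i < d.Length; i++)
            {
                long v = d[i];
                a &= v;
                b &= v;
                c &= v;
            }
            timer.Stop();
            // Keep the minimum rather than the average, as in the original test.
            best = Math.Min(best, timer.ElapsedTicks);
            // Consume the accumulators so the loop has an observable result.
            checksum = a + b + c;
        }
        return (best, checksum);
    }

    public static void Main()
    {
        var (best, checksum) = Measure(1000);
        Console.WriteLine(best + ":" + checksum.ToString("X16"));
    }
}
```

Because data is never filled, every accumulator ends up as zero; the checksum is there only so that the results are used.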

Running the two programs on my laptop (i7 Q720) as a release build outside of VS (.NET 4.5), I got the following best times, in Stopwatch ticks:

int: 2238, long: 1924

Considering that there's a huge amount of loop overhead, and that the long version is working with twice as much data (8 MB vs 4 MB), it still comes out clearly ahead. So I have no reason to believe that C# is not making full use of the processor's 64-bit bit operations.

But we really shouldn't be benchmarking it in the first place. If there's a concern, simply check the JIT-compiled code (Debug -> Windows -> Disassembly). Ensure the compiler is using the instructions you expect it to use, and move on.

Attempting to measure the performance of those individual instructions on your processor (and this could well be specific to your processor model) in anything other than assembler is a very bad idea, and from within a JIT-compiled language like C# it is beyond futile. There's no need to anyway, as it's all in Intel's optimisation handbook should you need to know.

To this end, here's the disassembly of a &= v for the long version of the program on x64 (release, but inside the debugger; I'm unsure whether this affects the assembly, but it certainly affects the performance):

00000111  mov         rcx,qword ptr [rsp+60h] ; a &= v
00000116  mov         rax,qword ptr [rsp+38h] 
0000011b  and         rax,rcx 
0000011e  mov         qword ptr [rsp+38h],rax 

As you can see, there's a single 64-bit and operation as expected, along with three 64-bit moves. So far so good, and exactly half the number of ops of the int version:

00000122  mov         ecx,dword ptr [rsp+5Ch] ; a1 &= v
00000126  mov         eax,dword ptr [rsp+38h] 
0000012a  and         eax,ecx 
0000012c  mov         dword ptr [rsp+38h],eax 
00000130  mov         ecx,dword ptr [rsp+5Ch] ; a2 &= v
00000134  mov         eax,dword ptr [rsp+44h] 
00000138  and         eax,ecx 
0000013a  mov         dword ptr [rsp+44h],eax 

I can only conclude that the problem you're seeing is specific to something about your test suite, build options, processor... or, quite possibly, that the & isn't the point of contention you believe it to be. HTH.

I can't reproduce your timings. The following code generates two arrays: one of 1,000,000 longs and one of 2,000,000 ints. It then loops through each array, applying the & operator to successive values. It keeps a running sum and outputs it, just to make sure that the compiler doesn't remove the loop entirely because it isn't doing anything.

Over dozens of successive runs, the long loop is at least twice as fast as the int loop. This is running on a Core 2 Quad with Windows 8 Developer Preview and Visual Studio 11 Developer Preview. The program is compiled with "Any CPU" and run in 64-bit mode. All testing was done using Ctrl+F5 so that the debugger isn't involved.

        int numLongs = 1000000;
        int numInts = 2*numLongs;
        var longs = new long[numLongs];
        var ints = new int[numInts];
        Random rnd = new Random();
        // generate values
        for (int i = 0; i < numLongs; ++i)
        {
            int i1 = rnd.Next();
            int i2 = rnd.Next();
            ints[2 * i] = i1;
            ints[2 * i + 1] = i2;
            long l = i1;
            l = (l << 32) | (uint)i2;
            longs[i] = l;
        }

        // time operations.
        int isum = 0;
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < numInts; i += 2)
        {
            isum += ints[i] & ints[i + 1];
        }
        sw.Stop();
        Console.WriteLine("Ints: {0} ms. isum = {1}", sw.ElapsedMilliseconds, isum);

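        long lsum = 0;
        // halfLongs bounds the loop to the first half of the longs array,
        // so it runs numLongs / 4 iterations (the int loop runs numInts / 2).
        int halfLongs = numLongs / 2;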
        sw.Restart();
        for (int i = 0; i < halfLongs; i += 2)
        {
            lsum += longs[i] & longs[i + 1];
        }
        sw.Stop();
        Console.WriteLine("Longs: {0} ms. lsum = {1}", sw.ElapsedMilliseconds, lsum);
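As a cross-check, here is one possible variant in which both loops visit their whole array and compute the same checksum, so the two results can be compared directly. It is a sketch under stated assumptions, not the code that produced the timings above: the names PairwiseAndBench, Fill, SumLongPairs and SumIntPairs are hypothetical, the fixed seed is an arbitrary choice, and the int loop recombines its halves into a long, which is extra work that the loop above does not do.

```csharp
using System;
using System.Diagnostics;

public static class PairwiseAndBench
{
    // Fills both arrays with the same bits: longs[i] holds ints[2 * i] in
    // its high half and ints[2 * i + 1] in its low half.
    public static void Fill(int[] ints, long[] longs, int seed)
    {
        var rnd = new Random(seed);
        for (int i = 0; i < longs.Length; ++i)
        {
            int hi = rnd.Next();
            int lo = rnd.Next();
            ints[2 * i] = hi;
            ints[2 * i + 1] = lo;
            longs[i] = ((long)hi << 32) | (uint)lo;
        }
    }

    // ANDs each adjacent pair of longs across the whole array.
    public static long SumLongPairs(long[] longs)
    {
        long sum = 0;
        for (int i = 0; i + 1 < longs.Length; i += 2)
        {
            sum += longs[i] & longs[i + 1];
        }
        return sum;
    }

    // Same computation on the 32-bit halves: four ints per pair of longs.
    public static long SumIntPairs(int[] ints)
    {
        long sum = 0;
        for (int i = 0; i + 3 < ints.Length; i += 4)
        {
            int hi = ints[i] & ints[i + 2];
            int lo = ints[i + 1] & ints[i + 3];
            sum += ((long)hi << 32) | (uint)lo;
        }
        return sum;
    }

    public static void Main()
    {
        int numLongs = 1000000;
        var longs = new long[numLongs];
        var ints = new int[2 * numLongs];
        Fill(ints, longs, 12345);

        Stopwatch sw = Stopwatch.StartNew();
        long isum = SumIntPairs(ints);
        sw.Stop();
        Console.WriteLine("Ints: {0} ms. isum = {1}", sw.ElapsedMilliseconds, isum);

        sw.Restart();
        long lsum = SumLongPairs(longs);
        sw.Stop();
        Console.WriteLine("Longs: {0} ms. lsum = {1}", sw.ElapsedMilliseconds, lsum);
    }
}
```

If the two sums differ, the loops are not processing the same data, and any timing comparison between them should be treated with suspicion.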
