
What is the performance impact of using int64_t instead of int32_t on 32-bit systems?

Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.
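Something along these lines, just as a placeholder sketch (the class name, the millisecond resolution and the members shown are illustrative assumptions, not an existing design):

#include <cstdint>
#include <ctime>

// Placeholder sketch: one 64-bit value (milliseconds since the epoch) replaces
// time_t everywhere, giving sub-second precision and avoiding the 2038 limit.
class Time {
public:
    explicit Time(int64_t ms = 0) : ms_(ms) {}
    static Time now() { return Time(static_cast<int64_t>(std::time(0)) * 1000); }
    int64_t milliseconds() const { return ms_; }
    std::time_t to_time_t() const { return static_cast<std::time_t>(ms_ / 1000); }
private:
    int64_t ms_;   // all arithmetic on time values becomes 64-bit arithmetic
};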

Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way of dealing with time values, which might make the software more difficult to maintain.

What I'm interested in:

  • which factors influence the performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
  • which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
  • are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
  • does anyone have their own experience of this performance impact?

I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).

which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well?

Mostly the processor architecture (and model - please read "model" wherever I mention "processor architecture" in this section). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.

The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler, which changes what the compiler does" in some cases - but that's probably a small effect).

Will a normal 32-bit system use the 64-bit registers of modern CPUs?

This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system; the extra 32 bits of the registers are completely invisible, just as they would be if the system were actually a "true 32-bit system".

which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?

Addition and subtraction are worse, as these have to be done as a sequence of two operations, and the second operation requires the first to have completed - this is not the case if the compiler is just producing two add operations on independent data.

Multiplication will get a lot worse if the input parameters are actually 64-bit - so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is due to the fact that the processor can produce a 32 x 32 bit multiply into a 64-bit result pretty well - some 5-10 clock cycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so it will take longer.

Division is a similar problem to multiplication - but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.

The data will also take twice as much cache space, which may impact the results. And as a similar consequence, general assignment and passing data around will take at least twice as long, since there is twice as much data to operate on.

The compiler will also need to use more registers.

are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?

Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.

If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.

does anyone have their own experience of this performance impact?

I've recompiled some code that uses 64-bit integers for the 64-bit architecture, and found the performance improved by some substantial amount - as much as 25% on some bits of code.

Changing your OS to a 64-bit version of the same OS would help, perhaps?

Edit:

Because I like to find out what the difference is in these sorts of things, I have written a bit of code, with some primitive templates (still learning that bit - templates aren't exactly my hottest topic, I must say - give me bit-fiddling and pointer arithmetic, and I'll (usually) get it right...)

Here's the code I wrote, trying to replicate a few common functions:

#include <iostream>
#include <cstdint>
#include <ctime>

using namespace std;

static __inline__ uint64_t rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}

template<typename T>
static T add_numbers(const T *v, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i];
    return sum;
}


template<typename T, const int size>
static T add_matrix(const T v[size][size])
{
    T sum[size] = {};
    for(int i = 0; i < size; i++)
    {
        for(int j = 0; j < size; j++)
            sum[i] += v[i][j];
    }
    T tsum = 0;
    for(int i = 0; i < size; i++)
        tsum += sum[i];
    return tsum;
}



template<typename T>
static T add_mul_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] * mul;
    return sum;
}

template<typename T>
static T add_div_numbers(const T *v, const T mul, const int size)
{
    T sum = 0;
    for(int i = 0; i < size; i++)
        sum += v[i] / mul;
    return sum;
}


template<typename T>
void fill_array(T *v, const int size)
{
    for(int i = 0; i < size; i++)
        v[i] = i;
}

template<typename T, const int size>
void fill_array(T v[size][size])
{
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            v[i][j] = i + size * j;
}




uint32_t bench_add_numbers(const uint32_t v[], const int size)
{
    uint32_t res = add_numbers(v, size);
    return res;
}

uint64_t bench_add_numbers(const uint64_t v[], const int size)
{
    uint64_t res = add_numbers(v, size);
    return res;
}

uint32_t bench_add_mul_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_mul_numbers(v, c, size);
    return res;
}

uint64_t bench_add_mul_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_mul_numbers(v, c, size);
    return res;
}

uint32_t bench_add_div_numbers(const uint32_t v[], const int size)
{
    const uint32_t c = 7;
    uint32_t res = add_div_numbers(v, c, size);
    return res;
}

uint64_t bench_add_div_numbers(const uint64_t v[], const int size)
{
    const uint64_t c = 7;
    uint64_t res = add_div_numbers(v, c, size);
    return res;
}


template<const int size>
uint32_t bench_matrix(const uint32_t v[size][size])
{
    uint32_t res = add_matrix(v);
    return res;
}
template<const int size>
uint64_t bench_matrix(const uint64_t v[size][size])
{
    uint64_t res = add_matrix(v);
    return res;
}


template<typename T>
void runbench(T (*func)(const T *v, const int size), const char *name, T *v, const int size)
{
    fill_array(v, size);

    uint64_t t = rdtsc();
    T res = func(v, size);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}

template<typename T, const int size>
void runbench2(T (*func)(const T v[size][size]), const char *name, T v[size][size])
{
    fill_array(v);

    uint64_t t = rdtsc();
    T res = func(v);
    t = rdtsc() - t;
    cout << "result = " << res << endl;
    cout << name << " time in clocks " << dec << t  << endl;
}


int main()
{
    // spin up CPU to full speed...
    time_t t = time(NULL);
    while(t == time(NULL)) ;

    const int vsize=10000;

    uint32_t v32[vsize];
    uint64_t v64[vsize];

    uint32_t m32[100][100];
    uint64_t m64[100][100];


    runbench(bench_add_numbers, "Add 32", v32, vsize);
    runbench(bench_add_numbers, "Add 64", v64, vsize);

    runbench(bench_add_mul_numbers, "Add Mul 32", v32, vsize);
    runbench(bench_add_mul_numbers, "Add Mul 64", v64, vsize);

    runbench(bench_add_div_numbers, "Add Div 32", v32, vsize);
    runbench(bench_add_div_numbers, "Add Div 64", v64, vsize);

    runbench2(bench_matrix, "Matrix 32", m32);
    runbench2(bench_matrix, "Matrix 64", m64);
}

Compiled with:

g++ -Wall -m32 -O3 -o 32vs64 32vs64.cpp -std=c++0x

And the results are (note: see the 2016 results below - these results are slightly optimistic, due to SSE instructions being used in 64-bit mode but not in 32-bit mode):

result = 49995000
Add 32 time in clocks 20784
result = 49995000
Add 64 time in clocks 30358
result = 349965000
Add Mul 32 time in clocks 30182
result = 349965000
Add Mul 64 time in clocks 79081
result = 7137858
Add Div 32 time in clocks 60167
result = 7137858
Add Div 64 time in clocks 457116
result = 49995000
Matrix 32 time in clocks 22831
result = 49995000
Matrix 64 time in clocks 23823

As you can see, addition and multiplication aren't that much worse. Division gets really bad. Interestingly, the matrix addition shows hardly any difference at all.

And is it faster on 64-bit, I hear some of you ask? Using the same compiler options, just -m64 instead of -m32 - yup, a lot faster:

result = 49995000
Add 32 time in clocks 8366
result = 49995000
Add 64 time in clocks 16188
result = 349965000
Add Mul 32 time in clocks 15943
result = 349965000
Add Mul 64 time in clocks 35828
result = 7137858
Add Div 32 time in clocks 50176
result = 7137858
Add Div 64 time in clocks 50472
result = 49995000
Matrix 32 time in clocks 12294
result = 49995000
Matrix 64 time in clocks 14733

Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.

I'm typically using clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than above, as I've updated my machine - and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway.)

As a short table:

Test name      | no-sse 32 | no-sse 64 | sse 32 | sse 64 |
----------------------------------------------------------
Add uint32_t   |   20837   |   10221   |   3701 |   3017 |
----------------------------------------------------------
Add uint64_t   |   18633   |   11270   |   9328 |   9180 |
----------------------------------------------------------
Add Mul 32     |   26785   |   18342   |  11510 |  11562 |
----------------------------------------------------------
Add Mul 64     |   44701   |   17693   |  29213 |  16159 |
----------------------------------------------------------
Add Div 32     |   44570   |   47695   |  17713 |  17523 |
----------------------------------------------------------
Add Div 64     |  405258   |   52875   | 405150 |  47043 |
----------------------------------------------------------
Matrix 32      |   41470   |   15811   |  21542 |   8622 |
----------------------------------------------------------
Matrix 64      |   22184   |   15168   |  13757 |  12448 |

Full results with compile options:

$ clang++ -m32 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 20837
result = 49995000
Add 64 time in clocks 18633
result = 349965000
Add Mul 32 time in clocks 26785
result = 349965000
Add Mul 64 time in clocks 44701
result = 7137858
Add Div 32 time in clocks 44570
result = 7137858
Add Div 64 time in clocks 405258
result = 49995000
Matrix 32 time in clocks 41470
result = 49995000
Matrix 64 time in clocks 22184

$ clang++ -m32 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3701
result = 49995000
Add 64 time in clocks 9328
result = 349965000
Add Mul 32 time in clocks 11510
result = 349965000
Add Mul 64 time in clocks 29213
result = 7137858
Add Div 32 time in clocks 17713
result = 7137858
Add Div 64 time in clocks 405150
result = 49995000
Matrix 32 time in clocks 21542
result = 49995000
Matrix 64 time in clocks 13757


$ clang++ -m64 -msse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 3017
result = 49995000
Add 64 time in clocks 9180
result = 349965000
Add Mul 32 time in clocks 11562
result = 349965000
Add Mul 64 time in clocks 16159
result = 7137858
Add Div 32 time in clocks 17523
result = 7137858
Add Div 64 time in clocks 47043
result = 49995000
Matrix 32 time in clocks 8622
result = 49995000
Matrix 64 time in clocks 12448


$ clang++ -m64 -mno-sse 32vs64.cpp --std=c++11 -O2
$ ./a.out
result = 49995000
Add 32 time in clocks 10221
result = 49995000
Add 64 time in clocks 11270
result = 349965000
Add Mul 32 time in clocks 18342
result = 349965000
Add Mul 64 time in clocks 17693
result = 7137858
Add Div 32 time in clocks 47695
result = 7137858
Add Div 64 time in clocks 52875
result = 49995000
Matrix 32 time in clocks 15811
result = 49995000
Matrix 64 time in clocks 15168

More than you ever wanted to know about doing 64-bit math in 32-bit mode...

When you use 64-bit numbers in 32-bit mode (even on a 64-bit CPU, if the code is compiled for 32-bit), they are stored as two separate 32-bit numbers, one holding the higher bits of the number and the other holding the lower bits. The impact of this depends on the instruction. (tl;dr - generally, doing 64-bit math on a 32-bit CPU is in theory 2 times slower, as long as you don't divide/modulo; in practice the difference is going to be smaller (1.3x would be my guess), because programs usually don't just do math on 64-bit integers, and also because of pipelining the difference may be much smaller in your program.)

Addition/subtraction

Many architectures support a so-called carry flag. It's set when the result of addition overflows, or the result of subtraction doesn't underflow. The behaviour of those bits can be shown with long addition and long subtraction. C in this example shows either a bit higher than the highest representable bit (during the operation), or the carry flag (after the operation).

  C 7 6 5 4 3 2 1 0      C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1      1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1    -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0    = 0 1 1 1 1 1 1 1 1

Why is the carry flag relevant? Well, it just so happens that CPUs usually have two separate addition and subtraction operations. In x86, the addition operations are called add and adc. add stands for addition, while adc stands for addition with carry. The difference between them is that adc considers the carry bit, and if it is set, it adds one to the result.

Similarly, subtraction with carry subtracts 1 from the result if the carry bit is not set.

This behaviour makes it easy to implement arbitrary-size addition and subtraction on integers. The result of adding x and y (assuming those are 8-bit) is never bigger than 0x1FE. If you add 1, you get 0x1FF, so 9 bits are enough to represent the result of any 8-bit addition. If you start the addition with add, and then add any bits beyond the initial ones with adc, you can do addition on any size of data you like.

Addition of two 64-bit values on a 32-bit CPU works as follows.

  1. Add the lower 32 bits of b to the lower 32 bits of a.
  2. Add, with carry, the upper 32 bits of b to the upper 32 bits of a.

Analogously for subtraction.

This gives 2 instructions; however, because of instruction pipelining, it may be slower than that, as one calculation depends on the other one finishing, so if the CPU doesn't have anything else to do besides the 64-bit addition, it may have to wait for the first addition to be done.
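As a C++ sketch of those two steps (the two-word struct and the function are hypothetical helpers that just mirror what the compiler emits, not real library code):

#include <cstdint>

struct u64_parts { uint32_t lo, hi; };

u64_parts add64(u64_parts a, u64_parts b)
{
    u64_parts r;
    r.lo = a.lo + b.lo;                         // "add": the low words, may wrap around
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;   // wrap-around means a carry was produced
    r.hi = a.hi + b.hi + carry;                 // "adc": high words plus the carry
    return r;
}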

Multiplication

It so happens on x86 that imul and mul can be used in such a way that the overflow is stored in the edx register. Therefore, multiplying two 32-bit values to get a 64-bit value is really easy. Such a multiplication is one instruction, but to make use of it, one of the multiplied values must be stored in eax.
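A small illustration (the function name is mine): a widening 32 x 32 multiply is something a 32-bit build can do in a single instruction, with the full result landing in edx:eax.

#include <cstdint>

uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return static_cast<uint64_t>(a) * b;  // one widening multiply, no 64-bit emulation needed
}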

Anyway, for the more general case of multiplying two 64-bit values, the result can be calculated using the following formula (assume a function r that removes bits beyond 32 bits).

First of all, it's easy to notice that the lower 32 bits of the result will be the product of the lower 32 bits of the multiplied variables. This is due to the congruence relation:

a 1b 1 (mod n ) 1≡b 1(MOD N)
a 2b 2 (mod n ) 2≡B 2(MOD N)
a 1 a 2b 1 b 2 (mod n ) A 1 A 2≡b 1 B 2(MOD N)

Therefore, the task is limited to just determining the higher 32 bits. To calculate the higher 32 bits of the result, the following values should be added together.

  • Higher 32 bits of the multiplication of both lower 32-bit halves (the overflow, which the CPU can store in edx)
  • Higher 32 bits of the first variable multiplied with the lower 32 bits of the second variable
  • Lower 32 bits of the first variable multiplied with the higher 32 bits of the second variable

This gives about 5 instructions; however, because of the relatively limited number of registers in x86 (ignoring extensions to the architecture), they cannot take too much advantage of pipelining. Enable SSE if you want to improve the speed of multiplication, as this increases the number of registers.
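Here is a sketch of that decomposition in C++ (the helper name and layout are my own; it computes only the low 64 bits of the product, which is what the hardware lowering keeps):

#include <cstdint>

uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    uint32_t a_lo = static_cast<uint32_t>(a), a_hi = static_cast<uint32_t>(a >> 32);
    uint32_t b_lo = static_cast<uint32_t>(b), b_hi = static_cast<uint32_t>(b >> 32);

    uint64_t low  = static_cast<uint64_t>(a_lo) * b_lo;  // full 32x32 -> 64 product
    uint32_t high = static_cast<uint32_t>(low >> 32)     // its overflow (the "edx" part)
                  + a_hi * b_lo                          // cross terms: only their low
                  + a_lo * b_hi;                         // 32 bits land in the result
    return (static_cast<uint64_t>(high) << 32) | static_cast<uint32_t>(low);
}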

Division/Modulo (both are similar in implementation)

I don't know how it works, but it's much more complex than addition, subtraction or even multiplication. It's likely to be ten times slower than division on a 64-bit CPU, however. Check "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", page 257, for more details if you can understand it (I can't, in a way that I could explain, unfortunately).

If you divide by a power of 2, please refer to the shifting section, because that's essentially what the compiler can optimize division into (plus adding the most significant bit before shifting, for signed numbers).
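As a hedged illustration of that power-of-two case (the function name and the choice of divisor are mine; the ">> 63" sign extraction assumes the arithmetic right shift that mainstream compilers use for signed types), this is the bias-then-shift form a signed division by 8 typically lowers to:

#include <cstdint>

// Signed division by 8 rewritten as "add a bias for negative values, then shift":
// (x >> 63) is 0 for non-negative x and all-ones for negative x, so the mask adds
// 7 exactly when x is negative, making the arithmetic shift round toward zero.
int64_t div_by_8(int64_t x)
{
    return (x + ((x >> 63) & 7)) >> 3;   // equivalent to x / 8
}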

Or/And/Xor

Considering those operations are single-bit operations, nothing special happens here; the bitwise operation is just done twice, once on each half.

Shifting left/right

Interestingly, x86 actually has an instruction to perform a 64-bit left shift, called shld, which, instead of replacing the least significant bits of the value with zeros, replaces them with the most significant bits of a different register. Similarly, there is the shrd instruction for right shifts. This would easily make 64-bit shifting a two-instruction operation.

However, that's only the case for constant shifts. When a shift is not constant, things get trickier, as the x86 architecture only supports shifts with 0-31 as the value. Anything beyond that is, according to the official documentation, undefined, and in practice a bitwise AND with 0x1F is performed on the value. Therefore, when the shift value is higher than 31, one of the value's storage words is erased entirely (for a left shift, that's the lower word; for a right shift, the higher word). The other one gets the value that was in the register that was erased, and then the shift operation is performed. As a result, this depends on the branch predictor to make good predictions, and is a bit slower because the value needs to be checked.
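A sketch of what a non-constant 64-bit left shift turns into (the two-word struct and function are hypothetical illustrations, assuming the shift count is below 64):

#include <cstdint>

struct u64_parts { uint32_t lo, hi; };

u64_parts shl64(u64_parts v, unsigned n)          // n is assumed to be < 64
{
    u64_parts r;
    if (n == 0) {
        r = v;
    } else if (n < 32) {
        r.hi = (v.hi << n) | (v.lo >> (32 - n));  // like shld: high word takes bits from the low word
        r.lo = v.lo << n;
    } else {
        r.hi = v.lo << (n - 32);                  // the low word moves entirely into the high word
        r.lo = 0;
    }
    return r;                                     // the n >= 32 branch is why the branch predictor matters
}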

__builtin_popcount[ll]

__builtin_popcount(lower) + __builtin_popcount(higher)

Other builtins

I'm too lazy to finish the answer at this point. Does anyone even use those?

Unsigned vs signed

Addition, subtraction, multiplication, or, and, xor, and shift left generate exactly the same code. Shift right uses only slightly different code (arithmetic shift vs logical shift), but structurally it's the same. It's likely that division generates different code, however, and signed division is likely to be slower than unsigned division.

Benchmarks

Benchmarks? They are mostly meaningless, as instruction pipelining will usually lead to things being faster when you don't constantly repeat the same operation. Feel free to consider division slow, but nothing else really is, and when you get outside of benchmarks, you may notice that because of pipelining, doing 64-bit operations on a 32-bit CPU is not slow at all.

Benchmark your own application; don't trust micro-benchmarks that don't do what your application does. Modern CPUs are quite tricky, so unrelated benchmarks can and will lie.

Your question sounds pretty weird in its environment. You use time_t, which takes up 32 bits. You need additional information, which means more bits. So you are forced to use something bigger than int32. It doesn't matter what the performance is, right? The choice is between using, say, just 40 bits or going straight to int64. Unless millions of instances must be stored, the latter is a sensible choice.

As others pointed out, the only way to know the true performance is to measure it with a profiler (for some gross samples a simple clock will do), so just go ahead and measure. It can't be hard to global-replace your time_t usage with a typedef, redefine it to 64 bits, and patch up the few instances where a real time_t was expected.
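A minimal sketch of that global-replace idea (the alias name and the helper function are hypothetical), concentrating the change in a single typedef:

#include <cstdint>
#include <ctime>

// One central alias for time values; switching from time_t to int64_t is then a
// one-line change, and only code that needs a genuine time_t has to convert.
typedef int64_t app_time_t;          // previously: typedef time_t app_time_t;

app_time_t add_offset(app_time_t t, app_time_t delta)
{
    return t + delta;                // call sites keep compiling unchanged
}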

My bet would be on "unmeasurable difference", unless your current time_t instances take up at least a few megs of memory. On current Intel-like platforms the cores spend most of their time waiting for external memory to get into cache. A single cache miss stalls for hundreds of cycles, which makes calculating 1-tick differences between instructions infeasible. Your real performance may drop due to things like your current structure just fitting in a cache line while the bigger one needs two. And if you never measured your current performance, you might discover that you could gain an extreme speedup of some functions just by adding some alignment or exchanging the order of some members in a structure. Or pack(1) the structure instead of using the default layout...

Addition/subtraction basically becomes two cycles each; multiplication and division depend on the actual CPU. The general performance impact will be rather low.

Note that Intel Core 2 supports EM64T.
