
Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux

I have a piece of code that runs 2x faster on Windows than on Linux. Here are the times I measured:

g++ -Ofast -march=native -m64
    29.1123
g++ -Ofast -march=native
    29.0497
clang++ -Ofast -march=native
    28.9192
visual studio 2013 Debug 32b
    13.8802
visual studio 2013 Release 32b
    12.5569

That really seems like far too large a difference.

Here is the code:

#include <iostream>
#include <map>
#include <chrono>
#include <cstdlib>  // for system()
static std::size_t Count = 1000;

static std::size_t MaxNum = 50000000;

bool IsPrime(std::size_t num)
{
    for (std::size_t i = 2; i < num; i++)
    {
        if (num % i == 0)
            return false;
    }
    return true;
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    std::map<std::size_t, bool> value;
    for (std::size_t i = 0; i < Count; i++)
    {
        value[i] = IsPrime(i);
        value[MaxNum - i] = IsPrime(MaxNum - i);
    }
    std::chrono::duration<double> serialTime = std::chrono::steady_clock::now() - start;
    std::cout << "Serial time = " << serialTime.count() << std::endl;

    system("pause");
    return 0;
}

All of this was measured on the same machine, with Windows 8 vs. Linux 3.19.5 (gcc 4.9.2, clang 3.5.0). Both Linux and Windows are 64-bit.

What could be the reason for this? Some scheduler issue?

You don't say whether the Windows/Linux operating systems are 32- or 64-bit.

On a 64-bit Linux machine, if you change the size_t to an int, you'll find that execution times on Linux drop to values similar to those you have for Windows.
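A minimal sketch of that change (IsPrime32 is a hypothetical name, not from the question; the loop is otherwise identical):

```cpp
#include <cstddef>

// Same trial division as the question's IsPrime, but with 32-bit
// operands, so the compiler can emit the much faster 32-bit div
// instruction. Assumes all inputs fit in 32 bits, as they do here
// (MaxNum = 50000000 < 2^32).
bool IsPrime32(unsigned int num)
{
    for (unsigned int i = 2; i < num; i++)
    {
        if (num % i == 0)
            return false;
    }
    return true;
}
```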

size_t is an int32 on win32 and an int64 on win64.

EDIT: just seen your Windows disassembly.

Your Windows OS is the 32-bit variety (or at least you've compiled for 32-bit).

size_t is a 64-bit unsigned type in the x86-64 System V ABI on Linux, where you're compiling a 64-bit binary. But in a 32-bit binary (like you're making on Windows), it's only 32-bit, and thus the trial-division loop is only doing 32-bit division. (size_t is for sizes of C++ objects, not files, so it only needs to be pointer width.)

On x86-64 Linux, -m64 is the default, because 32-bit is basically considered obsolete. To make a 32-bit executable, use g++ -m32.


Unlike most integer operations, division throughput (and latency) on modern x86 CPUs depends on the operand size: 64-bit division is slower than 32-bit division. (See https://agner.org/optimize/ for tables of instruction throughput / latency / uops and which ports they run on.)

And it's very slow compared to other operations like multiply or especially add: your program completely bottlenecks on integer division throughput, not on the map operations. (With perf counters for a 32-bit binary on Skylake, arith.divider_active counts 24.03 billion cycles that the divide execution unit was active, out of 24.84 billion core clock cycles total. Yes, that's right: division is so slow that there's a performance counter just for that execution unit. It's also a special case because it's not fully pipelined, so even in a case like this where you have independent divisions, it can't start a new one every clock cycle the way it can for other multi-cycle operations like FP or integer multiply.)

g++ unfortunately fails to optimize based on the fact that the numbers are compile-time constants and thus have limited ranges. It would be legal (and a huge speedup) for g++ -m64 to optimize to div ecx instead of div rcx. That change makes the 64-bit binary run as fast as the 32-bit binary. (It's computing exactly the same thing, just without as many high zero bits. The result is implicitly zero-extended to fill the 64-bit register, instead of explicitly calculated as zero by the divider, and that's much faster in this case.)
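In the meantime you can get that speedup from source by doing the modulo in 32 bits yourself when the operands are known to fit (a sketch with a hypothetical helper, not code from the question):

```cpp
#include <cstdint>
#include <cstddef>

// Both operands fit in 32 bits in this program (MaxNum = 50000000),
// so casting lets the compiler use the fast 32-bit div even in a
// 64-bit build. Hypothetical helper name.
static inline bool divides(std::size_t i, std::size_t num)
{
    return static_cast<std::uint32_t>(num) %
           static_cast<std::uint32_t>(i) == 0;
}
```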

I verified this on Skylake by editing the binary to replace the 0x48 REX.W prefix with 0x40, changing div rcx into div ecx with a do-nothing REX prefix. The total cycles taken was within 1% of the 32-bit binary from g++ -O3 -m32 -march=native. (And time, since the CPU happened to be running at the same clock speed for both runs.) (g++7.3 asm output on the Godbolt compiler explorer.)

32-bit code, gcc7.3 -O3 on a 3.9GHz Skylake i7-6700k running Linux

$ cat > primes.cpp     # and paste your code, then edit to remove the silly system("pause")
$ g++ -Ofast -march=native -m32 primes.cpp -o prime32

$ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,arith.divider_active  ./prime32 
Serial time = 6.37695


 Performance counter stats for './prime32':
       6377.915381      task-clock (msec)         #    1.000 CPUs utilized          
                66      context-switches          #    0.010 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               111      page-faults               #    0.017 K/sec                  
    24,843,147,246      cycles                    #    3.895 GHz                    
     6,209,323,281      branches                  #  973.566 M/sec                  
    24,846,631,255      instructions              #    1.00  insn per cycle         
    49,663,976,413      uops_issued.any           # 7786.867 M/sec                  
    40,368,420,246      uops_executed.thread      # 6329.407 M/sec                  
    24,026,890,696      arith.divider_active      # 3767.201 M/sec                  

       6.378365398 seconds time elapsed

vs. 64-bit with REX.W=0 (hand-edited binary)

 Performance counter stats for './prime64.div32':

       6399.385863      task-clock (msec)         #    1.000 CPUs utilized          
                69      context-switches          #    0.011 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               146      page-faults               #    0.023 K/sec                  
    24,938,804,081      cycles                    #    3.897 GHz                    
     6,209,114,782      branches                  #  970.267 M/sec                  
    24,845,723,992      instructions              #    1.00  insn per cycle         
    49,662,777,865      uops_issued.any           # 7760.554 M/sec                  
    40,366,734,518      uops_executed.thread      # 6307.908 M/sec                  
    24,045,288,378      arith.divider_active      # 3757.437 M/sec                  

       6.399836443 seconds time elapsed

vs. the original 64-bit binary:

$ g++ -Ofast -march=native primes.cpp -o prime64
$ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,arith.divider_active  ./prime64
Serial time = 20.1916

 Performance counter stats for './prime64':

      20193.891072      task-clock (msec)         #    1.000 CPUs utilized          
                48      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               148      page-faults               #    0.007 K/sec                  
    78,733,701,858      cycles                    #    3.899 GHz                    
     6,225,969,960      branches                  #  308.310 M/sec                  
    24,930,415,081      instructions              #    0.32  insn per cycle         
   127,285,602,089      uops_issued.any           # 6303.174 M/sec                  
   111,797,662,287      uops_executed.thread      # 5536.212 M/sec                  
    27,904,367,637      arith.divider_active      # 1381.822 M/sec                  

      20.193208642 seconds time elapsed

IDK why the performance counter for arith.divider_active didn't go up more. div r64 is significantly more uops than div r32, so possibly it hurts out-of-order execution and reduces overlap of surrounding code. But we know that back-to-back div with no other instructions has a similar performance difference.

And anyway, this code spends most of its time in that terrible trial-division loop (which checks every odd and even divisor, even though we can already rule out all the even divisors after checking the low bit... and which checks all the way up to num instead of sqrt(num), so it's horribly slow for very large primes).
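For comparison, a sketch of a less-terrible trial division applying those two fixes (only odd divisors, stopping at sqrt(num)); note the question's version also reports 0 and 1 as prime, which this sketch fixes too:

```cpp
#include <cstddef>

// Trial division that skips even divisors and stops at sqrt(num).
// Hypothetical name; still O(sqrt(n)) per number, so a sieve is
// better for checking many numbers.
bool IsPrimeFaster(std::size_t num)
{
    if (num < 2)
        return false;            // the question's version gets 0 and 1 wrong
    if (num % 2 == 0)
        return num == 2;         // rule out all even numbers with one check
    for (std::size_t i = 3; i * i <= num; i += 2)  // odd divisors up to sqrt
        if (num % i == 0)
            return false;
    return true;
}
```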

According to perf record, 99.98% of the cpu cycles events fired in the 2nd trial-division loop, the one on MaxNum - i, so div was still the entire bottleneck; it's just a quirk of performance counters that not all the time was recorded as arith.divider_active.

  3.92 │1e8:   mov    rax,rbp
  0.02 │       xor    edx,edx
 95.99 │       div    rcx
  0.05 │       test   rdx,rdx 
       │     ↓ je     238     
  ... loop counter logic to increment rcx

From Agner Fog's instruction tables for Skylake:

           uops    uops      ports          latency     recip tput
           fused   unfused
DIV r32     10     10       p0 p1 p5 p6     26           6
DIV r64     36     36       p0 p1 p5 p6     35-88        21-83

(div r64 itself is actually data-dependent on the actual size of its inputs, with small inputs being faster. The really slow cases are with very large quotients, IIRC. It's probably also slower when the upper half of the 128-bit dividend in RDX:RAX is non-zero; C compilers typically only ever use div with rdx=0.)

The ratio of the cycle counts (78733701858 / 24938804081 = ~3.15) is actually smaller than the ratio of best-case throughputs (21/6 = 3.5). It should be a pure throughput bottleneck, not latency, because the next loop iteration can start without waiting for the last division result. (Thanks to branch prediction + speculative execution.) Maybe there are some branch misses in that division loop.

If you only found a 2x performance ratio, then you have a different CPU. Possibly Haswell, where 32-bit div throughput is 9-11 cycles and 64-bit div throughput is 21-74.

Probably not AMD: the best-case throughputs there are still small even for div r64. e.g. Steamroller has div r32 throughput = 1 per 13-39 cycles, and div r64 = 13-70. I'd guess that with the same actual numbers, you'd probably get the same performance even if you give them to the divider in wider registers, unlike Intel. (The worst case goes up because the possible size of input and result is larger.) AMD integer division is only 2 uops, unlike Intel's, which is microcoded as 10 or 36 uops on Skylake. (And even more for signed idiv r64, at 57 uops.) This is probably related to AMD being efficient for small numbers in wide registers.

BTW, FP division is always single-uop, because it's more performance-critical in normal code. (Hint: nobody uses totally naive trial division in real life for checking multiple primes if they care about performance at all. Sieve or something.)
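A sieve of Eratosthenes sketch of what that could look like for the low range of this program (the high range near MaxNum would need a segmented sieve, not shown):

```cpp
#include <vector>
#include <cstddef>

// Classic sieve: marks every prime below limit in O(n log log n)
// total, with no division at all. Hypothetical helper name.
std::vector<bool> SievePrimes(std::size_t limit)
{
    std::vector<bool> isPrime(limit, true);
    if (limit > 0) isPrime[0] = false;
    if (limit > 1) isPrime[1] = false;
    for (std::size_t i = 2; i * i < limit; i++)
        if (isPrime[i])
            for (std::size_t j = i * i; j < limit; j += i)
                isPrime[j] = false;   // cross off every multiple of i
    return isPrime;
}
```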


The key for the ordered map is a size_t, and pointers are larger in 64-bit code, making each red-black tree node significantly larger, but that's not the bottleneck.

BTW, map<> is a terrible choice here vs. two arrays of bool, prime_low[Count] and prime_high[Count]: one for the low Count elements and one for the high Count elements. You have 2 contiguous ranges, so the key can be implicit by position. Or at least use a std::unordered_map hash table. I feel like the ordered version should have been called ordered_map, and map = unordered_map, because you often see code using map without taking advantage of the ordering.

You could even use a std::vector<bool> to get a bitmap, using 1/8th the cache footprint.
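A sketch of that layout, using the question's IsPrime unchanged and a hypothetical FillPrimeTables helper:

```cpp
#include <vector>
#include <cstddef>

bool IsPrime(std::size_t num)   // the question's function, unchanged
{
    for (std::size_t i = 2; i < num; i++)
        if (num % i == 0)
            return false;
    return true;
}

// Two dense bitmaps instead of a red-black tree: the key is implicit
// in the index (i, and MaxNum - i), and vector<bool> packs 8 flags
// per byte.
void FillPrimeTables(std::size_t Count, std::size_t MaxNum,
                     std::vector<bool>& primeLow,
                     std::vector<bool>& primeHigh)
{
    primeLow.assign(Count, false);
    primeHigh.assign(Count, false);
    for (std::size_t i = 0; i < Count; i++)
    {
        primeLow[i]  = IsPrime(i);            // key = i
        primeHigh[i] = IsPrime(MaxNum - i);   // key = MaxNum - i
    }
}
```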

There is an "x32" ABI (32-bit pointers in long mode) which has the best of both worlds for processes that don't need more than 4G of virtual address space: small pointers for higher data density / smaller cache footprint in pointer-heavy data structures, but the advantages of a modern calling convention, more registers, baseline SSE2, and 64-bit integer registers for when you do need 64-bit math. 有一个“x32”ABI(长模式下的32位指针),它对于不需要超过4G虚拟地址空间的进程具有两全其美的优势:小指针用于提高数据密度/指针中较小的缓存占用空间 - 重要的数据结构,但现代调用约定的优点,更多的寄存器,基线SSE2和64位整数寄存器,当您需要64位数学时。 But unfortunately it's not very popular. 但不幸的是,它并不是很受欢迎。 It's only a little faster, so most people don't want a third version of every library. 它只是快一点,所以大多数人不想要每个库的第三个版本。

In this case, you could fix the source to use unsigned int (or uint32_t if you want to be portable to systems where int is only 16-bit), or uint_least32_t to avoid requiring a fixed-width type. You could do this only for the arg to IsPrime, or for the data structure as well. (But if you're optimizing, the key is implicit by position in an array, not explicit.)

You could even make a version of IsPrime with a 64-bit loop and a 32-bit loop, which selects based on the size of the input.
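A sketch of that dispatch (hypothetical helper names; both loops are the question's algorithm, unchanged):

```cpp
#include <cstdint>
#include <cstddef>
#include <limits>

// Fast path: 32-bit operands, so the hot loop uses 32-bit div.
static bool IsPrimeLoop32(std::uint32_t num)
{
    for (std::uint32_t i = 2; i < num; i++)
        if (num % i == 0)
            return false;
    return true;
}

// Slow path: full-width loop for inputs that don't fit in 32 bits.
static bool IsPrimeLoop64(std::uint64_t num)
{
    for (std::uint64_t i = 2; i < num; i++)
        if (num % i == 0)
            return false;
    return true;
}

// Dispatch on the input's magnitude, so a 64-bit build still gets
// the fast 32-bit divider for small numbers.
bool IsPrime(std::size_t num)
{
    if (num <= std::numeric_limits<std::uint32_t>::max())
        return IsPrimeLoop32(static_cast<std::uint32_t>(num));
    return IsPrimeLoop64(num);
}
```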

Extracted answer from the edited question:

It was caused by building 32-bit binaries on Windows as opposed to 64-bit binaries on Linux; here are the 64-bit numbers for Windows:

Visual Studio 2013 Debug 64b
    29.1985
Visual Studio 2013 Release 64b
    29.7469
