
Performance of built-in types: char vs. short vs. int vs. float vs. double

This may appear to be a bit of a stupid question, but after seeing Alexandre C's reply in another topic, I'm curious to know whether there is any performance difference among the built-in types:

char vs. short vs. int vs. float vs. double.

Usually we don't consider such performance differences (if any) in our real-life projects, but I would like to know this for educational purposes. The general questions to ask are:

  • Is there any performance difference between integer arithmetic and floating-point arithmetic?

  • Which is faster? What is the reason for being faster? Please explain this.

Float vs. integer:

Historically, floating-point could be much slower than integer arithmetic. On modern computers, this is no longer really the case (it is somewhat slower on some platforms, but unless you write perfect code and optimize for every cycle, the difference will be swamped by the other inefficiencies in your code).

On somewhat limited processors, like those in high-end cell phones, floating-point may be somewhat slower than integer, but it's generally within an order of magnitude (or better), so long as there is hardware floating-point available. It's worth noting that this gap is closing pretty rapidly as cell phones are called on to run more and more general computing workloads.

On very limited processors (cheap cell phones and your toaster), there is generally no floating-point hardware, so floating-point operations need to be emulated in software. This is slow -- a couple orders of magnitude slower than integer arithmetic.

As I said though, people are expecting their phones and other devices to behave more and more like "real computers", and hardware designers are rapidly beefing up FPUs to meet that demand. Unless you're chasing every last cycle, or you're writing code for very limited CPUs that have little or no floating-point support, the performance distinction doesn't matter to you.

Different-size integer types:

Typically, CPUs are fastest at operating on integers of their native word size (with some caveats about 64-bit systems). 32-bit operations are often faster than 8- or 16-bit operations on modern CPUs, but this varies quite a bit between architectures. Also, remember that you can't consider the speed of a CPU in isolation; it's part of a complex system. Even if operating on 16-bit numbers is 2x slower than operating on 32-bit numbers, you can fit twice as much data into the cache hierarchy when you represent it with 16-bit numbers instead of 32 bits. If that makes the difference between having all your data come from cache instead of taking frequent cache misses, then the faster memory access will trump the slower operation of the CPU.

Other notes:

Vectorization tips the balance further in favor of narrower types (float and 8- and 16-bit integers) -- you can do more operations in a vector of the same width. However, good vector code is hard to write, so it's not as though you get this benefit without a lot of careful work.

Why are there performance differences?

There are really only two factors that affect whether or not an operation is fast on a CPU: the circuit complexity of the operation, and user demand for the operation to be fast.

(Within reason) any operation can be made fast, if the chip designers are willing to throw enough transistors at the problem. But transistors cost money (or rather, using lots of transistors makes your chip larger, which means you get fewer chips per wafer and lower yields, which costs money), so chip designers have to balance how much complexity to use for which operations, and they do this based on (perceived) user demand. Roughly, you might think of breaking operations into four categories:

                 high demand            low demand
high complexity  FP add, multiply       division
low complexity   integer add            popcount, hcf
                 boolean ops, shifts

high-demand, low-complexity operations will be fast on nearly any CPU: they're the low-hanging fruit, and confer maximum user benefit per transistor.

high-demand, high-complexity operations will be fast on expensive CPUs (like those used in computers), because users are willing to pay for them. You're probably not willing to pay an extra $3 for your toaster to have a fast FP multiply, however, so cheap CPUs will skimp on these instructions.

low-demand, high-complexity operations will generally be slow on nearly all processors; there just isn't enough benefit to justify the cost.

low-demand, low-complexity operations will be fast if someone bothers to think about them, and non-existent otherwise.

Further reading:

  • Agner Fog maintains a nice website with lots of discussion of low-level performance details (and has very scientific data collection methodology to back it up).
  • The Intel® 64 and IA-32 Architectures Optimization Reference Manual (PDF download link is part way down the page) covers a lot of these issues as well, though it is focused on one specific family of architectures.

Absolutely.

First, of course, it depends entirely on the CPU architecture in question.

However, integral and floating-point types are handled very differently, so the following is nearly always the case:

  • For simple operations, integral types are fast. For example, integer addition often has only a single cycle's latency, and integer multiplication is typically around 2-4 cycles, IIRC.
  • Floating-point types used to perform much slower. On today's CPUs, however, they have excellent throughput, and each floating-point unit can usually retire an operation per cycle, leading to the same (or similar) throughput as for integer operations. However, latency is generally worse. Floating-point addition often has a latency around 4 cycles (vs. 1 for ints).
  • For some complex operations, the situation is different, or even reversed. For example, division on FP may have less latency than for integers, simply because the operation is complex to implement in both cases, but it is more commonly useful on FP values, so more effort (and transistors) may be spent optimizing that case.

On some CPUs, doubles may be significantly slower than floats. On some architectures, there is no dedicated hardware for doubles, and so they are handled by passing two float-sized chunks through, giving you a worse throughput and twice the latency. On others (the x86 FPU, for example), both types are converted to the same internal format (80-bit floating point, in the case of x86), so performance is identical. On yet others, both float and double have proper hardware support, but because float has fewer bits, it can be done a bit faster, typically reducing the latency a bit relative to double operations.

Disclaimer: all the mentioned timings and characteristics are just pulled from memory. I didn't look any of it up, so it may be wrong. ;)

For different integer types, the answer varies wildly depending on CPU architecture. The x86 architecture, due to its long convoluted history, has to support 8-, 16-, 32- (and today 64-) bit operations natively, and in general, they're all equally fast (they use basically the same hardware, and just zero out the upper bits as needed).

However, on other CPUs, datatypes smaller than an int may be more costly to load/store (writing a byte to memory might have to be done by loading the entire 32-bit word it is located in, then doing bit masking to update the single byte in a register, and then writing the whole word back). Likewise, for datatypes larger than int, some CPUs may have to split the operation into two, loading/storing/computing the lower and upper halves separately.

But on x86, the answer is that it mostly doesn't matter. For historical reasons, the CPU is required to have pretty robust support for each and every data type. So the only difference you're likely to notice is that floating-point ops have more latency (but similar throughput, so they're not slower per se, at least if you write your code correctly).

I don't think anyone mentioned the integer promotion rules. In standard C/C++, no operation can be performed on a type smaller than int. If char or short happen to be smaller than int on the current platform, they are implicitly promoted to int (which is a major source of bugs). The compiler is required to do this implicit promotion; there's no way around it without violating the standard.

The integer promotions mean that no operation (addition, bitwise, logical, etc.) in the language can occur on a smaller integer type than int. Thus, operations on char/short/int are generally equally fast, as the former ones are promoted to the latter.

And on top of the integer promotions, there are the "usual arithmetic conversions", meaning that C strives to make both operands the same type, converting one of them to the larger of the two, should they be different.

However, the CPU can perform various load/store operations at the 8-, 16-, 32-bit (etc.) level. On 8- and 16-bit architectures, this often means that 8- and 16-bit types are faster despite the integer promotions. On a 32-bit CPU it might actually mean that the smaller types are slower, because it wants to have everything neatly aligned in 32-bit chunks. 32-bit compilers typically optimize for speed and allocate smaller integer types in larger space than specified.

Generally, though, the smaller integer types of course take less space than the larger ones, so if you intend to optimize for RAM size, they are preferable.

The first answer above is great and I copied a small block of it across to the following duplicate (as this is where I ended up first).

Are "char" and "small int" slower than "int"?

I'd like to offer the following code, which profiles allocating, initializing and doing some arithmetic on the various integer sizes:

#include <iostream>
#include <cstdint>   // int8_t, int16_t, int32_t, int64_t
#include <cstdio>    // sprintf_s (MSVC)
#include <windows.h> // QueryPerformanceCounter / QueryPerformanceFrequency

using std::cout; using std::cin; using std::endl;

LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;

void inline showElapsed(const char activity [])
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    cout << activity << " took: " << ElapsedMicroseconds.QuadPart << "us" << endl;
}

int main()
{
    cout << "Hallo!" << endl << endl;

    QueryPerformanceFrequency(&Frequency);

    const int32_t count = 1100100;
    char activity[200];

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 8 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int8_t *data8 = new int8_t[count];
    for (int i = 0; i < count; i++)
    {
        data8[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 8 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data8[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 16 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int16_t *data16 = new int16_t[count];
    for (int i = 0; i < count; i++)
    {
        data16[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 16 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data16[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//    
    sprintf_s(activity, "Initialise & Set %d 32 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int32_t *data32 = new int32_t[count];
    for (int i = 0; i < count; i++)
    {
        data32[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 32 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data32[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    //-----------------------------------------------------------------------------------------//
    sprintf_s(activity, "Initialise & Set %d 64 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    int64_t *data64 = new int64_t[count];
    for (int i = 0; i < count; i++)
    {
        data64[i] = i;
    }
    showElapsed(activity);

    sprintf_s(activity, "Add 5 to %d 64 bit integers", count);
    QueryPerformanceCounter(&StartingTime);

    for (int i = 0; i < count; i++)
    {
        data64[i] = i + 5;
    }
    showElapsed(activity);
    cout << endl;
    //-----------------------------------------------------------------------------------------//

    getchar();
}


/*
My results in MSVC on an i7 4790k:

Initialise & Set 1100100 8 bit integers took: 444us
Add 5 to 1100100 8 bit integers took: 358us

Initialise & Set 1100100 16 bit integers took: 666us
Add 5 to 1100100 16 bit integers took: 359us

Initialise & Set 1100100 32 bit integers took: 870us
Add 5 to 1100100 32 bit integers took: 276us

Initialise & Set 1100100 64 bit integers took: 2201us
Add 5 to 1100100 64 bit integers took: 659us
*/


Is there any performance difference between integer arithmetic and floating-point arithmetic?

Yes. However, this is very much platform- and CPU-specific. Different platforms can do different arithmetic operations at different speeds.

That being said, the reply in question was a bit more specific. pow() is a general-purpose routine that works on double values. By feeding it integer values, it's still doing all of the work that would be required to handle non-integer exponents. Using direct multiplication bypasses a lot of the complexity, which is where the speed comes into play. This is really not an issue (so much) of different types, but rather of bypassing the large amount of complex code required to make pow work with any exponent.

Depends on the composition of the processor and platform.

Platforms that have a floating-point coprocessor may be slower than integral arithmetic due to the fact that values have to be transferred to and from the coprocessor.

If floating-point processing is within the core of the processor, the execution time may be negligible.

If the floating-point calculations are emulated by software, then integral arithmetic will be faster.

When in doubt, profile.

Get the program working correctly and robustly before optimizing.

No, not really. This of course depends on CPU and compiler, but the performance difference is typically negligible, if there even is any.

There is certainly a difference between floating-point and integer arithmetic. Depending on the CPU's specific hardware and micro-instructions, you get different performance and/or precision. Good Google terms for the precise descriptions (I don't know exactly either):

FPU x87 MMX SSE

With regards to the size of the integers, it is best to use the platform/architecture word size (or double that), which comes down to an int32_t on x86 and int64_t on x86_64. Some processors have intrinsic instructions that handle several of these values at once (like SSE (floating point) and MMX), which will speed up parallel additions or multiplications.

Generally, integer math is faster than floating-point math. This is because integer math involves simpler computations. However, in most operations we're talking about less than a dozen clocks. Not millis, micros, nanos, or ticks; clocks. The ones that happen between 2-3 billion times per second in modern cores. Also, since the 486, a lot of cores have had a set of floating-point processing units, or FPUs, which are hard-wired to perform floating-point arithmetic efficiently, and often in parallel with the CPU.

As a result, though it is technically slower, floating-point calculation is still so fast that any attempt to time the difference would have more error inherent in the timing mechanism and thread scheduling than it actually takes to perform the calculation. Use ints when you can, but understand when you can't, and don't worry too much about relative calculation speed.
