
Why are complicated memcpy/memset superior?

When debugging, I frequently step into the handwritten assembly implementations of memcpy and memset. These are usually implemented using streaming instructions if available, loop unrolled, alignment optimized, etc. I also recently encountered this 'bug' due to a memcpy optimization in glibc.

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific cases of

rep stos

and

rep movs

to be recognized as such, and do the fastest fill and copy possible on their own architecture?
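
(For reference, a memcpy built directly on that string instruction might look like the following minimal sketch. This is x86-64 GCC/Clang inline assembly; the function name is mine, and it assumes non-overlapping buffers.)

    #include <stddef.h>

    /* Minimal sketch: let the CPU's string instruction do the whole
       copy.  RDI = destination, RSI = source, RCX = count. */
    static void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
    {
        void *ret = dst;
        __asm__ volatile ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
        return ret;
    }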

Cost.

The cost of optimizing memcpy in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so, when processor features change enough to warrant a rewrite. For example, GNU's glibc and Apple's libSystem both have a memcpy which is specifically optimized for SSE3.
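
As a rough sketch of what such a specialized routine looks like (not glibc's actual code), here is a minimal SIMD copy loop using SSE2 intrinsics. It assumes both pointers are 16-byte aligned and the length is a multiple of 16; a real library version handles the general case with extra prologue/epilogue code:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Sketch of a SIMD copy loop: 16 bytes per iteration.
       Assumes 16-byte-aligned dst/src and n a multiple of 16. */
    static void copy_sse2_aligned(void *dst, const void *src, size_t n)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        for (size_t i = 0; i < n / 16; i++) {
            __m128i v = _mm_load_si128(&s[i]);   /* aligned 16-byte load  */
            _mm_store_si128(&d[i], v);           /* aligned 16-byte store */
        }
    }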

The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects:

  • Increased power consumption
  • Increased unit cost
  • Increased latency for certain CPU subsystems
  • Lower maximum clock speed

In theory, it could have an overall negative impact on both performance and unit cost.

Maxim: Don't do it in hardware if the software solution is good enough.

Note: The bug you've cited is not really a bug in glibc w.r.t. the C specification. It's more complicated. Basically, the glibc folks say that memcpy behaves exactly as advertised in the standard, and some other folks are complaining that memcpy should be aliased to memmove.
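
The distinction matters whenever the buffers overlap; a minimal illustration:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "abcdef";
        /* The regions overlap, so memcpy here would be undefined
           behavior: an optimized memcpy may copy in any order and
           scramble the result.  memmove is the correct call. */
        memmove(buf + 2, buf, 4);
        printf("%s\n", buf);   /* prints "ababcd" */
        return 0;
    }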

Time for a story: it reminds me of a complaint a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores, with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel, I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process.

But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors.

The moral of the story is that better performance comes from both software and hardware improvements.

In general, the trend seems to be in the opposite direction from the kind of hardware optimizations mentioned. While on x86 it's easy to write memcpy in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel's IA-64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which commands and which commands can be done at the same time, something that a modern x86 will do for you (unless it's an Atom). So writing memcpy suddenly gets much, much harder.

These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessors while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices is the major factor impeding wider adoption.

One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance:

Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned.

[Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].
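
In other words, a library memcpy on such processors can simply dispatch on size: a plain loop for small copies, where the string instruction's setup overhead dominates, and rep movsb for large ones. A minimal sketch (x86-64 GCC/Clang inline assembly; the 128-byte threshold is an illustrative guess, not a measured value):

    #include <stddef.h>

    /* Sketch of a size-based dispatch: small copies avoid the string
       instruction's setup cost, large ones let the microcode do it. */
    static void *copy_dispatch(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (n < 128) {                       /* small: simple byte loop */
            while (n--)
                *d++ = *s++;
        } else {                             /* large: rep movsb        */
            __asm__ volatile ("rep movsb"
                              : "+D" (d), "+S" (s), "+c" (n)
                              :
                              : "memory");
        }
        return dst;
    }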

General Purpose vs. Specialized

One factor is that those instructions (rep prefix/string instructions) are general purpose: they'll handle any alignment and any number of bytes or words, and they have certain behavior relative to the cache, the state of registers, etc.; that is, well-defined side effects that can't be changed.

A specialized memory copy may only work for certain alignments and sizes, and may behave differently with respect to the cache.

Hand-written assembly (either in the library or something developers implement themselves) may outperform the string-instruction implementation for the special cases where it is used. Compilers will often have several memcpy implementations for special cases, and then a developer may have a "very special" case where they roll their own.
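
For instance, a "very special" case might guarantee that both buffers are 64-byte aligned and that the length is a whole number of 64-byte blocks, so all alignment checks and tail handling disappear. A minimal sketch (the names are mine):

    #include <stddef.h>
    #include <stdint.h>

    /* Specialized copy: the caller guarantees 64-byte-aligned,
       non-overlapping buffers and a length that is a whole number of
       64-byte blocks, so there is no prologue, epilogue, or alignment
       check at all. */
    static void copy_blocks64(uint64_t *restrict dst,
                              const uint64_t *restrict src,
                              size_t nblocks)
    {
        while (nblocks--) {
            for (int i = 0; i < 8; i++)  /* 8 x 8 bytes = one block */
                dst[i] = src[i];
            dst += 8;
            src += 8;
        }
    }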

It doesn't make sense to do this specialization at the hardware level. Too much complexity (= cost).

The law of diminishing returns

Another way to think about it is that when new features are introduced, e.g. SSE, the designers make architectural changes to support those features, e.g. a wider or higher-bandwidth memory interface, changes to the pipeline, new execution units, etc. The designer is unlikely at this point to go back to the "legacy" portions of the design to try to bring them up to speed with the latest features; that would be kind of counter-productive. If you follow this philosophy, you may ask why we need SIMD in the first place: can't the designer just make the narrow instructions work as fast as SIMD for those cases where someone uses SIMD? The answer is usually that it's not worth it, because it is easier to throw in a new execution unit or new instructions.

In embedded systems, it's common to have specialized hardware that does memcpy/memset. It's not normally done as a special CPU instruction; rather, it's a DMA peripheral that sits on the memory bus. You write a couple of registers to tell it the addresses, and the hardware does the rest. It doesn't really warrant a special CPU instruction, since it's really just a memory interface issue that doesn't need to involve the CPU.
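
As a sketch of that "write a couple of registers" pattern (the base address, register layout, and control bits below are invented for illustration; every real DMA controller defines its own):

    #include <stdint.h>

    #define DMA_BASE 0x40001000u  /* hypothetical peripheral address */

    typedef struct {
        volatile uint32_t src;   /* source address            */
        volatile uint32_t dst;   /* destination address       */
        volatile uint32_t len;   /* transfer length in bytes  */
        volatile uint32_t ctrl;  /* bit 0: start, bit 1: busy */
    } dma_regs_t;

    static void dma_memcpy(uint32_t dst, uint32_t src, uint32_t len)
    {
        dma_regs_t *dma = (dma_regs_t *)DMA_BASE;
        dma->src  = src;
        dma->dst  = dst;
        dma->len  = len;
        dma->ctrl = 1u;              /* kick off the transfer */
        while (dma->ctrl & 2u)       /* spin until not busy   */
            ;
    }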

If it ain't broke, don't fix it. It ain't broke.

A primary problem is unaligned accesses. They go from bad to really bad depending on what architecture you are running on. A lot of it has to do with the programmers, some with the compilers.

The cheapest way to fix memcpy is to not use it: keep your data aligned on nice boundaries, and use or make an alternative to memcpy that only supports nice, aligned block copies. Even better would be a compiler switch that sacrifices program space and RAM for the sake of speed. Folks or languages that use a lot of structures, such that the compiler internally generates calls to memcpy (or whatever that language's equivalent is), would have their structures grow so that there is padding between or inside them; a 59-byte structure may become 64 bytes instead. Use malloc, or an alternative that only gives out pointers to addresses aligned as specified. Etc., etc.

It is considerably easier to just do all of this yourself: an aligned malloc, structures that are multiples of the alignment size, your own memcpy that assumes alignment, and so on. With it being that easy, why would the hardware folks mess up their designs, compilers, and users? There is no business case for it.
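
A minimal sketch of the padding-plus-aligned-allocation idea, using C11's aligned_alloc (the struct names are mine; aligned_alloc requires the size to be a multiple of the alignment, which the padding conveniently guarantees):

    #include <stdlib.h>

    struct payload { char data[59]; };   /* the 59-byte structure */

    struct padded {
        struct payload p;
        char pad[64 - sizeof(struct payload)];  /* grow 59 -> 64 bytes */
    };

    int main(void)
    {
        /* 64-byte alignment, size a multiple of 64 */
        struct padded *buf = aligned_alloc(64, sizeof(struct padded));
        if (!buf)
            return 1;
        /* ...copies of *buf now start and end on 64-byte boundaries... */
        free(buf);
        return 0;
    }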

Another reason is that caches have changed the picture. Your DRAM is only accessible in a fixed size (32 bits, 64 bits, something like that); any direct access smaller than that is a huge performance hit. Put a cache in front of it and the performance hit goes way down: any read-modify-write happens in the cache, with the modify allowing multiple modifications for a single read and write of DRAM. You still want to reduce the number of memory cycles to the cache, yes, and you can still see a performance gain by smoothing that out with the gear-shift thing (8-bit first gear, 16-bit second gear, 32-bit third gear, 64-bit cruising speed, then 32-bit shift down, 16-bit shift down, 8-bit shift down).
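
A sketch of that gear shift in C (only shifting between the 8-bit and 64-bit gears for brevity; the word-sized accesses assume the source ends up aligned too, which holds when both buffers start with the same misalignment):

    #include <stddef.h>
    #include <stdint.h>

    static void copy_gearshift(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* shift up: byte copies until the destination is 8-byte aligned */
        while (n && ((uintptr_t)d & 7u)) {
            *d++ = *s++;
            n--;
        }

        /* cruising speed: 64-bit words */
        while (n >= 8) {
            *(uint64_t *)d = *(const uint64_t *)s;
            d += 8;
            s += 8;
            n -= 8;
        }

        /* shift down: the remaining tail bytes */
        while (n--)
            *d++ = *s++;
    }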

I can't speak for Intel, but I do know that folks like ARM have done what you are asking. An

ldmia r0!,{r2,r3,r4,r5}

for example, is still four 32-bit transfers if the core uses a 32-bit interface. But on a 64-bit interface, if aligned on a 64-bit boundary, it becomes a 64-bit transfer with a length of two: one set of negotiations between the parties, and two 64-bit words move. If not aligned on a 64-bit boundary, it becomes three transfers: a single 32-bit, a single 64-bit, then a single 32-bit. You have to be careful: if these are hardware registers, that may not work depending on the design of the register logic; if it only supports single 32-bit transfers, you can't use that instruction against that address space. No clue why you would try something like that anyway.

The last comment is... it hurts when I do this... well, don't do that. Don't single-step into memory copies. The corollary is that there is no way anyone would modify the design of the hardware to make single-stepping a memory copy easier on the user; that use case is so small it doesn't exist. Take all the computers using that processor running at full speed day and night, measured against all the computers being single-stepped through mem copies and other performance-optimized code: it is like comparing a grain of sand to the width of the earth. If you are single-stepping, you are still going to have to single-step through whatever the new solution is, if there were one. To avoid huge interrupt latencies, the hand-tuned memcpy will still start with an if-then-else (if the copy is too small, just go into a small set of unrolled code or a byte-copy loop), then go into a series of block copies at some optimal speed without horrible latency. You will still have to single-step through that.

To do single-stepping debugging you have to compile screwed-up, slow code anyway, so the easiest way to solve the single-step-through-memcpy problem is to have the compiler and linker, when told to build for debug, build for and link against a non-optimized memcpy, or an alternate non-optimized library in general. gnu/gcc and llvm are open source; you can make them do whatever you want.
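
A deliberately naive, single-steppable copy is trivial to write and link in for debug builds (the function name is mine; with GCC you could instead override memcpy itself together with something like -fno-builtin-memcpy so the compiler doesn't inline its own version):

    #include <stddef.h>

    /* Naive copy for debug builds: one byte per iteration, no
       unrolling, no special cases.  Slow by design, easy to step. */
    void *memcpy_debug(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }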

Once upon a time, rep movsb was the optimal solution.

The original IBM PC had an 8088 processor with an 8-bit data bus and no caches. Back then, the fastest program was generally the one with the fewest instruction bytes. Having special instructions helped.

Nowadays, the fastest program is the one that can use as many CPU features as possible in parallel. Strange as it might seem at first, code with many simple instructions can actually run faster than a single do-it-all instruction.

Intel and AMD keep the old instructions around mainly for backward compatibility.
