Why are complicated memcpy/memset superior?

Question

When debugging, I have frequently stepped into the handwritten assembly implementations of memcpy and memset. These are usually implemented using streaming instructions if available, with loop unrolling, alignment optimization, etc. I also recently encountered this 'bug' due to a memcpy optimization in glibc.

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of

rep stos

and

rep movs

to be recognized as such, and do the fastest fill and copy as possible on their own architecture?

Answer 1

Cost.

The cost of optimizing memcpy in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so when processor features change enough to warrant a rewrite. For example, GNU's glibc and Apple's libSystem both have a memcpy which is specifically optimized for SSE3.

The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects:

Increased power consumption
Increased unit cost
Increased latency for certain CPU subsystems
Lower maximum clock speed

In theory, it could have an overall negative impact on both performance and unit cost.

Maxim: Don't do it in hardware if the software solution is good enough.

Note: The bug you've cited is not really a bug in glibc with respect to the C specification. It's more complicated. Basically, the glibc folks say that memcpy behaves exactly as advertised in the standard, and some other folks are complaining that memcpy should be aliased to memmove.
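To make the memcpy/memmove distinction concrete: the standard only leaves memcpy undefined when the source and destination overlap, which is exactly the case memmove is specified to handle. A minimal sketch, where `shift_right` is a hypothetical helper name used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the first n bytes of buf right by `shift`
 * positions. Source and destination overlap whenever shift < n, so
 * memmove is required here; memcpy on overlapping regions is undefined
 * behavior in standard C. The caller must guarantee that buf holds at
 * least n + shift bytes. */
static void shift_right(char *buf, size_t n, size_t shift)
{
    assert(buf != NULL);
    memmove(buf + shift, buf, n);
}
```

Code that called memcpy in a spot like this happened to work with older forward-copying implementations, which is why an optimized memcpy can expose it.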

Time for a story: It reminds me of a complaint that a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process.

But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors.

The moral of the story is that better performance comes both from software and hardware improvements.

In general, the trend seems to be in the opposite direction from the kind of hardware optimizations mentioned. While in x86 it's easy to write memcpy in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel IA64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which commands and which commands can be done at the same time, something that a modern x86 will do for you (unless it's an Atom). So writing memcpy suddenly gets much, much harder.

These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessors while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices is the major factor impeding wider adoption.

Answer 2

One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance,

Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned.

[Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, p. 156 (see also section 16.10, p. 143) [version of 2011-06-08].
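The trade-off described in the quote (setup overhead for small blocks, good throughput for large ones) suggests dispatching on size. The following is only a sketch under assumptions: `copy_dispatch` and `REP_MOVSB_THRESHOLD` are made-up names, the cutoff of 256 bytes is an arbitrary placeholder rather than a measured value, and the inline assembly uses GCC/Clang syntax on x86 only. Like memcpy, it assumes the regions do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cutoff for illustration only; the real break-even point
 * depends on the CPU and has to be measured. */
#define REP_MOVSB_THRESHOLD 256

static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(dst != NULL && src != NULL);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (n >= REP_MOVSB_THRESHOLD) {
        /* rep movsb copies (R/E)CX bytes from [(R/E)SI] to [(R/E)DI].
         * The calling conventions keep the direction flag clear, so
         * the copy runs forward. */
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }
#endif
    /* Small blocks, and all non-x86 builds: defer to the library. */
    return memcpy(ret, src, n);
}
```

Whether the rep movsb path actually wins depends on the microarchitecture, as the quoted passage points out.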

Answer 3

General Purpose vs. Specialized

One factor is that those instructions (the rep prefix and string instructions) are general purpose, so they'll handle any alignment and any number of bytes or words, and they'll have certain behavior relative to the cache and/or the state of registers, i.e. well-defined side effects that can't be changed.

The specialized memory copy may only work for certain alignments, sizes, and may have different behavior vs. the cache.

The handwritten assembly (either in the library or one that developers implement themselves) may outperform the string instruction implementation for the special cases where it is used. Compilers will often have several memcpy implementations for special cases, and then the developer may have a "very special" case where they roll their own.
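As an illustration of what "specialized" can mean, here is a sketch of a copy routine that only accepts the easy case. The name `copy_aligned64` and its exact preconditions are assumptions made for this example, not something taken from a real library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical specialized copy: valid only when both pointers are
 * 8-byte aligned, nbytes is a multiple of 8, and the regions do not
 * overlap. In exchange, the loop needs no head or tail handling. */
static void copy_aligned64(uint64_t *restrict dst,
                           const uint64_t *restrict src,
                           size_t nbytes)
{
    assert(((uintptr_t)dst % 8) == 0);
    assert(((uintptr_t)src % 8) == 0);
    assert(nbytes % 8 == 0);

    for (size_t i = 0; i < nbytes / 8; i++)
        dst[i] = src[i];
}
```

A general-purpose instruction cannot assume any of these preconditions, which is the point made above.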

It doesn't make sense to do this specialization at the hardware level. Too much complexity (= cost).

The law of diminishing returns

Another way to think about it is that when new features are introduced, e.g. SSE, the designers make architectural changes to support them, e.g. a wider or higher-bandwidth memory interface, changes to the pipeline, new execution units, etc. The designer is unlikely at this point to go back to the "legacy" portion of the design and bring it up to speed with the latest features. That would be counter-productive. If you follow this philosophy, you may ask why we need SIMD in the first place: can't the designer just make the narrow instructions work as fast as SIMD for those cases where someone uses SIMD? The answer is usually that it's not worth it, because it is easier to throw in a new execution unit or new instructions.

Answer 4

In embedded systems, it's common to have specialized hardware that does memcpy/memset. It's not normally done as a special CPU instruction, rather it's a DMA peripheral that sits on the memory bus. You write a couple of registers to tell it the addresses, and HW does the rest. It doesn't really warrant a special CPU instruction since it's really just a memory interface issue that doesn't really need to involve the CPU.
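The "write a couple of registers" step might look roughly like the sketch below. Everything here is hypothetical: the register layout, the `DMA_START`/`DMA_DONE` bits, and the function names are invented for illustration, and `dma_simulate` stands in for the hardware so the example can run on a desktop machine. Real controllers differ in layout and usually signal completion with an interrupt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register block; a real one would be memory mapped. */
struct dma_regs {
    volatile uintptr_t src;
    volatile uintptr_t dst;
    volatile size_t    len;
    volatile uint32_t  ctrl;   /* bit 0: start, bit 1: done */
};
#define DMA_START 0x1u
#define DMA_DONE  0x2u

/* Stand-in for the hardware: performs the programmed transfer. */
static void dma_simulate(struct dma_regs *r)
{
    if (r->ctrl & DMA_START) {
        memcpy((void *)r->dst, (const void *)r->src, r->len);
        r->ctrl = DMA_DONE;
    }
}

/* Driver side: program the addresses and length, then start. */
static void dma_copy(struct dma_regs *r, void *dst,
                     const void *src, size_t n)
{
    r->src  = (uintptr_t)src;
    r->dst  = (uintptr_t)dst;
    r->len  = n;
    r->ctrl = DMA_START;

    dma_simulate(r);   /* on real hardware the engine runs by itself */

    while (!(r->ctrl & DMA_DONE))
        ;              /* the CPU could do other work here instead */
}
```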

Answer 5

If it ain't broke, don't fix it. It ain't broke.

A primary problem is unaligned accesses. They go from bad to really bad depending on what architecture you are running on. A lot of it has to do with the programmers, some with the compilers.

The cheapest way to fix memcpy is to not use it: keep your data aligned on nice boundaries and use or make an alternative to memcpy that only supports nicely aligned block copies. Even better would be a compiler switch that sacrifices program space and RAM for the sake of speed. Folks or languages that use a lot of structures, such that the compiler internally generates calls to memcpy (or whatever that language's equivalent is), would have their structures grow, with padding between or inside them. A 59 byte structure may become 64 bytes instead. The same goes for malloc, or an alternative that only returns pointers aligned as specified, and so on.

It is considerably easier to just do all of this yourself: an aligned malloc, structures that are multiples of the alignment size, your own memcpy that is aligned, etc. With it being that easy, why would the hardware folks mess up their designs, compilers, and users? There is no business case for it.
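The padding and aligned-allocation ideas can be sketched in C11 as follows. The names `struct record`, `BLOCK_ALIGN`, and `alloc_records` are assumptions for the example, and the 64 byte alignment simply matches the 59-to-64 figure mentioned above:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_ALIGN 64

/* 59 bytes of payload; the alignment request makes the compiler pad
 * the structure out to 64 bytes. */
struct record {
    alignas(BLOCK_ALIGN) unsigned char payload[59];
};

static_assert(sizeof(struct record) == 64,
              "record should be padded to one 64 byte block");

/* aligned_alloc (C11) expects the size to be a multiple of the
 * alignment, which holds here because sizeof(struct record) is 64.
 * Returns NULL on failure; release the memory with free(). */
static struct record *alloc_records(size_t count)
{
    return aligned_alloc(BLOCK_ALIGN, count * sizeof(struct record));
}
```

With every record starting on a 64 byte boundary and spanning exactly 64 bytes, a copy routine can assume aligned, whole-block transfers.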

Another reason is that caches have changed the picture. Your DRAM is only accessible in a fixed size (32 bits, 64 bits, something like that), and any direct access smaller than that is a huge performance hit. Put a cache in front of it and the performance hit goes way down: any read-modify-write happens in the cache, which allows multiple modifies for a single read and write of DRAM. You still want to reduce the number of memory cycles to the cache, yes, and you can still see a performance gain by smoothing that out with the gear shift approach (8 bit first gear, 16 bit second gear, 32 bit third gear, 64 bit cruising speed, then 32 bit, 16 bit, and 8 bit shifts down).
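The gear shift idea can be written out as a sketch in portable C. The name `gearshift_copy` is made up for this example. The fixed-size memcpy calls stand in for single 8, 16, 32, and 64 bit loads and stores (compilers typically lower them to exactly that), which avoids aliasing problems in C. Only the destination is brought into alignment here, which is a simplification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `w` bytes as one unit and advance the cursors. */
#define STEP(w) do { memcpy(d, s, (w)); d += (w); s += (w); n -= (w); } while (0)

static void *gearshift_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Shift up: 8, 16, then 32 bit steps until d is 8-byte aligned. */
    if (n >= 1 && ((uintptr_t)d & 1)) STEP(1);
    if (n >= 2 && ((uintptr_t)d & 2)) STEP(2);
    if (n >= 4 && ((uintptr_t)d & 4)) STEP(4);

    /* Cruising speed: 64 bit transfers. */
    while (n >= 8) STEP(8);

    /* Shift down: 32, 16, then 8 bit steps for the tail. */
    if (n >= 4) STEP(4);
    if (n >= 2) STEP(2);
    if (n >= 1) STEP(1);

    return dst;
}
```

A production routine would also consider the source alignment and would likely unroll the cruising loop.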

I can't speak for Intel, but I do know that folks like ARM have done what you are asking. An

ldmia r0!,{r2,r3,r4,r5}

for example, is still four 32 bit transfers if the core uses a 32 bit interface. But for 64 bit interfaces, if aligned on a 64 bit boundary, it becomes a 64 bit transfer with a length of two: one set of negotiations between the parties, and two 64 bit words move. If not aligned on a 64 bit boundary, it becomes three transfers: a single 32 bit, a single 64 bit, then a single 32 bit. You have to be careful: if these are hardware registers, that may not work, depending on the design of the register logic. If it only supports single 32 bit transfers, you can't use that instruction against that address space. No clue why you would try something like that anyway.

The last comment is: "it hurts when I do this"... well, don't do that. Don't single step into memory copies. The corollary is that there is no way anyone would modify the design of the hardware to make single stepping a memory copy easier on the user; that use case is so small it doesn't exist. Take all the computers using that processor running at full speed day and night, measured against all the computers being single stepped through memory copies and other performance-optimized code. It is like comparing a grain of sand to the width of the earth. If you are single stepping, you are still going to have to single step through whatever the new solution is, if there were one. To avoid huge interrupt latencies, the hand-tuned memcpy will still start with an if-then-else (if the copy is too small, just go into a small set of unrolled code or a byte copy loop), then go into a series of block copies sized for optimal speed without horrible latency. You will still have to single step through that.

To do single-step debugging you have to compile screwed up, slow code anyway. The easiest way to solve the single-step-through-memcpy problem is to have the compiler and linker, when told to build for debug, build and link against a non-optimized memcpy, or an alternate non-optimized library in general. GNU/GCC and LLVM are open source; you can make them do whatever you want.
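One low-tech way to get that effect inside your own project, without touching the toolchain, is sketched below. `simple_memcpy`, `COPY_BYTES`, and `DEBUG_SIMPLE_MEMCPY` are assumed, project-defined names, not compiler or libc features:

```c
#include <stddef.h>

/* Plain byte loop: slow, but trivial to step through in a debugger
 * when built without optimization (e.g. -O0). */
static void *simple_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Debug builds define DEBUG_SIMPLE_MEMCPY to route copies through
 * the loop above; other builds use the library memcpy. */
#ifdef DEBUG_SIMPLE_MEMCPY
#define COPY_BYTES simple_memcpy
#else
#include <string.h>
#define COPY_BYTES memcpy
#endif
```

Call sites use `COPY_BYTES(dst, src, n)`. This only covers explicit calls; copies the compiler generates on its own (for structure assignment, say) still go to the library routine.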

Answer 6

Once upon a time rep movsb was the optimal solution.

The original IBM PC had an 8088 processor with an 8-bit data bus and no caches. Back then, the fastest program was generally the one with the fewest instruction bytes. Having special instructions helped.

Nowadays, the fastest program is the one that can use as many CPU features as possible in parallel. Strange as it might seem at first, having code with many simple instructions can actually run faster than a single do-it-all instruction.

Intel and AMD keep the old instructions around mainly for backward compatibility.

6 answers

solution1
24 ACCEPTED 2012-01-14 00:28:33

solution2
15 2012-02-07 13:47:32

solution3
5 2012-01-14 00:16:00

solution4
1 2012-01-14 01:44:31

solution5
1 2012-01-14 03:56:16

solution6
1 2012-01-14 14:47:22
