
What are real significant cases when memcpy() is faster than memmove()?

The key difference between memcpy() and memmove() is that memmove() works correctly when the source and destination overlap. When the buffers are known not to overlap, memcpy() is preferable since it is potentially faster.

What bothers me is the word "potentially". Is it a micro-optimization, or are there real, significant cases where memcpy() is faster, so that we really need to use memcpy() rather than sticking to memmove() everywhere?

For memmove() there is at least an implicit branch to copy either forwards or backwards, unless the compiler can deduce that overlap is impossible. So when the call cannot be optimized into memcpy(), memmove() is slower by at least one branch, plus any additional space occupied by inlined instructions to handle each case (if inlining is possible).

Reading the eglibc-2.11.1 code for both memcpy() and memmove() confirms this. Furthermore, page copying is not possible during a backward copy; it is a significant speedup that is only available when there is no chance of overlap.

In summary: if you can guarantee the regions do not overlap, then choosing memcpy() over memmove() avoids a branch. If the source and destination contain corresponding page-aligned, page-sized regions and don't overlap, some architectures can employ hardware-accelerated copies for those regions, regardless of whether you called memmove() or memcpy().
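The extra branch described above can be illustrated with a deliberately naive byte-at-a-time sketch. This is not the eglibc code; the names sketch_memcpy and sketch_memmove are hypothetical, and the address comparison through uintptr_t assumes a flat address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forward-only copy: valid because the caller promises no overlap. */
void *sketch_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Same loop, plus one comparison and branch that picks the direction. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Assumption: converting to uintptr_t gives comparable flat addresses. */
    if ((uintptr_t)d <= (uintptr_t)s) {
        while (n--)
            *d++ = *s++;      /* destination is lower: copy forwards */
    } else {
        d += n;
        s += n;
        while (n--)
            *--d = *--s;      /* destination is higher: copy backwards */
    }
    return dst;
}
```

With char a[] = "abcdefgh", calling sketch_memmove(a + 2, a, 4) takes the backward path and leaves "ababcdgh" in the buffer, whereas the forward-only loop would read bytes it had already overwritten.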

Update0

There is actually one more difference beyond the assumptions and observations listed above. As of C99, the two functions have the following prototypes:

void *memcpy(void * restrict s1, const void * restrict s2, size_t n);
void *memmove(void * s1, const void * s2, size_t n);

Because it can assume that the two pointers s1 and s2 do not point at overlapping memory, a straightforward C implementation of memcpy can leverage this to generate more efficient code without resorting to assembler; see here for more. I'm sure memmove can do this too, but it would need additional checks beyond those I saw in eglibc, meaning the performance cost may be slightly more than a single branch for C implementations of these functions.
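As a rough illustration of what restrict buys, here are two otherwise identical word-copy loops. The function names are hypothetical, and whether a given compiler actually generates better code for the second one depends on the compiler and its flags; the point is only that it is allowed to.

```c
#include <assert.h>
#include <stddef.h>

/* No restrict: the compiler must allow for dst[i] aliasing some src[j],
   so it may have to add a runtime overlap check or copy conservatively. */
void copy_words(unsigned *dst, const unsigned *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* restrict: the caller promises the arrays do not overlap, so loads and
   stores may be reordered, widened, or batched freely. */
void copy_words_restrict(unsigned *restrict dst,
                         const unsigned *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions give the same result for non-overlapping arrays; calling the restrict version with overlapping arrays is undefined behavior, just as it is for memcpy.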

At best, calling memcpy rather than memmove will save a pointer comparison and a conditional branch. For a large copy, this is completely insignificant. If you are doing many small copies, then it might be worth measuring the difference; that is the only way you can tell whether it's significant or not.

It is definitely a microoptimisation, but that doesn't mean you shouldn't use memcpy when you can easily prove that it is safe. Premature pessimisation is the root of much evil.
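The measurement suggested above might be sketched as follows. The helper time_small_copies is hypothetical, the 64-byte buffers are an arbitrary choice, and clock() is a coarse timer, so treat the numbers as a first impression only. Calling through a function pointer also prevents the compiler from inlining the copy, so this measures the library routines rather than what an optimizer would emit at a direct call site.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Time `iters` copies of `size` bytes (at most 64) using `fn`.
   Returns elapsed CPU time in seconds. */
double time_small_copies(copy_fn fn, size_t size, size_t iters)
{
    static unsigned char src[64], dst[64];
    volatile unsigned char sink = 0;   /* keeps the copies observable */

    assert(size <= sizeof src);

    clock_t start = clock();
    for (size_t i = 0; i < iters; i++) {
        src[0] = (unsigned char)i;     /* vary the input each round */
        fn(dst, src, size);
        sink = (unsigned char)(sink + dst[0]);
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing time_small_copies(memcpy, 16, 100000000) with time_small_copies(memmove, 16, 100000000) on the target platform gives a first indication of whether the difference matters there.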

Well, memmove has to copy backwards when the source and destination overlap and the source is before the destination. So, some implementations of memmove simply copy backwards whenever the source is before the destination, without regard for whether the two regions overlap.

A quality implementation of memmove can detect whether the regions overlap, and do a forward copy when they don't. In such a case, the only extra overhead compared to memcpy is the overlap check itself.
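A minimal sketch of such an overlap check, assuming a flat address space in which converting pointers to uintptr_t yields comparable addresses (the C standard does not guarantee this). All three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if the n-byte regions starting at a and b share at least one byte. */
bool regions_overlap(const void *a, const void *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t gap = pa > pb ? pa - pb : pb - pa;  /* distance between starts */
    return n != 0 && gap < n;
}

/* Take the memcpy path whenever it is safe; fall back to memmove otherwise. */
void *move_prefer_forward(void *dst, const void *src, size_t n)
{
    if (!regions_overlap(dst, src, n))
        return memcpy(dst, src, n);
    return memmove(dst, src, n);
}

/* The flip side: memcpy with its no-overlap contract checked in debug builds. */
void *checked_memcpy(void *dst, const void *src, size_t n)
{
    assert(!regions_overlap(dst, src, n));
    return memcpy(dst, src, n);
}
```

The check costs one subtraction and one comparison, which is the overhead referred to above; compiling with NDEBUG removes the check from checked_memcpy entirely.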

Simplistically, memmove needs to test for overlap and then do the appropriate thing; with memcpy, one asserts that there is no overlap, so there is no need for additional tests.

Having said that, I have seen platforms that have exactly the same code for memcpy and memmove.

It's certainly possible that memcpy is merely a call to memmove, in which case there'd be no benefit to using memcpy. At the other extreme, it's possible that an implementor assumed memmove would rarely be used, and implemented it with the simplest possible byte-at-a-time loop in C, in which case it could be ten times slower than an optimized memcpy. As others have said, the likeliest case is that memmove uses memcpy when it detects that a forward copy is possible, but some implementations may simply compare the source and destination addresses without looking for overlap.

With that said, I would recommend never using memmove unless you're shifting data within a single buffer. It might not be slower, but then again, it might be, so why risk it when you know there's no need for memmove?
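Shifting data within a single buffer is the case where the regions genuinely overlap. A small sketch with hypothetical helpers remove_at and insert_at, operating on a plain int array:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove arr[idx] by sliding the tail down one slot. Returns the new length. */
size_t remove_at(int *arr, size_t len, size_t idx)
{
    assert(idx < len);
    /* Source and destination overlap, so memcpy would be undefined here. */
    memmove(&arr[idx], &arr[idx + 1], (len - idx - 1) * sizeof arr[0]);
    return len - 1;
}

/* Insert value at idx by sliding the tail up one slot. Returns the new length.
   cap is the capacity of arr in elements. */
size_t insert_at(int *arr, size_t len, size_t cap, size_t idx, int value)
{
    assert(len < cap && idx <= len);
    memmove(&arr[idx + 1], &arr[idx], (len - idx) * sizeof arr[0]);
    arr[idx] = value;
    return len + 1;
}
```

remove_at moves data towards lower addresses and insert_at towards higher ones, so between them they exercise both copy directions that memmove has to support.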

Just simplify and always use memmove. A function that's right all the time is better than a function that's only right half the time.

It is entirely possible that in most implementations, the cost of a memmove() function call will not be significantly greater than memcpy() in any scenario in which the behavior of both is defined. There are two points not yet mentioned, though:

  1. In some implementations, the determination of address overlap may be expensive. There is no way in standard C to determine whether the source and destination objects point to the same allocated area of memory, and thus no way that the greater-than or less-than operators can be used upon them without spontaneously causing cats and dogs to get along with each other (or invoking other Undefined Behavior). It is likely that any practical implementation will have some efficient means of determining whether or not the pointers overlap, but the standard doesn't require that such a means exist. A memmove() function written entirely in portable C would on many platforms probably take at least twice as long to execute as would a memcpy() also written entirely in portable C.
  2. Implementations are allowed to expand functions in-line when doing so would not alter their semantics. On an 80x86 compiler, if the ESI and EDI registers don't happen to hold anything important, a memcpy(dest, src, 1234) could generate code:

      mov esi,[src]
      mov edi,[dest]
      mov ecx,1234/4 ; Compiler could notice it's a constant
      cld
      rep movsl

    This would take the same amount of in-line code, but run much faster than:

      push [src]
      push [dest]
      push dword 1234
      call _memcpy

      ...

    _memcpy:
      push ebp
      mov ebp,esp
      mov ecx,[ebp+numbytes]
      test ecx,3 ; See if it's a multiple of four
      jz multiple_of_four

    multiple_of_four:
      push esi ; Can't know if caller needs this value preserved
      push edi ; Can't know if caller needs this value preserved
      mov esi,[ebp+src]
      mov edi,[ebp+dest]
      rep movsl
      pop edi
      pop esi
      ret

Quite a number of compilers will perform such optimizations with memcpy(). I don't know of any that will do it with memmove, although in some cases an optimized version of memcpy may offer the same semantics as memmove. For example, if numbytes were 20:

; Assuming values in eax, ebx, ecx, edx, esi, and edi are not needed
  mov esi,[src]
  mov eax,[esi]
  mov ebx,[esi+4]
  mov ecx,[esi+8]
  mov edx,[esi+12]
  mov edi,[esi+16]
  mov esi,[dest]
  mov [esi],eax
  mov [esi+4],ebx
  mov [esi+8],ecx
  mov [esi+12],edx
  mov [esi+16],edi

This will work correctly even if the address ranges overlap, since it effectively makes a copy (in registers) of the entire region to be moved before any of it is written. In theory, a compiler could process memmove() by seeing whether treating it as memcpy() would yield an implementation that is safe even if the address ranges overlap, and call _memmove in those cases where substituting the memcpy() implementation would not be safe. I don't know of any that do such optimization, though.
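The same load-everything-then-store idea can be written in C for a fixed 20-byte size. The helper move20 is hypothetical; an optimizing compiler may keep the temporary entirely in registers, much like the listing above, but that is up to the compiler.

```c
#include <assert.h>
#include <string.h>

/* Move exactly 20 bytes; safe even when dst and src overlap, because the
   whole source is read into tmp before anything is written to dst. */
void move20(void *dst, const void *src)
{
    unsigned char tmp[20];

    assert(dst != NULL && src != NULL);
    memcpy(tmp, src, sizeof tmp);   /* tmp is a local, so it cannot overlap src */
    memcpy(dst, tmp, sizeof tmp);   /* ...nor dst */
}
```

Each memcpy call here has non-overlapping arguments, so both are well defined, yet the function as a whole has memmove semantics for its fixed size.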
