Why is memcmp so much faster than a for loop check?
Why is memcmp(a, b, size) so much faster than:
for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;
Is memcmp a CPU instruction or something? It must be pretty deep, because I got a massive speedup using memcmp over the loop.
memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions and configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions that act as an optimized inline version of the function.
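To see the builtin behaviour for yourself, something like the following sketch may help. The function names equal_libcall and equal_builtin are made up for illustration. Whether GCC actually expands either call inline depends on the version, target and optimization level, so treat this as something to inspect, not a guarantee.

```c
#include <stddef.h>
#include <string.h>

/* Plain memcmp call. GCC may recognize it as __builtin_memcmp and
   expand it inline, or it may emit a call to the library function. */
int equal_libcall(const void *s1, const void *s2, size_t count)
{
    return memcmp(s1, s2, count) == 0;
}

/* The same check with the builtin spelled out explicitly
   (a GCC/Clang extension, not standard C). */
int equal_builtin(const void *s1, const void *s2, size_t count)
{
    return __builtin_memcmp(s1, s2, count) == 0;
}
```

Compiling this with gcc -O2 -S, once as is and once with -fno-builtin-memcmp added, and comparing the two assembly listings should show whether your GCC inlines the comparison or calls the library.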
On x86, this uses the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal or a count is exhausted (exactly what memcmp does).
Given the following code:
#include <string.h>

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4 on Cygwin generates the following assembly:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Set ZF so that a zero length compares as equal:
; with ecx == 0, repe cmpsb compares no bytes and leaves ZF set
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat while ZF is set (bytes equal) and ecx is nonzero
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Highly optimized versions of memcmp exist in many C standard libraries. These usually take advantage of architecture-specific instructions to work on lots of data in parallel.
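As a rough illustration of working on several bytes at once, here is a portable sketch that compares 8 bytes per iteration instead of 1. It is not how any particular library implements memcmp (real versions use SIMD registers and handle alignment carefully), it only reports equality instead of ordering, and the name equal_wordwise is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first n bytes of a and b are equal, 0 otherwise. */
int equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Compare 8 bytes per iteration. Copying into locals with memcpy
       avoids unaligned-access and aliasing problems; compilers
       typically turn each copy into a single load. */
    while (n >= sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa, sizeof wa);
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            return 0;
        pa += sizeof wa;
        pb += sizeof wb;
        n -= sizeof wa;
    }

    /* Compare the remaining 0..7 bytes one at a time. */
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}
```

The SSE variants apply the same idea with 16-byte registers, so each comparison instruction covers even more data.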
In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:
SSE2: sysdeps/x86_64/memcmp.S
SSE4: sysdeps/x86_64/multiarch/memcmp-sse4.S
SSSE3: sysdeps/x86_64/multiarch/memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:
ENTRY(memcmp)
.type memcmp, @gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
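The same select-once-at-run-time idea can be sketched in plain C with a function pointer. This is only an analogy to the indirect-function mechanism above, not glibc's code. The names memcmp_generic, memcmp_sse4_1, pick_memcmp and dispatched_memcmp are hypothetical, and both "variants" simply call memcmp as stand-ins for tuned implementations. The sketch assumes GCC or Clang for __builtin_cpu_supports, and falls back to the generic version elsewhere.

```c
#include <stddef.h>
#include <string.h>

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Hypothetical stand-ins for tuned variants; both just call memcmp. */
static int memcmp_generic(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n);
}

static int memcmp_sse4_1(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n);
}

/* Selected implementation; NULL until the first call. */
static memcmp_fn chosen;

/* Pick a variant based on what the running CPU supports. */
static memcmp_fn pick_memcmp(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        return memcmp_sse4_1;
#endif
    return memcmp_generic;
}

/* Resolve once on first use, then always call through the pointer.
   Note: this lazy initialization is not thread-safe as written. */
int dispatched_memcmp(const void *a, const void *b, size_t n)
{
    if (!chosen)
        chosen = pick_memcmp();
    return chosen(a, b, n);
}
```

With @gnu_indirect_function the dynamic linker does the equivalent of pick_memcmp when it resolves the symbol, so later calls go straight to the selected version without the extra check.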
The Linux kernel does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) not only to decide at runtime which version to use, but to actually patch itself so that this decision is made only once, at boot-up.
Is memcmp a CPU instruction or something?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.
It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
Yes, on Intel hardware, there's a single assembly instruction for such a loop. The runtime will use that. (I don't remember exactly; it was something like rep cmps[b|w], depending also on the data size.)