Why is memcmp so much faster than a for loop check?

Why is memcmp(a, b, size) so much faster than:

for (i = 0; i < nelements; i++) {
    if (a[i] != b[i]) return 0;
}
return 1;

Is memcmp a CPU instruction or something? It must be pretty deep, because I got a massive speedup using memcmp over the loop.

memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
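To give a rough idea of where part of the speedup comes from, here is a minimal sketch in portable C that compares eight bytes per iteration instead of one. The function name equal_wordwise is hypothetical, and this is not how any particular library implements memcmp; real implementations typically go further with wider vector registers and alignment handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first n bytes of a and b are equal, 0 otherwise.
 * Hypothetical helper for illustration only. */
int equal_wordwise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Compare 8 bytes per iteration. Copying into locals with memcpy
     * avoids unaligned-access and strict-aliasing problems; compilers
     * typically turn each copy into a single load. */
    while (n >= sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, pa, sizeof wa);
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            return 0;
        pa += sizeof wa;
        pb += sizeof wb;
        n  -= sizeof wa;
    }

    /* Compare the remaining 0..7 bytes one at a time. */
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}
```

Like the loop in the question, this only answers "equal or not"; memcmp additionally reports which buffer is lexicographically greater.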

As a "builtin"

GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.

On x86, this uses the cmpsb instruction, which compares a string of bytes at one memory location to another. It is combined with the repe prefix, so the strings are compared until they are no longer equal or the count is exhausted (exactly what memcmp does).

Given the following code:

#include <string.h>

int test(const void* s1, const void* s2, int count)
{
    return memcmp(s1, s2, count) == 0;
}

gcc version 3.4.4 on Cygwin generates the following assembly:

; (prologue)
mov     esi, [ebp+arg_0]    ; Move first pointer to esi
mov     edi, [ebp+arg_4]    ; Move second pointer to edi
mov     ecx, [ebp+arg_8]    ; Move length to ecx

cld                         ; Clear DF, the direction flag, so comparisons happen
                            ; at increasing addresses
cmp     ecx, ecx            ; Set ZF up front: if the length is zero, repe cmpsb
                            ; compares nothing and ZF stays set (result: equal).
repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags;
                            ; repeat while ZF is set and ecx is nonzero
setz    al                  ; Set al (return value) to 1 if ZF is still set
                            ; (all bytes were equal).
; (epilogue) 
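For readers who do not read x86 assembly, the instruction sequence above can be modeled in plain C roughly as follows. This is only a sketch of the semantics: the function name model_repe_cmpsb is hypothetical, and the variable names mirror the registers and the zero flag purely for illustration.

```c
#include <stddef.h>

/* Models `cmp ecx, ecx; repe cmpsb; setz al` in plain C.
 * Returns 1 if the first count bytes are equal, 0 otherwise. */
int model_repe_cmpsb(const void *s1, const void *s2, size_t count)
{
    const unsigned char *esi = s1;
    const unsigned char *edi = s2;
    size_t ecx = count;
    int zf = 1;               /* cmp ecx, ecx: always sets ZF */

    /* repe cmpsb: while ecx != 0, compare one byte, advance both
     * pointers, decrement ecx, and stop early once ZF is cleared. */
    while (ecx != 0) {
        zf = (*esi == *edi);
        esi++;
        edi++;
        ecx--;
        if (!zf)
            break;
    }

    return zf;                /* setz al */
}
```

With count equal to zero the loop body never runs, so the function returns 1, which matches memcmp treating two zero-length ranges as equal.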


As a library function

Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.

In glibc, there are versions of memcmp for x86_64 that can take advantage of instruction set extensions such as SSE2, SSSE3, and SSE4.1.

The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:

ENTRY(memcmp)
    .type   memcmp, @gnu_indirect_function
    LOAD_RTLD_GLOBAL_RO_RDX
    HAS_CPU_FEATURE (SSSE3)
    jnz 2f
    leaq    __memcmp_sse2(%rip), %rax
    ret 

2:  HAS_CPU_FEATURE (SSE4_1)
    jz  3f  
    leaq    __memcmp_sse4_1(%rip), %rax
    ret 

3:  leaq    __memcmp_ssse3(%rip), %rax
    ret 

END(memcmp)
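The same selection logic can be sketched in C with a function pointer. This is an illustration of the idea, not glibc's actual mechanism (which resolves the symbol at load time through the indirect function shown above). The names my_memcmp, resolve_memcmp, and the cmp_* variants are hypothetical, and the variants are stand-ins that all forward to memcmp. Feature detection here assumes GCC or Clang on x86, using the __builtin_cpu_supports builtin.

```c
#include <stddef.h>
#include <string.h>

typedef int (*memcmp_fn)(const void *, const void *, size_t);

/* Stand-ins for separately optimized variants. In this sketch they all
 * forward to memcmp; a real library would provide distinct code paths. */
static int cmp_generic(const void *a, const void *b, size_t n) { return memcmp(a, b, n); }
static int cmp_ssse3(const void *a, const void *b, size_t n)   { return memcmp(a, b, n); }
static int cmp_sse4_1(const void *a, const void *b, size_t n)  { return memcmp(a, b, n); }

/* Pick a variant in the same order as the snippet above:
 * no SSSE3 -> baseline, SSE4.1 -> sse4_1, otherwise ssse3. */
static memcmp_fn resolve_memcmp(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (!__builtin_cpu_supports("ssse3"))
        return cmp_generic;
    if (__builtin_cpu_supports("sse4.1"))
        return cmp_sse4_1;
    return cmp_ssse3;
#else
    /* No feature detection available in this sketch. */
    (void)cmp_ssse3;
    (void)cmp_sse4_1;
    return cmp_generic;
#endif
}

/* Dispatching wrapper: the choice is made on the first call and cached.
 * Note: the lazy initialization is not synchronized across threads. */
int my_memcmp(const void *a, const void *b, size_t n)
{
    static memcmp_fn impl;   /* NULL until resolved */
    if (impl == NULL)
        impl = resolve_memcmp();
    return impl(a, b, n);
}
```

The indirect-function approach in glibc avoids even the pointer check on each call, because the dynamic linker binds the symbol to the chosen variant once.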

In the Linux kernel

Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that it uses the alternatives infrastructure (arch/x86/kernel/alternative.c) not only to decide at runtime which version to use, but to actually patch itself so that this decision is made only once, at boot-up.

Is memcmp a CPU instruction or something?

It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.

It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.

intrinsic memcmp

Yes, on Intel hardware, there's a single assembly instruction for such a loop. The runtime will use that. (I don't remember exactly; it was something like rep cmps[b|w], depending also on the data size.)
