為什么這個版本的strcmp更慢？

Question

我一直在嘗試在某些條件下提高strcmp性能。 但是，遺憾的是，我甚至無法實現普通的vanilla strcmp以及庫的實現。

我看到了一個類似的問題，但答案說差異來自編譯器優化掉字符串文字的比較。 我的測試不使用字符串文字。

這是實現（ comparisons.cpp ）

int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
}

這是測試驅動程序（ driver.cpp ）：

#include "comparisons.h"

#include <array>
#include <chrono>
#include <iostream>

void init_string(char* str, int nChars) {
    // 10% of strings will be equal, and 90% of strings will have one char different.
    // This way, many strings will share long prefixes so strcmp has to exercise a bit.
    // Using random strings still shows the custom implementation as slower (just less so).
    str[nChars - 1] = '\0';
    for (int i = 0; i < nChars - 1; i++)
        str[i] = (i % 94) + 32;

    if (rand() % 10 != 0)
        str[rand() % (nChars - 1)] = 'x';
}

int main(int argc, char** argv) {
    srand(1234);

    // Pre-generate some strings to compare.
    const int kSampleSize = 100;
    std::array<char[1024], kSampleSize> strings;
    for (int i = 0; i < kSampleSize; i++)
        init_string(strings[i], kSampleSize);

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp(strings[i], strings[j]);

    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp        - " << (end - start).count() << std::endl;

    start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < kSampleSize; i++)
        for (int j = 0; j < kSampleSize; j++)
            strcmp_custom(strings[i], strings[j]);

    end = std::chrono::high_resolution_clock::now();
    std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}

我的makefile：

CC=clang++

test: driver.o comparisons.o
    $(CC) -o test driver.o comparisons.o

# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
    $(CC) -c -o driver.o -std=c++11 -O0 driver.cpp

# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
    $(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp

clean:
    rm comparisons.o driver.o test

根據這個答案的建議，我在一個單獨的編譯單元中編譯了我的比較函數並進行了優化，並在關閉優化的情況下編譯了驅動程序，但我仍然得到了大約5倍的減速。

strcmp        - 154519
strcmp_custom - 506282

我也試過復制FreeBSD實現但得到了類似的結果。

我想知道我的表現測量是否忽略了某些東西。 或者標准庫實現做得更漂亮？

Answer 1

我不知道你有哪個標准庫，但只是為了讓你了解C庫維護者對優化字符串原語有多么認真，GNU libc在x86-64上使用的默認 strcmp是兩千手的手工優化匯編語言，版本2.24。 當SSSE3和SSE4.2指令集擴展可用時，有單獨的，也是手動優化的版本。 （該文件的復雜程度似乎是因為相同的源代碼用於生成其他幾個函數;機器代碼最終只是“1120”指令。）2.24大約一年前發布，甚至更多從那時起，工作就開始了。

他們去到這么多的麻煩，因為它很常見的字符串原語的一個是在一個配置文件中的一個最熱門的功能。

Answer 2

我的反匯編glibc v2.2.5，x86_64 linux的摘錄：

0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
   89cd0:   48 8b 15 99 a1 33 00    mov    0x33a199(%rip),%rdx        # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
   89cd7:   48 8d 05 92 58 01 00    lea    0x15892(%rip),%rax        # 9f570 <strerror_l@@GLIBC_2.6+0x200>
   89cde:   f7 82 b0 00 00 00 10    testl  $0x10,0xb0(%rdx)
   89ce5:   00 00 00 
   89ce8:   75 1a                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
   89cea:   48 8d 05 9f 48 0c 00    lea    0xc489f(%rip),%rax        # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
   89cf1:   f7 82 80 00 00 00 00    testl  $0x200,0x80(%rdx)
   89cf8:   02 00 00 
   89cfb:   75 07                   jne    89d04 <strcmp@@GLIBC_2.2.5+0x34>
   89cfd:   48 8d 05 0c 00 00 00    lea    0xc(%rip),%rax        # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
   89d04:   c3                      retq
   89d05:   90                      nop
   89d06:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   89d0d:   00 00 00 
   89d10:   89 f1                   mov    %esi,%ecx
   89d12:   89 f8                   mov    %edi,%eax
   89d14:   48 83 e1 3f             and    $0x3f,%rcx
   89d18:   48 83 e0 3f             and    $0x3f,%rax
   89d1c:   83 f9 30                cmp    $0x30,%ecx
   89d1f:   77 3f                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
   89d21:   83 f8 30                cmp    $0x30,%eax
   89d24:   77 3a                   ja     89d60 <strcmp@@GLIBC_2.2.5+0x90>
   89d26:   66 0f 12 0f             movlpd (%rdi),%xmm1
   89d2a:   66 0f 12 16             movlpd (%rsi),%xmm2
   89d2e:   66 0f 16 4f 08          movhpd 0x8(%rdi),%xmm1
   89d33:   66 0f 16 56 08          movhpd 0x8(%rsi),%xmm2
   89d38:   66 0f ef c0             pxor   %xmm0,%xmm0
   89d3c:   66 0f 74 c1             pcmpeqb %xmm1,%xmm0
   89d40:   66 0f 74 ca             pcmpeqb %xmm2,%xmm1
   89d44:   66 0f f8 c8             psubb  %xmm0,%xmm1
   89d48:   66 0f d7 d1             pmovmskb %xmm1,%edx
   89d4c:   81 ea ff ff 00 00       sub    $0xffff,%edx
...

真實的是1183行匯編，在檢測系統功能和矢量化指令方面有很多潛在的聰明才智。 libc維護者知道他們可以通過優化應用程序數千次調用的一些函數來獲得優勢。

為了比較，你的版本在-O3 ：

comparisons.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
   0:   8a 0e                   mov    (%rsi),%cl
   2:   8a 07                   mov    (%rdi),%al
   4:   38 c1                   cmp    %al,%cl
   6:   75 1e                   jne    26 <_Z13strcmp_customPKcS0_+0x26>
        if (*a == '\0') return 0;
   8:   48 ff c6                inc    %rsi
   b:   48 ff c7                inc    %rdi
   e:   66 90                   xchg   %ax,%ax
  10:   31 c0                   xor    %eax,%eax
  12:   84 c9                   test   %cl,%cl
  14:   74 18                   je     2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
    while (*b == *a) {
  16:   0f b6 0e                movzbl (%rsi),%ecx
  19:   0f b6 07                movzbl (%rdi),%eax
  1c:   48 ff c6                inc    %rsi
  1f:   48 ff c7                inc    %rdi
  22:   38 c1                   cmp    %al,%cl
  24:   74 ea                   je     10 <_Z13strcmp_customPKcS0_+0x10>
  26:   0f be d0                movsbl %al,%edx
  29:   0f be c1                movsbl %cl,%eax
        if (*a == '\0') return 0;
        a++;
        b++;
    }
    return *b - *a;
  2c:   29 d0                   sub    %edx,%eax
}
  2e:   c3                      retq

為什么這個版本的strcmp更慢？

問題描述

2 個解決方案

解決方案1
37 已采納 2017-07-02 20:11:37

解決方案2
8 2017-07-02 20:22:25

為什么這個版本的strcmp更慢？

問題描述

2 個解決方案

解決方案1 37 已采納 2017-07-02 20:11:37

解決方案2 8 2017-07-02 20:22:25

解決方案1
37 已采納 2017-07-02 20:11:37

解決方案2
8 2017-07-02 20:22:25