Why is '==' slow on std::string?

Question

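While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark, and I was surprised to find that '==' is much slower than string::compare and strcmp! Here is the code; can anyone explain why this is? Or is something wrong with my code? According to the standard, '==' is just an operator overload that simply returns !lhs.compare(rhs).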

#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr  = 10000000000;//10 Billion
int len = 100;
int main() {
  srand(time(0));
  string s1(len,random()%128);
  string s2(len,random()%128);

uint64_t a = 0;
  Timer t;
  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(s1 == s2)
      a = i;
  }
  t.end();

  cout<<"==       took:"<<t.elapsedMillis()<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(s1.compare(s2)==0)
      a = i;
  }
  t.end();

  cout<<".compare took:"<<t.elapsedMillis()<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(strcmp(s1.c_str(),s2.c_str()))
      a = i;
  }
  t.end();

  cout<<"strcmp   took:"<<t.elapsedMillis()<<endl;

  return a;
}

Here are the results:

==       took:5986.74
.compare took:0.000349
strcmp   took:0.000778

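My compile flags:

CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y

I am using gcc 4.9 on an x86_64 Linux machine.

Obviously -O3 performs some optimizations which, I guess, eliminate the last two loops entirely; however, with -O2 the result is still weird:

For 1 billion iterations: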

==       took:19591
.compare took:8318.01
strcmp   took:6480.35

P.S. Timer is just a wrapper class that measures elapsed time; I am absolutely sure about it :D

Code for the Timer class:

#include <chrono>

#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_


class Timer {
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point stop;
public:
  Timer(){
    start = std::chrono::steady_clock::now();
    stop = std::chrono::steady_clock::now();
  }
  virtual ~Timer() {}

  inline void begin() {
    start = std::chrono::steady_clock::now();
  }

  inline void end() {
    stop = std::chrono::steady_clock::now();
  }

  inline double elapsedMillis() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::milli> (diff).count();
  }

  inline double elapsedMicro() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::micro> (diff).count();
  }

  inline double elapsedNano() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::nano> (diff).count();
  }

  inline double elapsedSec() {
    auto diff = stop - start;
    return std::chrono::duration<double> (diff).count();
  }
};

#endif /* SRC_TIMER_H_ */

Answer 1

UPDATE: output from the improved benchmark at http://ideone.com/rGc36a

==       took:21
.compare took:21
strcmp   took:14
==       took:21
.compare took:25
strcmp   took:14

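It turns out that what is crucial to making the benchmark work meaningfully is "outwitting" the compiler's ability to predict, at compile time, which strings are being compared: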

// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };

if(s[i&3].compare(s[(i+1)&3])==0)  // trickier to optimise
  a += i;  // cumulative observable side effects

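Note that, in general, strcmp is not functionally equivalent to == or .compare when the text may contain embedded NULs, because strcmp will "exit early" at the first NUL. (That is not the reason it is "faster" above, but do read the comments below about possible variations with string length, content and so on.)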

Discussion / earlier answer

Just have a look at your implementation, e.g.:

echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less

Search for the basic_string template, then for the operator== that works on two string instances. Mine is:

template<class _Elem,
    class _Traits,
    class _Alloc> inline
    bool __cdecl operator==(
            const basic_string<_Elem, _Traits, _Alloc>& _Left,
            const basic_string<_Elem, _Traits, _Alloc>& _Right)
    {
    return (_Left.compare(_Right) == 0);
    }

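Notice that operator== is inline and simply calls compare. There is no way it is consistently and significantly slower with any normal level of optimisation enabled, though the optimiser might occasionally happen to optimise one loop better than another because of subtle side effects of the surrounding code.

Your ostensible problem is most likely caused by, e.g., your code being optimised beyond the point of doing the intended work, for loops being arbitrarily unrolled to differing degrees, or other quirks or bugs in the optimisation or in your timing. That is not unusual when you have unvarying inputs and loops with no cumulative side effects (i.e. the compiler can work out that the intermediate values of a are never used, so only the last a = i needs to take effect).

So, learn to write better benchmarks. In this case that is a bit tricky: you need lots of distinct strings in memory ready to be compared, and you need to select them in a way that the optimiser cannot predict at compile time, yet cheaply enough not to overwhelm and obscure the cost of the string comparison code itself. Further, beyond a certain point, comparing things spread across more memory makes cache effects more relevant to the benchmark, which further obscures the real comparison performance.

Still, if I were you I would read some strings from a file, push each one to a vector, and then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler cannot possibly predict any pattern in the outcomes. You might find compare/== faster or slower than strcmp for strings that often differ in the first character or three, but the other way around for long strings that are equal or differ only near the end, so make sure you try different kinds of input before you conclude that you understand the performance profile.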

Answer 2

Either your timing is screwy, or your compiler has optimised some of your code out of existence.

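Think about it: ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make the calculation easier) means you are performing 2 x 10^16 operations per second, i.e. twenty quadrillion.

Even if one operation could be done in a single clock cycle, that would be a clock rate of 20,000,000 GHz, rather beyond the current crop of CPUs, even with heavily optimised pipelines and multiple cores.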

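And, given that the -O2 figures are much more comparable to each other (== taking about twice as long as compare), the "code optimised out of existence" possibility looks far more likely.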

The doubling of time could easily be explained by a billion extra function calls, since operator== needs to call compare to do its work.

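As further support, examine the following table, which shows figures in milliseconds (the third column is simply the second divided by ten, so that both the first and third columns are for a billion iterations):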

         -O2/1billion  -O3/10billion  -O3/1billion  Improvement
               (a)            (b)     (c = b / 10)    (a / c)
         ============  =============  ============  ===========
oper==          19151           5987           599           32
compare          8319         0.0005       0.00005  166,380,000

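It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.

I strongly suggest you look at the assembler code generated by your compiler (for example with the gcc -S option) to verify that it is actually doing the work it claims to be doing.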

Answer 3

The problem is that the compiler is applying a lot of aggressive optimizations to your code.

Here is the modified code:

#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr  = 500000000;//500 Million
int len = 100;
int main() {
  srand(time(0));
  string s1(len,random()%128);
  string s2(len,random()%128);

uint64_t a = 0;
  Timer t;
  t.begin();
  for(uint64_t i =0;i<itr;i++){
    asm volatile("" : "+g"(s2));
    if(s1 == s2)
      a += i;
  }
  t.end();

  cout<<"==       took:"<<t.elapsedMillis()<<",a="<<a<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    asm volatile("" : "+g"(s2));
    if(s1.compare(s2)==0)
      a+=i;
  }
  t.end();

  cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    asm volatile("" : "+g"(s2));
    if(strcmp(s1.c_str(),s2.c_str()) == 0)
      a+=i;
  }
  t.end();

  cout<<"strcmp   took:"<<t.elapsedMillis()<<",a="<<a<< endl;

  return a;
}

I added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison on every iteration. I also added <<",a="<<a to force the compiler to compute a.

The output is now:

==       took:10221.5,a=0
.compare took:10739,a=0
strcmp   took:9700,a=0

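I can't explain exactly why strcmp comes out faster than .compare, which in turn is slower than ==. The speed differences are small, though still noticeable.

It actually makes sense now! :p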

Answer 4

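The speed analysis below is wrong; thanks to Tony D for pointing out my mistake. The criticisms of the benchmark and the advice for writing better ones still apply, though.

All the previous answers deal with the compiler optimisation problems in your benchmark, but they don't answer why strcmp is still slightly faster.

strcmp is likely faster (in the corrected benchmarks) because the strings sometimes contain zeros. Since strcmp works on C strings, it can exit as soon as it reaches the string termination character '\0'. std::string::compare() treats '\0' as just another char and continues to the end of the string array.

Since you have seeded the RNG non-deterministically and generated only two strings, your results will change with every run of the code. (I would advise against this in benchmarks.) Given the numbers, 28 times out of 128 there should be no advantage. 10 times out of 128 you will get more than a 10-fold speed-up. And so on.

As well as defeating the compiler's optimiser, I would suggest that next time you generate a new string for each comparison iteration, so that such effects average out.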

Answer 5

I compiled the code with gcc -O3 -S --std=c++1y. The result is here. The gcc version is:

gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Looking at it, we can see that the first loop (operator==) looks like this (comments added by me):

    movq    itr(%rip), %rbp
    movq    %rax, %r12
    movq    %rax, 56(%rsp)
    testq   %rbp, %rbp
    je  .L25
    movq    16(%rsp), %rdi
    movq    32(%rsp), %rsi
    xorl    %ebx, %ebx
    movq    -24(%rsi), %rdx  ; length of string1
    cmpq    -24(%rdi), %rdx  ; compare lengths
    je  .L53                 ; compare content only when length is the same
.L10:
   ; end of loop, print out follows

;....
.L53:
    .cfi_restore_state
    call    memcmp      ; compare content
    xorl    %edx, %edx  ; zero loop count
    .p2align 4,,10
    .p2align 3
.L13:
    testl   %eax, %eax  ; check result
    cmove   %rdx, %rbx  ; a = i
    addq    $1, %rdx    ; i++
    cmpq    %rbp, %rdx  ; i < itr?
    jne .L13
    jmp .L10    

; ....
.L25:
    xorl    %ebx, %ebx
    jmp .L10

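We can see that operator== has been inlined; only the call to memcmp remains. For operator==, the content is not compared at all if the lengths differ.

Most importantly, the comparison is done only once. The loop body contains only i++;, a=i; and i<itr;.

For the second loop (compare()):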

    movq    itr(%rip), %r12
    movq    %rax, %r13
    movq    %rax, 56(%rsp)
    testq   %r12, %r12
    je  .L14
    movq    16(%rsp), %rdi
    movq    32(%rsp), %rsi
    movq    -24(%rdi), %rbp
    movq    -24(%rsi), %r14  ; read and compare length
    movq    %rbp, %rdx
    cmpq    %rbp, %r14
    cmovbe  %r14, %rdx       ; save the shorter length of the two string to %rdx
    subq    %r14, %rbp       ; length difference in %rbp
    call    memcmp           ; content is always compared
    movl    $2147483648, %edx ; 0x80000000 sign extended
    addq    %rbp, %rdx       ; revert the sign bit of %rbp (length difference) and save to %rdx
    testl   %eax, %eax       ; memcmp returned 0?
    jne .L14                 ; no, string different
    testl   %ebp, %ebp       ; memcmp returned 0. Are lengths the same (%ebp == 0)?
    jne .L14                 ; no, string different
    movl    $4294967295, %eax ; string compare equal
    subq    $1, %r12         ; itr - 1
    cmpq    %rax, %rdx
    cmovbe  %r12, %rbx       ; a = itr - 1
.L14:
    ; output follows

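There is no loop here at all.

In compare(), because it has to return a positive value, a negative value or zero depending on the comparison, the string content always has to be compared. memcmp is called once.

For the third loop (strcmp()), the assembly is the simplest: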

    movq    itr(%rip), %rbp   ; itr to %rbp
    movq    %rax, %r12
    movq    %rax, 56(%rsp)
    testq   %rbp, %rbp
    je  .L16
    movq    32(%rsp), %rsi
    movq    16(%rsp), %rdi
    subq    $1, %rbp       ; itr - 1 to %rbp
    call    strcmp
    testl   %eax, %eax     ; test compare result
    cmovne  %rbp, %rbx     ; if not equal, save itr - 1 to %rbx (a)
.L16:

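There is no loop here either. strcmp is called and, if the strings are not equal (as written in your code), itr-1 is saved directly to a.

So your benchmark cannot measure the running time of operator==, compare() or strcmp(). Each of them is called only once, so no difference in running time can show up.

As for why operator== takes the most time: for operator== the compiler, for some reason, did not eliminate the loop. The loop takes time, but it contains no string comparison at all.

From the assembly shown, we may assume that operator== could be the fastest, because it does no content comparison at all if the two strings differ in length. (Under gcc 4.9.1 -O3, of course.)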


Answer 1: 13 2015-02-26 05:52:30
Answer 2: 12 2015-02-26 05:08:21
Answer 3: 8 2015-02-26 06:00:29
Answer 4: 8 2015-02-26 11:06:20
Answer 5: 1 2015-03-02 04:17:47
