Why '==' is slow on std::string?

Question

While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark and I was surprised that '==' is much slower than string::compare and strcmp! here is the code, can anyone explain why is that? or what's wrong with my code? because according to the standard '==' is just an operator overload and simply returnes !lhs.compare(rhs).

#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr  = 10000000000;//10 Billion
int len = 100;
int main() {
  srand(time(0));
  string s1(len,random()%128);
  string s2(len,random()%128);

uint64_t a = 0;
  Timer t;
  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(s1 == s2)
      a = i;
  }
  t.end();

  cout<<"==       took:"<<t.elapsedMillis()<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(s1.compare(s2)==0)
      a = i;
  }
  t.end();

  cout<<".compare took:"<<t.elapsedMillis()<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
    if(strcmp(s1.c_str(),s2.c_str()))
      a = i;
  }
  t.end();

  cout<<"strcmp   took:"<<t.elapsedMillis()<<endl;

  return a;
}

And here is the result:

==       took:5986.74
.compare took:0.000349
strcmp   took:0.000778

And my compile flags:

CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y

I use gcc 4.9 on a x86_64 linux machine.

Obviously using -o3 does some optimizations which I guess rolls out the last two loops totally; however, using -o2 still the results are weird:

for 1 billion iterations:

==       took:19591
.compare took:8318.01
strcmp   took:6480.35

PS Timer is just a wrapper class to measure spent time; I am absolutely sure about it :D

Code for Timer class:

#include <chrono>

#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_


class Timer {
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point stop;
public:
  Timer(){
    start = std::chrono::steady_clock::now();
    stop = std::chrono::steady_clock::now();
  }
  virtual ~Timer() {}

  inline void begin() {
    start = std::chrono::steady_clock::now();
  }

  inline void end() {
    stop = std::chrono::steady_clock::now();
  }

  inline double elapsedMillis() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::milli> (diff).count();
  }

  inline double elapsedMicro() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::micro> (diff).count();
  }

  inline double elapsedNano() {
    auto diff = stop - start;
    return  std::chrono::duration<double, std::nano> (diff).count();
  }

  inline double elapsedSec() {
    auto diff = stop - start;
    return std::chrono::duration<double> (diff).count();
  }
};

#endif /* SRC_TIMER_H_ */

Answer 1

UPDATE: output of improved benchmark at http://ideone.com/rGc36a

==       took:21
.compare took:21
strcmp   took:14
==       took:21
.compare took:25
strcmp   took:14

The thing that proved crucial to get it working meaningfully was "outwitting" the compiler's ability to predict the strings being compared at compile time:

// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };

if(s[i&3].compare(s[(i+1)&3])==0)  // trickier to optimise
  a += i;  // cumulative observable side effects

Note that in general, strcmp is not functionally equivalent to == or .compare when the text may embed NULs, as the former will get to "exit early". (That's not the reason it's "faster" above, but do read below for comments re possible variations with string length/content etc..)

Discussion / Earlier answer

Just have a look at your implementation - eg

echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less

Search for the basic_string template, then for the operator== working on two string instances - mine is:

template<class _Elem,
    class _Traits,
    class _Alloc> inline
    bool __cdecl operator==(
            const basic_string<_Elem, _Traits, _Alloc>& _Left,
            const basic_string<_Elem, _Traits, _Alloc>& _Right)
    {
    return (_Left.compare(_Right) == 0);
    }

Notice that operator== is inline and simply calls compare . There's no way it's consistently significantly slower with normal optimisation levels enabled, though the optimiser might occasionally happen to optimise one loop better than another due to subtle side effects of surrounding code.

Your ostensible problem will have been caused by eg your code being optimised beyond the point of doing the intended work, for loops arbitrarily unrolled to different degrees, or other quirks or bugs in the optimisation or your timing. That's not unusual when you have unvarying inputs and loops that don't have any cumulative side-effects (ie the compiler can work out that intermediate values of a are not used, so only the last a = i need take affect).

So, learn to write better benchmarks. In this case, that's a bit tricky as having lots of distinct strings in memory ready to invoke the comparisons on, and selecting them in a way that the optimiser can't predict at compile time that's still fast enough not to overwhelm and obscure the impact of the string comparison code, is not an easy task. Further, beyond a point - comparing things spread across more memory makes cache affects more relevant to the benchmark, which further obscures the real comparison performance.

Still, if I were you I'd read some strings from a file - pushing each to a vector , then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler can't possibly predict any pattern in the outcomes. You might find compare / == faster/slower than strcmp for strings often differing in the first character or three, but the other way around for long strings that are equal or only differing near the end, so make sure you try different kinds of input before you conclude you understand the performance profile.

Answer 2

Either your timings are screwy, or your compiler has optimised some of your code out of existence.

Think about it, ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make my calculations easier) means that you're performing twenty trillion operations per second.

Even if one operation could be done in a single clock cycle, that would be 20,000 GHz, a bit beyond the current crop of CPUs, even with their massively optimised pipelines and multiple cores.

And, given that the -O2 optimised figures are more on par with each other ( == taking about double the time of compare ), the "code optimised out of existence" possibility is looking far more likely.

The doubling of time could easily be explained as ten billion extra function calls, due to operator== needing to call compare to do its work.

As further support, examine the following table, showing figures in milliseconds (third column is simple divide-by-ten scale of second column so that both first and third columns are for a billion iterations):

         -O2/1billion  -O3/10billion  -O3/1billion  Improvement
               (a)            (b)     (c = b / 10)    (a / c)
         ============  =============  ============  ===========
oper==          19151           5987           599           32
compare          8319         0.0005       0.00005  166,380,000

It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.

I strongly suggest you have a look at the assembler code generated by your compiler (such as with the gcc -S option) to verify that it's actually doing that work it's claiming to do.

Answer 3

The problem is that the compiler is making a lot of serious optimizations to your code.

Here's the modified code:

#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr  = 500000000;//10 Billion
int len = 100;
int main() {
  srand(time(0));
  string s1(len,random()%128);
  string s2(len,random()%128);

uint64_t a = 0;
  Timer t;
  t.begin();
  for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
    if(s1 == s2)
      a += i;
  }
  t.end();

  cout<<"==       took:"<<t.elapsedMillis()<<",a="<<a<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
    if(s1.compare(s2)==0)
      a+=i;
  }
  t.end();

  cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;

  t.begin();
  for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
    if(strcmp(s1.c_str(),s2.c_str()) == 0)
      a+=i;
  }
  t.end();

  cout<<"strcmp   took:"<<t.elapsedMillis()<<",a="<<a<< endl;

  return a;
}

where I've added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison. I've also added <<",a="< to force the compiler to compute a.

The output is now:

==       took:10221.5,a=0
.compare took:10739,a=0
strcmp   took:9700,a=0

Can you explain why strcmp is faster than .compare which is slower than ==? however, the speed differences are marginal, but significant.

It actually makes sense! :p

Answer 4

The speed analysis below is wrong - thanks to Tony D for pointing out my error. The criticisms and advice for better benchmarks still apply though.

All the previous answers deal with the compiler optimisation issues in your benchmark, but don't answer why strcmp is still slightly faster.

strcmp is likely faster (in the corrected benchmarks) due to the strings sometimes containing zeros. Since strcmp uses C-strings it can exit when it comes across the string termination char '\\0' . std::string::compare() treats '\\0' as just another char and continues until the end of the string array.

Since you have non-deterministically seeded the RNG, and only generated two strings, your results will change with every run of the code. (I'd advise against this in benchmarks.) Given the numbers, 28 times out of 128, there ought to be no advantage. 10 times out of 128 you will get more than a 10-fold speed up. And so on.

As well as defeating the compiler's optimiser, I would suggest that, next time, you generate a new string for each comparison iteration, allowing you to average away such effects.

Answer 5

Compiled the code with gcc -O3 -S --std=c++1y . The result is here . gcc version is:

gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Look at it, we can be the first loop ( operator == ) is like this: (comment is added by me)

    movq    itr(%rip), %rbp
    movq    %rax, %r12
    movq    %rax, 56(%rsp)
    testq   %rbp, %rbp
    je  .L25
    movq    16(%rsp), %rdi
    movq    32(%rsp), %rsi
    xorl    %ebx, %ebx
    movq    -24(%rsi), %rdx  ; length of string1
    cmpq    -24(%rdi), %rdx  ; compare lengths
    je  .L53                 ; compare content only when length is the same
.L10
   ; end of loop, print out follows

;....
.L53:
    .cfi_restore_state
    call    memcmp      ; compare content
    xorl    %edx, %edx  ; zero loop count
    .p2align 4,,10
    .p2align 3
.L13:
    testl   %eax, %eax  ; check result
    cmove   %rdx, %rbx  ; a = i
    addq    $1, %rdx    ; i++
    cmpq    %rbp, %rdx  ; i < itr?
    jne .L13
    jmp .L10    

; ....
.L25:
    xorl    %ebx, %ebx
    jmp .L10

We can see that operator == is inline, only a call to memcmp is there. And for operator == , if the length is different, the content is not compared.

Most importantly, compare is done only once . The loop content only contains i++; , a=i; , i<itr; .

For the second loop ( compare() ):

    movq    itr(%rip), %r12
    movq    %rax, %r13
    movq    %rax, 56(%rsp)
    testq   %r12, %r12
    je  .L14
    movq    16(%rsp), %rdi
    movq    32(%rsp), %rsi
    movq    -24(%rdi), %rbp
    movq    -24(%rsi), %r14  ; read and compare length
    movq    %rbp, %rdx
    cmpq    %rbp, %r14
    cmovbe  %r14, %rdx       ; save the shorter length of the two string to %rdx
    subq    %r14, %rbp       ; length difference in %rbp
    call    memcmp           ; content is always compared
    movl    $2147483648, %edx ; 0x80000000 sign extended
    addq    %rbp, %rdx       ; revert the sign bit of %rbp (length difference) and save to %rdx
    testl   %eax, %eax       ; memcmp returned 0?
    jne .L14                 ; no, string different
    testl   %ebp, %ebp       ; memcmp returned 0. Are lengths the same (%ebp == 0)?
    jne .L14                 ; no, string different
    movl    $4294967295, %eax ; string compare equal
    subq    $1, %r12         ; itr - 1
    cmpq    %rax, %rdx
    cmovbe  %r12, %rbx       ; a = itr - 1
.L14:
    ; output follows

There no loop at all here.

In compare() , as it should return plus, minus, or zero based on the comparison, string content is always compared. memcmp called once.

For the third loop ( strcmp() ), the assembly is the most simple:

    movq    itr(%rip), %rbp   ; itr to %rbp
    movq    %rax, %r12
    movq    %rax, 56(%rsp)
    testq   %rbp, %rbp
    je  .L16
    movq    32(%rsp), %rsi
    movq    16(%rsp), %rdi
    subq    $1, %rbp       ; itr - 1 to %rbp
    call    strcmp
    testl   %eax, %eax     ; test compare result
    cmovne  %rbp, %rbx     ; if not equal, save itr - 1 to %rbx (a)
.L16:

These also no loop at all. strcmp is called, and if the strings are not equal (as in your code), save itr-1 to a directly.

So your benchmark cannot test the running time for operator == , compare() or strcmp() . The are all called only once, not able to show the running time difference.

As to why operator == takes the most time, it is because for operator== , the compiler for some reason did not eliminate the loop. The loop takes time (but the loop does not contain string comparison at all).

And from the assembly shown, we may assume that operator == may be fastest, because it won't do string comparison at all if the length of the two strings are different. (Of course, under gcc4.9.1 -O3)

Why '==' is slow on std::string?

Question

5 answers

solution1
13 2015-02-26 05:52:30

solution2
12 2015-02-26 05:08:21

solution3
8 2015-02-26 06:00:29

solution4
8 2015-02-26 11:06:20

solution5
1 2015-03-02 04:17:47

Why '==' is slow on std::string?

Question

5 answers

solution1 13 2015-02-26 05:52:30

solution2 12 2015-02-26 05:08:21

solution3 8 2015-02-26 06:00:29

solution4 8 2015-02-26 11:06:20

solution5 1 2015-03-02 04:17:47

solution1
13 2015-02-26 05:52:30

solution2
12 2015-02-26 05:08:21

solution3
8 2015-02-26 06:00:29

solution4
8 2015-02-26 11:06:20

solution5
1 2015-03-02 04:17:47