While profiling my application I realized that a lot of time is spent on string comparisons. So I wrote a simple benchmark, and I was surprised that '==' is much slower than string::compare and strcmp! Here is the code. Can anyone explain why that is, or what's wrong with my code? According to the standard, '==' is just an operator overload that simply returns !lhs.compare(rhs).
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 10000000000;//10 Billion
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1 == s2)
a = i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(s1.compare(s2)==0)
a = i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
if(strcmp(s1.c_str(),s2.c_str()))
a = i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<endl;
return a;
}
And here is the result:
== took:5986.74
.compare took:0.000349
strcmp took:0.000778
And my compile flags:
CXXFLAGS = -O3 -Wall -fmessage-length=0 -std=c++1y
I use gcc 4.9 on a x86_64 linux machine.
Obviously -O3 does some optimizations which, I guess, eliminate the last two loops entirely; however, with -O2 the results are still weird:
for 1 billion iterations:
== took:19591
.compare took:8318.01
strcmp took:6480.35
PS Timer is just a wrapper class to measure spent time; I am absolutely sure about it :D
Code for Timer class:
#include <chrono>
#ifndef SRC_TIMER_H_
#define SRC_TIMER_H_
class Timer {
std::chrono::steady_clock::time_point start;
std::chrono::steady_clock::time_point stop;
public:
Timer(){
start = std::chrono::steady_clock::now();
stop = std::chrono::steady_clock::now();
}
virtual ~Timer() {}
inline void begin() {
start = std::chrono::steady_clock::now();
}
inline void end() {
stop = std::chrono::steady_clock::now();
}
inline double elapsedMillis() {
auto diff = stop - start;
return std::chrono::duration<double, std::milli> (diff).count();
}
inline double elapsedMicro() {
auto diff = stop - start;
return std::chrono::duration<double, std::micro> (diff).count();
}
inline double elapsedNano() {
auto diff = stop - start;
return std::chrono::duration<double, std::nano> (diff).count();
}
inline double elapsedSec() {
auto diff = stop - start;
return std::chrono::duration<double> (diff).count();
}
};
#endif /* SRC_TIMER_H_ */
UPDATE: output of improved benchmark at http://ideone.com/rGc36a
== took:21
.compare took:21
strcmp took:14
== took:21
.compare took:25
strcmp took:14
The thing that proved crucial to get it working meaningfully was "outwitting" the compiler's ability to predict the strings being compared at compile time:
// more strings that might be used...
string s[] = { {len,argc+'A'}, {len,argc+'A'}, {len, argc+'B'}, {len, argc+'B'} };
if(s[i&3].compare(s[(i+1)&3])==0) // trickier to optimise
a += i; // cumulative observable side effects
Note that in general, strcmp is not functionally equivalent to == or .compare when the text may embed NULs, as the former will get to "exit early". (That's not the reason it's "faster" above, but do read below for comments re possible variations with string length/content etc.)
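To make the embedded-NUL difference concrete, here is a minimal sketch. The helper names (equal_as_string, equal_as_c_string, with_embedded_nul) are made up for illustration and are not part of any library:

```cpp
#include <cstring>
#include <string>

// Equality as operator== sees it: the lengths and every byte must match.
bool equal_as_string(const std::string& a, const std::string& b) {
    return a == b;
}

// Equality as strcmp sees it: the comparison stops at the first '\0'.
bool equal_as_c_string(const std::string& a, const std::string& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}

// Hypothetical helper: builds the 4-character string 'a' 'b' '\0' tail,
// where the NUL is part of the content rather than a terminator.
std::string with_embedded_nul(char tail) {
    std::string s = "ab";
    s.push_back('\0');
    s.push_back(tail);
    return s;
}
```

For with_embedded_nul('c') and with_embedded_nul('d'), equal_as_string reports a difference while equal_as_c_string reports equality, because strcmp never looks past the NUL.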
Discussion / Earlier answer
Just have a look at your implementation - eg
echo '#include <string>' > stringE.cc
g++ -E stringE.cc | less
Search for the basic_string template, then for the operator== working on two string instances - mine is:
template<class _Elem,
class _Traits,
class _Alloc> inline
bool __cdecl operator==(
const basic_string<_Elem, _Traits, _Alloc>& _Left,
const basic_string<_Elem, _Traits, _Alloc>& _Right)
{
return (_Left.compare(_Right) == 0);
}
Notice that operator== is inline and simply calls compare. There's no way it's consistently significantly slower with normal optimisation levels enabled, though the optimiser might occasionally happen to optimise one loop better than another due to subtle side effects of surrounding code.
Your ostensible problem will have been caused by e.g. your code being optimised beyond the point of doing the intended work, for loops arbitrarily unrolled to different degrees, or other quirks or bugs in the optimisation or your timing. That's not unusual when you have unvarying inputs and loops that don't have any cumulative side effects (i.e. the compiler can work out that intermediate values of a are not used, so only the last a = i need take effect).
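As a rough illustration of that point, the sketch below puts the question's loop next to a hand-written function with the same observable result. The second function is only what an optimiser is permitted to reduce the loop to under the as-if rule, not a claim about what any particular compiler emits; both function names are made up.

```cpp
#include <cstdint>
#include <string>

// The loop as written in the question: the inputs never change and only the
// final value of `a` is ever observed.
std::uint64_t loop_as_written(const std::string& s1, const std::string& s2,
                              std::uint64_t itr) {
    std::uint64_t a = 0;
    for (std::uint64_t i = 0; i < itr; ++i) {
        if (s1 == s2)
            a = i;
    }
    return a;
}

// Same observable result with a single comparison and no loop: if the
// strings are equal, the last assignment wins, so a == itr - 1.
std::uint64_t loop_as_if(const std::string& s1, const std::string& s2,
                         std::uint64_t itr) {
    if (itr == 0 || !(s1 == s2))
        return 0;
    return itr - 1;
}
```

Changing a = i to a += i makes every iteration contribute to the result, but the comparison itself is still loop-invariant, so the inputs need to vary as well.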
So, learn to write better benchmarks. In this case, that's a bit tricky: having lots of distinct strings in memory ready to invoke the comparisons on, and selecting them in a way that the optimiser can't predict at compile time yet that's still fast enough not to overwhelm and obscure the impact of the string comparison code, is not an easy task. Further, beyond a point, comparing things spread across more memory makes cache effects more relevant to the benchmark, which further obscures the real comparison performance.
Still, if I were you I'd read some strings from a file, pushing each to a vector, then loop over the vector doing each of the three comparison operations between adjacent elements. Then the compiler can't possibly predict any pattern in the outcomes. You might find compare / == faster/slower than strcmp for strings often differing in the first character or three, but the other way around for long strings that are equal or only differing near the end, so make sure you try different kinds of input before you conclude you understand the performance profile.
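A minimal sketch of that adjacent-element approach follows. As an assumption, it generates the strings at run time from a seeded RNG instead of reading a file, and every name in it (make_inputs, count_adjacent_matches, time_ms, by_operator, by_compare, by_strcmp) is hypothetical:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Builds `count` strings of length `len` over 'a'..'z'. They share a random
// prefix and vary only in the last character, i.e. the "long strings that
// differ near the end" case; other input shapes should be tried as well.
std::vector<std::string> make_inputs(std::size_t count, std::size_t len,
                                     unsigned seed) {
    std::mt19937 gen(seed);  // fixed seed keeps runs repeatable
    std::uniform_int_distribution<int> letter('a', 'z');
    std::string base(len, 'a');
    for (auto& c : base) c = static_cast<char>(letter(gen));
    std::vector<std::string> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::string s = base;
        if (len > 0) s.back() = static_cast<char>(letter(gen));
        out.push_back(std::move(s));
    }
    return out;
}

// Applies `equal` to every adjacent pair and returns the number of matches,
// so each comparison feeds a cumulative, observable result.
template <class Equal>
std::uint64_t count_adjacent_matches(const std::vector<std::string>& v,
                                     Equal equal) {
    std::uint64_t hits = 0;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        if (equal(v[i], v[i + 1])) ++hits;
    return hits;
}

// Wall-clock milliseconds taken by `work()`; its result is stored in `out`.
template <class Work>
double time_ms(Work work, std::uint64_t& out) {
    const auto t0 = std::chrono::steady_clock::now();
    out = work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

bool by_operator(const std::string& a, const std::string& b) { return a == b; }
bool by_compare(const std::string& a, const std::string& b) { return a.compare(b) == 0; }
bool by_strcmp(const std::string& a, const std::string& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}
```

A driver would call time_ms once per comparison function on the same vector and print the match count next to each time. The three counts should agree (the generated strings contain no NULs), which doubles as a sanity check that the work was really done.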
Either your timings are screwy, or your compiler has optimised some of your code out of existence.
Think about it: ten billion operations in 0.000349 milliseconds (I'll use 0.000500 milliseconds, or half a microsecond, to make my calculations easier) means that you're performing twenty quadrillion (2 × 10^16) operations per second.
Even if one operation could be done in a single clock cycle, that would be 20,000,000 GHz, well beyond the current crop of CPUs, even with their massively optimised pipelines and multiple cores.
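The back-of-the-envelope check can be written down as a small sketch; the function names are made up, and the one-operation-per-cycle assumption comes from the paragraph above:

```cpp
// Implied throughput if `operations` really ran in `elapsed_ms` milliseconds.
constexpr double ops_per_second(double operations, double elapsed_ms) {
    return operations / (elapsed_ms / 1000.0);
}

// Clock rate in GHz needed to retire that many operations at one per cycle.
constexpr double required_ghz(double operations, double elapsed_ms) {
    return ops_per_second(operations, elapsed_ms) / 1e9;
}
```

With the question's numbers, ops_per_second(1e10, 0.0005) comes out at roughly 2e16, and required_ghz(1e10, 0.0005) at roughly 2e7.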
And, given that the -O2 optimised figures are more on par with each other (== taking about double the time of compare), the "code optimised out of existence" possibility is looking far more likely.
The doubling of time could easily be explained as a billion extra function calls, due to operator== needing to call compare to do its work.
As further support, examine the following table, showing figures in milliseconds (third column is simple divide-by-ten scale of second column so that both first and third columns are for a billion iterations):
           -O2/1billion  -O3/10billion  -O3/1billion  Improvement
               (a)            (b)       (c = b / 10)    (a / c)
           ============  =============  ============  ===========
oper==            19151           5987           599           32
compare            8319         0.0005       0.00005  166,380,000
It beggars belief that -O3 could speed up the == code by a factor of about 32 but manage to speed up the compare code by a factor of a few hundred million.
I strongly suggest you have a look at the assembler code generated by your compiler (such as with the gcc -S option) to verify that it's actually doing the work it's claiming to do.
The problem is that the compiler is making a lot of serious optimizations to your code.
Here's the modified code:
#include <iostream>
#include <vector>
#include <string>
#include <stdint.h>
#include "Timer.h"
#include <random>
#include <time.h>
#include <string.h>
using namespace std;
uint64_t itr = 500000000;//500 million
int len = 100;
int main() {
srand(time(0));
string s1(len,random()%128);
string s2(len,random()%128);
uint64_t a = 0;
Timer t;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1 == s2)
a += i;
}
t.end();
cout<<"== took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(s1.compare(s2)==0)
a+=i;
}
t.end();
cout<<".compare took:"<<t.elapsedMillis()<<",a="<<a<<endl;
t.begin();
for(uint64_t i =0;i<itr;i++){
asm volatile("" : "+g"(s2));
if(strcmp(s1.c_str(),s2.c_str()) == 0)
a+=i;
}
t.end();
cout<<"strcmp took:"<<t.elapsedMillis()<<",a="<<a<< endl;
return a;
}
where I've added asm volatile("" : "+g"(s2)); to force the compiler to run the comparison on every iteration. I've also added <<",a="<<a to the output to force the compiler to compute a.
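The same trick can be wrapped in a small helper. This is a sketch for GCC and Clang only (it relies on their extended asm syntax), it uses a "+m" constraint plus a "memory" clobber as a variation on the "+g" above, and the names opaque and count_equal are hypothetical:

```cpp
#include <cstdint>
#include <string>

// An empty asm statement that claims to read and write `value` and to
// clobber memory. It emits no instructions, but the optimiser can no longer
// assume that `value` (or anything it points to) is unchanged afterwards.
template <class T>
inline void opaque(T& value) {
    asm volatile("" : "+m"(value) : : "memory");
}

// Counts how many of `iterations` comparisons report equality. Because of
// the barrier, the comparison cannot be hoisted out of the loop.
std::uint64_t count_equal(const std::string& lhs, std::string& rhs,
                          std::uint64_t iterations) {
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        opaque(rhs);
        if (lhs == rhs) ++hits;
    }
    return hits;
}
```

The barrier does not change the result, only what the optimiser may assume, so count_equal still returns `iterations` for equal strings and 0 for different ones.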
The output is now:
== took:10221.5,a=0
.compare took:10739,a=0
strcmp took:9700,a=0
Can you explain why strcmp is faster than .compare, which in turn is slower than ==? The speed differences are marginal, but significant.
It actually makes sense! :p
The speed analysis below is wrong - thanks to Tony D for pointing out my error. The criticisms and advice for better benchmarks still apply though.
All the previous answers deal with the compiler optimisation issues in your benchmark, but don't answer why strcmp is still slightly faster.
strcmp is likely faster (in the corrected benchmarks) due to the strings sometimes containing zeros. Since strcmp uses C-strings it can exit when it comes across the string termination char '\0'. std::string::compare() treats '\0' as just another char and continues until the end of the string array.
Since you have non-deterministically seeded the RNG, and only generated two strings, your results will change with every run of the code. (I'd advise against this in benchmarks.) Given the numbers, 28 times out of 128, there ought to be no advantage. 10 times out of 128 you will get more than a 10-fold speed up. And so on.
As well as defeating the compiler's optimiser, I would suggest that, next time, you generate a new string for each comparison iteration, allowing you to average away such effects.
Compiled the code with gcc -O3 -S --std=c++1y. The result is here. gcc version is:
gcc (Ubuntu 4.9.1-16ubuntu6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Looking at it, we can see that the first loop (operator==) looks like this (comments added by me):
movq itr(%rip), %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L25
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
xorl %ebx, %ebx
movq -24(%rsi), %rdx ; length of string1
cmpq -24(%rdi), %rdx ; compare lengths
je .L53 ; compare content only when length is the same
.L10
; end of loop, print out follows
;....
.L53:
.cfi_restore_state
call memcmp ; compare content
xorl %edx, %edx ; zero loop count
.p2align 4,,10
.p2align 3
.L13:
testl %eax, %eax ; check result
cmove %rdx, %rbx ; a = i
addq $1, %rdx ; i++
cmpq %rbp, %rdx ; i < itr?
jne .L13
jmp .L10
; ....
.L25:
xorl %ebx, %ebx
jmp .L10
We can see that operator== is inlined; only a call to memcmp remains. And for operator==, if the lengths differ, the content is not compared.
Most importantly, the comparison is done only once. The loop body contains only i++, a = i and i < itr.
For the second loop (compare()):
movq itr(%rip), %r12
movq %rax, %r13
movq %rax, 56(%rsp)
testq %r12, %r12
je .L14
movq 16(%rsp), %rdi
movq 32(%rsp), %rsi
movq -24(%rdi), %rbp
movq -24(%rsi), %r14 ; read and compare length
movq %rbp, %rdx
cmpq %rbp, %r14
cmovbe %r14, %rdx ; save the shorter length of the two string to %rdx
subq %r14, %rbp ; length difference in %rbp
call memcmp ; content is always compared
movl $2147483648, %edx ; 0x80000000 sign extended
addq %rbp, %rdx ; revert the sign bit of %rbp (length difference) and save to %rdx
testl %eax, %eax ; memcmp returned 0?
jne .L14 ; no, string different
testl %ebp, %ebp ; memcmp returned 0. Are lengths the same (%ebp == 0)?
jne .L14 ; no, string different
movl $4294967295, %eax ; string compare equal
subq $1, %r12 ; itr - 1
cmpq %rax, %rdx
cmovbe %r12, %rbx ; a = itr - 1
.L14:
; output follows
There is no loop at all here.
In compare(), as it must return a positive, negative, or zero value based on the comparison, the string content is always compared. memcmp is called once.
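In C++ terms, the two strategies visible in the listings above could be sketched roughly as follows. This is a reading of the generated assembly, not libstdc++'s actual source, and the function names are made up:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// What the inlined operator== appears to do: check the lengths first and
// only call memcmp when they match.
bool equal_length_first(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}

// What compare() has to do: memcmp over the common prefix, then fall back
// to the length difference to decide the sign.
int three_way(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const int r = std::memcmp(a.data(), b.data(), n);
    if (r != 0) return r;
    if (a.size() < b.size()) return -1;
    if (a.size() > b.size()) return 1;
    return 0;
}
```

For two strings of different lengths, equal_length_first returns without touching the contents, whereas three_way always inspects the common prefix first.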
For the third loop (strcmp()), the assembly is the simplest:
movq itr(%rip), %rbp ; itr to %rbp
movq %rax, %r12
movq %rax, 56(%rsp)
testq %rbp, %rbp
je .L16
movq 32(%rsp), %rsi
movq 16(%rsp), %rdi
subq $1, %rbp ; itr - 1 to %rbp
call strcmp
testl %eax, %eax ; test compare result
cmovne %rbp, %rbx ; if not equal, save itr - 1 to %rbx (a)
.L16:
There is also no loop at all here. strcmp is called, and if the strings are not equal (as in your code), itr-1 is saved to a directly.
So your benchmark cannot measure the running time of operator==, compare() or strcmp(). They are all called only once, so it cannot show any difference in running time.
As to why operator== takes the most time: for operator==, the compiler for some reason did not eliminate the loop. The loop takes time (even though it does not contain any string comparison at all).
And from the assembly shown, we may assume that operator== may be the fastest, because it won't compare the string contents at all if the lengths of the two strings differ. (Of course, this is under gcc 4.9.1 with -O3.)