简体   繁体   中英

Multi-threaded performance std::string

We are running some code on a project that uses OpenMP and I've run into something strange. I've included parts of some play code that demonstrates what I see.

The tests compare calling a function with a const char* argument with a std::string argument in a multi-threaded loop. The functions essentially do nothing and so have no overhead.

What I do see is a major difference in the time it takes to complete the loops. For the const char* version doing 100,000,000 iterations the code takes 0.075 seconds to complete compared with 5.08 seconds for the std::string version. These tests were done on Ubuntu-10.04-x64 with gcc-4.4.

My question is basically whether this is solely due the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?

Code below and many thanks for your responses.

Compiled with: g++ -Wall -Wextra -O3 -fopenmp string_args.cpp -o string_args

#include <iostream>
#include <map>
#include <string>
#include <stdint.h>

// For wall time
#ifdef _WIN32
#include <time.h>
#else
#include <sys/time.h>
#endif

namespace
{
  const int64_t g_max_iter = 100000000;
  std::map<const char*, int> g_charIndex = std::map<const char*,int>();
  std::map<std::string, int> g_strIndex = std::map<std::string,int>();

  class Timer
  {
  public:
    Timer()
    {
    #ifdef _WIN32
      m_start = clock();
    #else /* linux & mac */
      gettimeofday(&m_start,0);
    #endif
    }

    float elapsed()
    {
    #ifdef _WIN32
      clock_t now = clock();
      const float retval = float(now - m_start)/CLOCKS_PER_SEC;
      m_start = now;
    #else /* linux & mac */
      timeval now;
      gettimeofday(&now,0);
      const float retval = float(now.tv_sec - m_start.tv_sec) + float((now.tv_usec - m_start.tv_usec)/1E6);
      m_start = now;
    #endif
      return retval;
    }

  private:
    // The type of this variable is different depending on the platform
#ifdef _WIN32
    clock_t
#else
    timeval
#endif
    m_start;   ///< The starting time (implementation dependent format)
  };

}

bool contains_char(const char * id)
{
  if( g_charIndex.empty() ) return false;
  return (g_charIndex.find(id) != g_charIndex.end());
}

bool contains_str(const std::string & name)
{
  if( g_strIndex.empty() ) return false;
  return (g_strIndex.find(name) != g_strIndex.end());
}

void do_serial_char()
{
  int found(0);
  Timer clock;
  for( int64_t i = 0; i < g_max_iter; ++i )
  {
    if( contains_char("pos") )
    {
     ++found;
    }
  }
  std::cout << "Loop time: " << clock.elapsed() << "\n";
  ++found;
}

void do_parallel_char()
{
  int found(0);
  Timer clock;
#pragma omp parallel for
  for( int64_t i = 0; i < g_max_iter; ++i )
  {
    if( contains_char("pos") )
    {
     ++found;
    }
  }
  std::cout << "Loop time: " << clock.elapsed() << "\n";
  ++found;
}

void do_serial_str()
{
  int found(0);
  Timer clock;
  for( int64_t i = 0; i < g_max_iter; ++i )
  {
    if( contains_str("pos") )
    {
     ++found;
    }
  }
  std::cout << "Loop time: " << clock.elapsed() << "\n";
  ++found;
}

void do_parallel_str()
{
  int found(0);
  Timer clock;
#pragma omp parallel for
  for( int64_t i = 0; i < g_max_iter ; ++i )
  {
    if( contains_str("pos") )
    {
     ++found;
    }
  }
  std::cout << "Loop time: " << clock.elapsed() << "\n";
  ++found;
}

int main()
{
  std::cout << "Starting single-threaded loop using std::string\n";
  do_serial_str();
  std::cout << "\nStarting multi-threaded loop using std::string\n";
  do_parallel_str();

  std::cout << "\nStarting single-threaded loop using char *\n";
  do_serial_char();
  std::cout << "\nStarting multi-threaded loop using const char*\n";
  do_parallel_char();
  }

My question is basically whether this is solely due the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?

Yes, it is due to the allocation and copying for std::string on every iteration.

A sufficiently smart compiler could potentially optimize this, but it is unlikely to happen with current optimizers. Instead, you can hoist the string yourself:

void do_parallel_str()
{
  int found(0);
  Timer clock;
  std::string const str = "pos";  // you can even make it static, if desired
#pragma omp parallel for
  for( int64_t i = 0; i < g_max_iter; ++i )
  {
    if( contains_str(str) )
    {
      ++found;
    }
  }
  //clock.stop();  // Or use something to that affect, so you don't include
  // any of the below expression (such as outputing "Loop time: ") in the timing.
  std::cout << "Loop time: " << clock.elapsed() << "\n";
  ++found;
}

Does changing:

if( contains_str("pos") )

to:

static const std::string str = "pos";
if( str )

Change things much? My current best guess is that the implicit constructor call for std::string every loop would introduce a fair bit of overhead and optimising it away whilst possible is still a sufficiently hard problem I suspect.

std::string (in your case temporary) requires dynamic allocation, which is a very slow operation, compared to everything else in your loop. There are also old implementations of standard library that did COW, which also slow in multi-threaded environment. Having said that, there is no reason why compiler cannot optimize temporary string creation and optimize away the whole contains_str function call, unless you have some side effects there. Since you didn't provide implementation for that function, it's impossible to say if it could be completely optimized away.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM