Fast string splitting with multiple delimiters

Question

I have been searching StackOverflow for a while for a good algorithm to split a string with multiple delimiters into a vector< string >. I found several methods:

The Boost way:

boost::split(vector, string, boost::is_any_of(" \t"));

The getline way:

std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
    vector.push_back(item);
}

Boost's tokenizer way:

char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
   vector.push_back(t);
}

And the cool STL way:

     istringstream iss(string);
     copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter<vector<string> >(vector));

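And Shadow2531's method (see the linked topic).

Most of them come from this topic. Unfortunately, they do not solve my problem:

Boost's split is easy to use, but with big data (about 1.5 * 10^6 single elements in the best case) and about 10 delimiters it is horribly slow.
The getline, STL and Shadow2531 methods have the problem that I can only use a single char as the delimiter. I need more.
Boost's tokenizer is even worse in terms of speed. It took 11 seconds to split a string into 1.5 * 10^6 elements with 10 delimiters.

So I do not know what to do: I want a really fast string-splitting algorithm that supports multiple delimiters.

Is Boost's split the best there is, or is there a way to do it faster?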

Answer 1

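Two things come to mind:

Use string views instead of strings as the split result; this saves a lot of allocations.
If you know you will only be working with chars (in the [0,255] range), try using a bitset to test membership instead of searching for the character among the delimiters.

Here is a quick attempt at applying these ideas: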

#include <vector>
#include <bitset>
#include <cstring>   // strpbrk
#include <iostream>
#include <string>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/timer.hpp>

using namespace std;
size_t const N = 10000000;

template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
  C output;

  bitset<256> delims;  // one bit per possible unsigned char value (0..255)
  while( *d )
  {
    unsigned char code = *d++;
    delims[code] = true;
  }
  typedef string::const_iterator iter;
  iter beg;
  bool in_token = false;
  for( string::const_iterator it = s.begin(), end = s.end();
    it != end; ++it )
  {
    if( delims[static_cast<unsigned char>(*it)] )  // avoid a negative index when char is signed
    {
      if( in_token )
      {
        output.push_back(typename C::value_type(beg, it));
        in_token = false;
      }
    }
    else if( !in_token )
    {
      beg = it;
      in_token = true;
    }
  }
  if( in_token )
    output.push_back(typename C::value_type(beg, s.end()));
  output.swap(ret);
}

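template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
  C output;

  // strpbrk stops at the first '\0', so s must not contain embedded NULs.
  char const* p = s.c_str();
  char const* const end = p + s.size();
  for( char const* q = strpbrk(p, delims); q != NULL; q = strpbrk(p, delims) )
  {
    output.push_back(typename C::value_type(p, q));
    p = q + 1;
  }
  if( p != end )  // text after the last delimiter
    output.push_back(typename C::value_type(p, end));

  output.swap(ret);
}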

template<typename C>
void test_boost(string const& s, char const* delims)
{
  C output;
  boost::split(output, s, boost::is_any_of(delims));
}

int main()
{
  // Generate test text: single letters separated alternately by a tab or a space
  string text(N, ' ');
  for( size_t i = 0; i != N; ++i )
    text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'\t');

  char const* delims = " \t[],-'/\\!\"§$%&=()<>?";

  // Boost split, output strings
  boost::timer timer;
  test_boost<vector<string> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Boost split, output string views
  typedef string::const_iterator iter;
  typedef boost::iterator_range<iter> string_view;
  timer.restart();
  test_boost<vector<string_view> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split, output strings
  timer.restart();
  vector<string> vs;
  test_custom(text, delims, vs);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split, output string views
  timer.restart();
  vector<string_view> vsv;
  test_custom(text, delims, vsv);
  cout << "Time: " << timer.elapsed() << endl;

  // strpbrk split, output strings
  timer.restart();
  vector<string> vsp;
  test_strpbrk(text, delims, vsp);
  cout << "Time: " << timer.elapsed() << endl;

  // strpbrk split, output string views
  timer.restart();
  vector<string_view> vsvp;
  test_strpbrk(text, delims, vsvp);
  cout << "Time: " << timer.elapsed() << endl;

  return 0;
}

Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4 flag enabled, I get:

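Time: 5.951 (Boost.Split + vector<string>)
Time: 3.728 (Boost.Split + vector<string_view>)
Time: 1.662 (Custom split + vector<string>)
Time: 0.144 (Custom split + vector<string_view>)
Time: 2.13 (Strpbrk + vector<string>)
Time: 0.527 (Strpbrk + vector<string_view>)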

NOTE: The output is slightly different, because my custom function drops empty tokens. If you decide to use this code, you can adapt it to your needs.
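
As an illustration of the kind of adaptation the note mentions, here is a minimal sketch of a variant that keeps empty tokens, which should be close to what boost::split produces by default. It is not part of the original answer: the name split_keep_empty is hypothetical, and it assumes a C++17 compiler so that std::string_view can stand in for the iterator_range views used above.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the answer): split `s` on any character in
// `delims` and keep empty tokens. Requires C++17 for std::string_view.
std::vector<std::string_view> split_keep_empty(std::string_view s,
                                               std::string_view delims)
{
  // 256-entry lookup table indexed by the unsigned char value.
  std::array<bool, 256> is_delim{};
  for (unsigned char c : delims)
    is_delim[c] = true;

  std::vector<std::string_view> out;
  std::size_t start = 0;
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    if (is_delim[static_cast<unsigned char>(s[i])])
    {
      out.push_back(s.substr(start, i - start)); // may be an empty token
      start = i + 1;
    }
  }
  out.push_back(s.substr(start)); // final token, possibly empty
  return out;
}
```

The returned views point into the caller's buffer, so they are only valid while that buffer is alive and unmodified.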

Answer 2

To combine the best parts of Pablo's and larsmans's answers, use (offset, size) pairs to store the substrings and strcspn to get the extent of each entry.
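
A rough sketch of that combination might look as follows. The function name split_offsets is made up for illustration, and the code assumes the input contains no embedded '\0' characters, since strcspn stops at the first NUL. Like the custom split in Answer 1, it drops empty tokens.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, size) pairs into `s`, one per
// non-empty token. Assumes `s` holds no embedded '\0'.
std::vector<std::pair<std::size_t, std::size_t> >
split_offsets(std::string const& s, char const* delims)
{
  std::vector<std::pair<std::size_t, std::size_t> > out;
  char const* const base = s.c_str();
  char const* p = base;
  while (*p)
  {
    // Length of the leading run that contains no delimiter.
    std::size_t len = std::strcspn(p, delims);
    if (len > 0)
      out.push_back(std::make_pair(static_cast<std::size_t>(p - base), len));
    p += len;
    if (*p)
      ++p; // skip the delimiter that ended the token
  }
  return out;
}
```

A token can then be materialized on demand with s.substr(offset, size). Unlike pointer or iterator views, the offsets remain meaningful if the string is copied or moved.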

Answer 3

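With strings this large, it might pay off to use ropes. Or use string views as Pablo recommends: (char const*, size_t) pairs. If you have a good strpbrk implementation, the bitset trick is not necessary.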

Answer 1
33, accepted, 2011-03-31 20:55:00

Answer 2
2, 2011-03-31 22:24:21

Answer 3
1, 2011-03-31 21:00:16
