
boost::tokenizer vs boost::split

I am trying to split a C++ string on every '^' character into a vector of tokens. I have always used the boost::split method, but I am now writing performance-critical code and would like to know which one gives better performance.

For example:

string message = "A^B^C^D";
vector<string> tokens;
boost::split(tokens, message, boost::is_any_of("^"));

vs.

boost::char_separator<char> sep("^");
boost::tokenizer<boost::char_separator<char> > tokens(text, sep);

Which one would give better performance and why?

The best choice depends on a few factors. If you only need to scan the tokens once, then boost::tokenizer is a good choice in both runtime and space performance (those vectors of tokens can take up a lot of space, depending on the input data).
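To make the one-pass case concrete, here is a minimal sketch (handleToken is a hypothetical stand-in for whatever per-token work you do) that consumes each token as it is produced, without building any container:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

void handleToken(const std::string& tok)   // hypothetical per-token work
{
    std::cout << tok << '\n';
}

int main()
{
    std::string message = "A^B^C^D";
    boost::char_separator<char> sep("^");
    boost::tokenizer<boost::char_separator<char> > tokens(message, sep);

    // Each token is visited exactly once; no vector of copies is built.
    for (const std::string& tok : tokens)
        handleToken(tok);
}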

If you're going to be scanning the tokens often, or need a vector with efficient random access, then splitting into a vector with boost::split may be the better option.

For example, in your "A^B^C^...^Z" input string where the tokens are 1 byte in length, the boost::split/vector<string> method will consume at least 2*N-1 bytes. Given the way strings are stored in most STL implementations, you can figure on it taking more than 8x that amount. Storing these strings in a vector is costly in terms of both memory and time.
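As a rough back-of-the-envelope illustration of that arithmetic (the per-string object overhead is implementation-dependent; treating sizeof(std::string) as the fixed per-token cost is an assumption here):

#include <iostream>
#include <string>

int main()
{
    std::size_t const N = 10000000;                   // 10 million 1-byte tokens
    std::size_t const sourceBytes = 2 * N - 1;        // "A^B^C^...": token bytes + separators
    std::size_t const perToken = sizeof(std::string); // fixed object overhead per stored token

    std::cout << "source string:               " << sourceBytes  << " bytes\n"
              << "vector<string> objects only: " << N * perToken << " bytes\n";
    // Any token too long for the small-string buffer would also cost a heap
    // allocation on top of the per-object overhead.
}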

I ran a quick test on my machine and a similar pattern with 10 million tokens looked like this:

  • boost::split = 2.5s and ~620MB
  • boost::tokenizer = 0.9s and 0MB

If you're just doing a one-time scan of the tokens, then clearly the tokenizer is better. But, if you're shredding into a structure that you want to reuse during the lifetime of your application, then having a vector of tokens may be preferred.

If you want to go the vector route, then I'd recommend not using a vector<string>, but a vector of string::iterator pairs instead. Just shred into a pair of iterators and keep your big string of tokens around for reference. For example:

using namespace std;
// s is the original delimited string; each element of tokens is a
// [begin, end) iterator pair pointing into s -- no token is copied.
vector<pair<string::const_iterator,string::const_iterator> > tokens;
boost::split(tokens, s, boost::is_any_of("^"));
for(auto beg=tokens.begin(); beg!=tokens.end();++beg){
   cout << string(beg->first,beg->second) << endl;
}

This improved version takes 1.6s and 390MB on the same server and test. And, best of all, the memory overhead of this vector is linear with the number of tokens -- not dependent in any way on the length of the tokens -- whereas a std::vector<string> stores a copy of each token.

I find rather different results using clang++ -O3 -std=c++11 -stdlib=libc++.

First I read a text file of ~470k comma-separated words, with no newlines, into a giant string, like so:

// (assumes using namespace std and a filesystem namespace such as boost::filesystem)
path const inputPath("input.txt");

filebuf buf;
buf.open(inputPath.string(),ios::in);
if (!buf.is_open())
    return cerr << "can't open" << endl, 1;

string str(filesystem::file_size(inputPath),'\0');
buf.sgetn(&str[0], str.size());
buf.close();

Then I ran various timed tests, storing the results into a pre-sized vector that was cleared between runs. For example:

void vectorStorage(string const& str)
{
    static size_t const expectedSize = 471785;

    vector<string> contents;
    contents.reserve(expectedSize+1);

    ...

    {
        timed _("split is_any_of");
        split(contents, str, is_any_of(","));
    }
    if (expectedSize != contents.size()) throw runtime_error("bad size");
    contents.clear();

    ...
}

For reference, the timer is just this:

// RAII timer: captures the start time on construction and prints the elapsed
// milliseconds from its destructor.
struct timed
{
    ~timed()
    {
        auto duration = chrono::duration_cast<chrono::duration<double, ratio<1,1000>>>(chrono::high_resolution_clock::now() - start_);

        cout << setw(40) << right << name_ << ": " << duration.count() << " ms" << endl;
    }

    timed(std::string name="") :
        name_(name)
    {}


    chrono::high_resolution_clock::time_point const start_ = chrono::high_resolution_clock::now();
    string const name_;
};

I also clocked a single iteration (no vector). Here are the results:

Vector: 
                              hand-coded: 54.8777 ms
                         split is_any_of: 67.7232 ms
                     split is_from_range: 49.0215 ms
                               tokenizer: 119.37 ms
One iteration:
                               tokenizer: 97.2867 ms
                          split iterator: 26.5444 ms
            split iterator back_inserter: 57.7194 ms
                split iterator char copy: 34.8381 ms

The tokenizer is so much slower than split that the one-iteration figure doesn't even include the string copy:

{
    string word;
    word.reserve(128);

    timed _("tokenizer");
    boost::char_separator<char> sep(",");
    boost::tokenizer<boost::char_separator<char> > tokens(str, sep);

    for (auto range : tokens)
    {}  // tokens are visited but never copied out into word
}

{
    string word;

    timed _("split iterator");
    for (auto it = make_split_iterator(str, token_finder(is_from_range(',', ',')));
         it != decltype(it)(); ++it)
    {
        word = move(copy_range<string>(*it));  // copy the token out into a string
    }
}

Unambiguous conclusion: use split.

It might depend on your version of Boost and how you're using the functionality.

We had a performance issue in some logic that was using boost::split (Boost 1.41.0) to handle thousands or hundreds of thousands of smaller strings (fewer than 10 tokens expected per string). When I ran the code through a performance analyzer, we found that a surprising 39% of the time was spent in boost::split.

We tried some simple "fixes" that didn't affect performance materially, like "we know we won't have more than 10 items on each pass, so preset the vector to 10 items".

Since we didn't actually need the vector and could just iterate the tokens and accomplish the same job, we changed the code to boost::tokenizer and the same section of code dropped to <1% of the runtime.
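The answer doesn't show the code, but the shape of the change is roughly this (a sketch; processField and the surrounding logic are hypothetical):

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>

void processField(const std::string& f)  // hypothetical per-field work
{
    std::cout << f << '\n';
}

// Before: build a vector of copies just to walk it once.
void parseWithSplit(const std::string& line)
{
    std::vector<std::string> fields;
    boost::split(fields, line, boost::is_any_of(","));
    for (const std::string& f : fields)
        processField(f);
}

// After: iterate the tokens directly; no intermediate vector is kept.
void parseWithTokenizer(const std::string& line)
{
    boost::char_separator<char> sep(",");
    boost::tokenizer<boost::char_separator<char> > tokens(line, sep);
    for (const std::string& f : tokens)
        processField(f);
}

int main()
{
    parseWithSplit("a,b,c,d");
    parseWithTokenizer("a,b,c,d");
}

One behavioural difference to keep in mind: char_separator drops empty tokens by default, whereas split with is_any_of produces them.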

Processing the tokens as you produce them is the key. I have a setup with a regex and it seems to be as fast as boost::tokenizer. If I store the matches in a vector, it's at least 50 times slower.
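That answer doesn't show its regex setup; a minimal sketch of the idea, using std::sregex_token_iterator so that each token is consumed as it is produced rather than being collected into a vector, might look like this:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string message = "A^B^C^D";
    std::regex sep("\\^");

    // The -1 submatch index selects the text between separator matches,
    // i.e. the tokens themselves.
    std::sregex_token_iterator it(message.begin(), message.end(), sep, -1), end;
    for (; it != end; ++it)
        std::cout << *it << '\n';   // process each token immediately; nothing is stored
}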
