
Most efficient way to iterate over words in a string

If I wanted to iterate over the individual words in a string (separated by whitespace), the obvious solution would be:

std::istringstream s(myString);

std::string word;
while (s >> word)
    ;  // do things with word

However, that's quite inefficient. The entire string is copied when the string stream is initialized, and then each extracted word is copied, one at a time, into the word variable (which comes close to copying the entire string a second time). Is there a way to improve on this without manually iterating over each character?

In most cases, copying accounts for a very small percentage of the overall cost, so having clean, highly readable code is more important. In the rare cases where a profiler tells you that copying is a bottleneck, you can iterate over the characters in the string with some help from the standard library.

One approach is to iterate with the std::string::find_first_of and std::string::find_first_not_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}


Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes required.
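If C++17 is available, part of the readability cost can be hidden behind a small helper. The following is only a sketch, not part of the original answer: for_each_word and split_words are hypothetical names, and the code assumes std::string_view is acceptable in your code base. It runs the same find_first_not_of / find_first_of loop as above but hands each word to a callback as a view into the original string, so no word is copied.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the answer): calls fn(word) for every
// delimiter-separated word in s, passing a view into s instead of a copy.
template <typename Fn>
void for_each_word(std::string_view s, Fn fn,
                   std::string_view ws = " \t\r\n")
{
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t from = s.find_first_not_of(ws, pos);
        if (from == std::string_view::npos) {
            break;  // only delimiters remain
        }
        std::size_t to = s.find_first_of(ws, from);
        if (to == std::string_view::npos) {
            to = s.size();  // the last word runs to the end of the string
        }
        fn(s.substr(from, to - from));  // string_view::substr does not allocate
        pos = to;
    }
}

// Illustration only: collects the views into a vector. The vector itself
// allocates, and the views stay valid only while the viewed string is
// alive and unmodified.
inline std::vector<std::string_view> split_words(std::string_view s)
{
    std::vector<std::string_view> words;
    for_each_word(s, [&words](std::string_view w) { words.push_back(w); });
    return words;
}
```

Calling for_each_word(s, [](std::string_view w) { std::cout << w << '\n'; }) would print one word per line. The views must not outlive the string they point into.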

Using the Boost string algorithms, we can write it as follows. The loop doesn't involve any copying of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string s = "stack over   flow";

    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;

    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }

    return 0;
}

Making it C++11-ish

Since pairs of iterators are so old-school nowadays, we may use Boost.Range to define some generic helper functions. These finally allow us to loop over the words using range-for:

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>

template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;

template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}

template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}

int main()
{
    std::string str = "stack \tover\r\n  flow";

    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }

    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

If you want it to be as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe companion strtok_r()):

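// requires #include <cstring>
const char* kWhiteSpace = " \t\v\n\r";    // whatever you call white space

// myString must be a non-const std::string; data() returns char* since C++17
// (use &myString[0] with older standards).
char* token = std::strtok(myString.data(), kWhiteSpace);
while (token) {
    // do things with token
    token = std::strtok(nullptr, kWhiteSpace);
}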

Beware that this will clobber the contents of myString: it works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of each token in turn. This is a legacy C function after all.

However, that weakness is also its strength: it does not perform any copy, nor does it allocate any dynamic memory (which is likely the most time-consuming thing in your example code). As such, you won't find a native C++ method that beats strtok()'s speed.
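To make the clobbering concrete, here is a small sketch. It is not from the original answer: count_tokens_in_place is a hypothetical helper, and &s[0] is used instead of data() so that it also compiles as C++11. It tokenizes a std::string in place without allocating, and afterwards the string holds embedded null bytes where the first delimiter after each token used to be.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts the tokens in s with std::strtok.
// s is modified in place: the delimiter that ends each token becomes '\0'.
inline std::size_t count_tokens_in_place(std::string& s,
                                         const char* delims = " \t\v\n\r")
{
    std::size_t count = 0;
    // &s[0] yields a writable, null-terminated buffer since C++11.
    for (char* token = std::strtok(&s[0], delims); token != nullptr;
         token = std::strtok(nullptr, delims)) {
        ++count;  // token points into s; nothing is copied or allocated
    }
    return count;
}
```

After the call, s.size() is unchanged, but reading s through c_str() stops at the end of the first token. If the original text is still needed afterwards, tokenize a copy instead, which gives up part of the speed advantage.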

What about splitting the string? You can check this post for more information.

That post has a detailed answer on how to split a string into tokens. In that answer, you could look at the second approach, which uses iterators and the copy algorithm.
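The linked post is not reproduced here, so the following is only a guess at what "iterators and the copy algorithm" refers to: the common idiom of feeding std::istream_iterator into std::copy. The function name split_with_copy is hypothetical. Note that this version copies the string into the stream and copies every word, so it favors brevity over the efficiency asked about in the question.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace using stream iterators.
inline std::vector<std::string> split_with_copy(const std::string& sentence)
{
    std::istringstream iss(sentence);  // copies the whole input
    std::vector<std::string> tokens;
    // A default-constructed istream_iterator marks the end of the stream.
    std::copy(std::istream_iterator<std::string>(iss),
              std::istream_iterator<std::string>(),
              std::back_inserter(tokens));
    return tokens;
}
```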
