Most efficient way to iterate over the words in a string

Question

If I want to iterate over the individual words in a string (separated by whitespace), the obvious solution would be:

std::istringstream s(myString);
std::string word;
while (s >> word) {
    // do things with word
}

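However, this is quite inefficient. The entire string is copied when the string stream is initialized, and then each extracted word is copied, one at a time, into the word variable (which comes close to copying the entire string a second time). Is there a way to improve on this without manually iterating over each character?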

Answer 1

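In most cases, copying accounts for only a small fraction of the overall cost, so having clean, highly readable code matters more. In the rare cases where a profiler tells you that copying creates a bottleneck, you can iterate over the characters of the string with some help from the standard library.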

One approach you could take is to iterate with the std::string::find_first_of and std::string::find_first_not_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}

Demo.

Unfortunately, the code becomes harder to read, so you should avoid this change, or at least postpone it until it is required.
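One way to keep the loop above from leaking into calling code is to wrap it in a small helper. The following is only a sketch, assuming C++17 is available for std::string_view; the names for_each_word and split_words are made up for illustration and are not part of the standard library or of the answer above.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: calls fn once per word, passing a
// std::string_view that points into the caller's buffer.
template <typename Fn>
void for_each_word(std::string_view s, Fn fn,
                   std::string_view ws = " \t\r\n") {
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t from = s.find_first_not_of(ws, pos);
        if (from == std::string_view::npos) {
            break;  // only whitespace is left
        }
        std::size_t to = s.find_first_of(ws, from);
        if (to == std::string_view::npos) {
            to = s.size();  // the last word runs to the end of the input
        }
        fn(s.substr(from, to - from));  // a view: no characters are copied
        pos = to;
    }
}

// Hypothetical example use: collect views of all words. The views stay
// valid only while the underlying string is alive and unmodified.
std::vector<std::string_view> split_words(std::string_view s) {
    std::vector<std::string_view> words;
    for_each_word(s, [&words](std::string_view w) { words.push_back(w); });
    return words;
}
```

The callback version copies no characters and allocates nothing by itself; split_words does allocate for the vector of views, so prefer calling for_each_word directly when the words can be processed one at a time.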

Answer 2

Using the Boost string algorithms, we can write it as follows. The loop doesn't involve any copying of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string s = "stack over   flow";

    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;

    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }

    return 0;
}

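Making it C++11-ish

Since iterator pairs are so old-school nowadays, we can use boost.range to define some generic helper functions. These finally allow us to loop over the words using range-for: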

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>

template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;

template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}

template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}

int main()
{
    std::string str = "stack \tover\r\n  flow";

    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }

    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

Answer 3

If you want it as fast as possible, you need to fall back on the good old C function strtok() (or its thread-safe companion strtok_r()):

const char* kWhiteSpace = " \t\v\n\r";    // whatever you call white space

// Non-const std::string::data() requires C++17; use &myString[0] before that.
char* token = std::strtok(myString.data(), kWhiteSpace);
while (token) {
    // do things with token
    token = std::strtok(nullptr, kWhiteSpace);
}

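Note that this destroys the contents of myString: it works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of each token in turn. This is a legacy C function, after all.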

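However, that weakness is also its strength: it performs no copying and allocates no dynamic memory (which is likely the most time-consuming thing in your example code). As such, you won't find a native C++ method that beats the speed of strtok().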
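To make the destructive behaviour concrete, here is a minimal sketch. The helper name tokenize_in_place is hypothetical; it uses &text[0] instead of data() so that it also compiles before C++17. Note that the vector of pointers is an allocation added purely for illustration, and that std::strtok keeps internal state between calls, so this must not run concurrently with other strtok users.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `text` in place and returns pointers
// to the start of each token inside `text`. After the call, `text`
// holds a '\0' where the first delimiter after each token used to be.
std::vector<const char*> tokenize_in_place(std::string& text,
                                           const char* delims = " \t\v\n\r") {
    std::vector<const char*> tokens;
    // &text[0] is a writable, null-terminated buffer since C++11.
    for (char* tok = std::strtok(&text[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

For the input "stack over   flow", the string keeps its length of 17 characters, but the spaces at indices 5 and 10 are overwritten with null bytes, so printing the string afterwards no longer shows the original text. The returned pointers become dangling as soon as the string is modified or destroyed.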

Answer 4

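How about splitting the string? You can check this post for more information.

That post has a detailed answer on how to split a string into tokens. In that answer, you can look at the second approach, which uses iterators and the copy algorithm.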
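The linked post is not reproduced here, so the following is an assumption about what "iterators and the copy algorithm" refers to: a common idiom that reads words through std::istream_iterator and appends them to a container with std::copy. The function name split_with_copy is made up for this sketch.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace by streaming words out of
// an istringstream and copying them into a vector.
std::vector<std::string> split_with_copy(const std::string& text) {
    std::istringstream iss(text);  // copies the whole input string
    std::vector<std::string> tokens;
    std::copy(std::istream_iterator<std::string>(iss),
              std::istream_iterator<std::string>(),
              std::back_inserter(tokens));  // copies each word again
    return tokens;
}
```

This is compact, but it still builds a string stream and copies every word, so it does not remove the costs the question asks about.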

4 answers

Answer 1
Score 5, accepted, 2017-02-22 17:29:53

Answer 2
Score 0, 2017-02-23 00:21:39

Answer 3
Score 0, 2017-02-23 00:57:05

Answer 4
Score -1, 2017-02-22 17:55:58