Most efficient way to iterate over words in a string

Question

If I wanted to iterate over individual words in a string (separated by whitespace), then the obvious solution would be:

std::istringstream s(myString);

std::string word;
while (s >> word)
    // do things with word

However, that's quite inefficient. The entire string is copied when the string stream is initialized, and then each extracted word is copied one at a time into the word variable (which is close to copying the entire string a second time). Is there a way to improve on this without manually iterating over each character?

Answer 1

In most cases, copying accounts for a very small percentage of the overall cost, so clean, highly readable code matters more. In the rare cases when a profiler tells you that copying is a bottleneck, you can iterate over the characters of the string with some help from the standard library.

One approach is to iterate with the std::string::find_first_of and std::string::find_first_not_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}

Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes required.
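
A possible middle ground, assuming a C++17 compiler: the same find_first_not_of / find_first_of loop can hand out std::string_view slices instead of copied substrings. The sketch below is not part of the original answer; for_each_word and collect_words are hypothetical helper names chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the answer): calls `visit` once per word,
// passing a std::string_view into `text` instead of a copied substring.
template <typename Visitor>
void for_each_word(std::string_view text, Visitor visit,
                   std::string_view ws = " \t\r\n") {
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t from = text.find_first_not_of(ws, pos);
        if (from == std::string_view::npos) {
            break;  // only whitespace remains
        }
        std::size_t to = text.find_first_of(ws, from);
        if (to == std::string_view::npos) {
            to = text.size();  // the last word runs to the end of the text
        }
        visit(text.substr(from, to - from));  // a view, no allocation
        pos = to;
    }
}

// Convenience wrapper for illustration: collects the views into a vector.
// The views stay valid only while the original string is alive and unmodified.
std::vector<std::string_view> collect_words(std::string_view text) {
    std::vector<std::string_view> words;
    for_each_word(text, [&words](std::string_view w) { words.push_back(w); });
    return words;
}
```

Each view points into the original string, so it must not outlive that string or be used after the string is modified. Calling for_each_word(s, [](std::string_view w) { std::cout << w << '\n'; }) would print one word per line without creating a temporary std::string.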

Answer 2

Using the Boost string algorithms, we can write it as follows. The loop doesn't copy any part of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string s = "stack over   flow";

    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;

    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }

    return 0;
}

Making it C++11-ish

Since pairs of iterators are so old-school nowadays, we may use Boost.Range to define some generic helper functions. These finally allow us to loop over the words with a range-based for:

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>

template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;

template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}

template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}

int main()
{
    std::string str = "stack \tover\r\n  flow";

    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }

    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

Answer 3

If you want it to be as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe POSIX companion strtok_r()):

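#include <cstring>    // std::strtok

const char* kWhiteSpace = " \t\v\n\r";    // whatever you call white space

// Note: std::string::data() returns a non-const char* only since C++17;
// before that, use &myString[0].
char* token = std::strtok(myString.data(), kWhiteSpace);
while (token) {
    // do things with token
    token = std::strtok(nullptr, kWhiteSpace);
}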

Beware that this will clobber the contents of myString: it works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of each token in turn. This is a legacy C function after all.

However, that weakness is also its strength: it does not perform any copy, nor does it allocate any dynamic memory (which is likely the most time-consuming thing in your example code). As such, you will have a hard time finding a native C++ method that beats strtok()'s speed.
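
To make the clobbering concrete, here is a minimal sketch, assuming C++17 for the mutable std::string::data(). The helper name tokenize_in_place is hypothetical, and the std::vector is there only so the result can be inspected; it does allocate, so a real hot loop would process each token directly instead.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Illustrative helper (hypothetical name): tokenizes `buffer` in place with
// std::strtok and returns pointers to the start of each token.
// `buffer` is modified: the delimiter that ends each token becomes '\0'.
std::vector<const char*> tokenize_in_place(std::string& buffer,
                                           const char* delimiters = " \t\v\n\r") {
    std::vector<const char*> tokens;
    // std::string::data() returns a mutable char* since C++17.
    for (char* token = std::strtok(buffer.data(), delimiters);
         token != nullptr;
         token = std::strtok(nullptr, delimiters)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

After the call, the string keeps its original length, but the first delimiter after each token has been overwritten with a null byte, and the returned pointers refer to memory owned by the string. Note also that std::strtok keeps its position in hidden static state, so two tokenizations cannot be interleaved.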

Answer 4

What about splitting the string? You can check this post for more information.

That post contains a detailed answer about how to split a string into tokens. In that answer, you could look at the second approach, which uses iterators and the copy algorithm.
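
The referenced post is not reproduced here. Assuming it means the common std::istream_iterator idiom combined with std::copy, a minimal sketch could look like the following (split_words is a hypothetical name). Note that this variant still copies the string into the stream and copies every word, so it does not remove the overhead described in the question.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on whitespace using stream iterators.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream stream(text);  // copies `text` into the stream
    std::vector<std::string> words;
    // istream_iterator<std::string> extracts with operator>>, which skips
    // whitespace; the default-constructed iterator marks end-of-stream.
    std::copy(std::istream_iterator<std::string>(stream),
              std::istream_iterator<std::string>(),
              std::back_inserter(words));
    return words;
}
```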

4 answers

solution1 5 ACCEPTED 2017-02-22 17:29:53

solution2 0 2017-02-23 00:21:39

solution3 0 2017-02-23 00:57:05

solution4 -1 2017-02-22 17:55:58
