Most efficient way to iterate over words in a string

If I wanted to iterate over individual words in a string (separated by whitespace), then the obvious solution would be:

std::istringstream s(myString);

std::string word;
while (s >> word) {
    // do things with word
}

However, that's quite inefficient. The entire string is copied when the string stream is initialized, and then each extracted word is copied one at a time into the word variable (which comes close to copying the entire string a second time). Is there a way to improve on this without manually iterating over each character?

In most cases, copying accounts for a very small percentage of the overall cost, so having clean, highly readable code is more important. In the rare cases where a profiler tells you that copying is a bottleneck, you can iterate over the characters in the string with some help from the standard library.

One approach is to iterate with the std::string::find_first_not_of and std::string::find_first_of member functions, like this:

const std::string s = "quick \t\t brown \t fox jumps over the\nlazy dog";
const std::string ws = " \t\r\n";
std::size_t pos = 0;
while (pos != s.size()) {
    std::size_t from = s.find_first_not_of(ws, pos);
    if (from == std::string::npos) {
        break;
    }
    std::size_t to = s.find_first_of(ws, from+1);
    if (to == std::string::npos) {
        to = s.size();
    }
    // If you want an individual word, copy it with substr.
    // The code below simply prints it character-by-character:
    std::cout << "'";
    for (std::size_t i = from ; i != to ; i++) {
        std::cout << s[i];
    }
    std::cout << "'" << std::endl;
    pos = to;
}

Unfortunately, the code becomes a lot harder to read, so you should avoid this change, or at least postpone it until it becomes required.
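If the readability cost is the main concern, the same find_first_not_of / find_first_of loop can be hidden behind a small helper. The sketch below is not part of the answer above: split_words is a hypothetical name, and it assumes a C++17 compiler so that std::string_view is available. It returns non-owning views into the original text, so no characters are copied (the vector of views itself still allocates).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (an illustration, not from the original answer).
// Returns views of each whitespace-separated word in `text`.
// The views point into the caller's data: they stay valid only while
// that data is alive and unmodified.
std::vector<std::string_view> split_words(std::string_view text,
                                          std::string_view ws = " \t\r\n") {
    std::vector<std::string_view> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Skip leading whitespace to find the start of the next word.
        const std::size_t from = text.find_first_not_of(ws, pos);
        if (from == std::string_view::npos) {
            break;  // only whitespace remains
        }
        // The word ends at the next whitespace character or at the end.
        std::size_t to = text.find_first_of(ws, from);
        if (to == std::string_view::npos) {
            to = text.size();
        }
        assert(to > from);  // every word holds at least one character
        words.push_back(text.substr(from, to - from));
        pos = to;
    }
    return words;
}
```

The calling code can then use an ordinary range-for, for example for (std::string_view word : split_words(s)). Avoid passing a temporary std::string, since the returned views would dangle once the temporary is destroyed.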

Using the Boost String Algorithms library, we can write it as follows. The loop doesn't involve any copying of the string.

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>

int main()
{
    std::string s = "stack over   flow";

    auto it = boost::make_split_iterator( s, boost::token_finder( 
                          boost::is_any_of( " " ), boost::algorithm::token_compress_on ) );
    decltype( it ) end;

    for( ; it != end; ++it ) 
    {
        std::cout << "word: '" << *it << "'\n";
    }

    return 0;
}

Making it C++11-ish

Since pairs of iterators are so old-school nowadays, we can use Boost.Range to define some generic helper functions. These finally allow us to loop over the words with a range-based for:

#include <string>
#include <iostream>
#include <boost/algorithm/string.hpp>
#include <boost/range/iterator_range_core.hpp>

template< typename Range >
using SplitRange = boost::iterator_range< boost::split_iterator< typename Range::const_iterator > >;

template< typename Range, typename Finder >
SplitRange< Range > make_split_range( const Range& rng, const Finder& finder )
{
    auto first = boost::make_split_iterator( rng, finder );
    decltype( first ) last;
    return {  first, last };
}

template< typename Range, typename Predicate >
SplitRange< Range > make_token_range( const Range& rng, const Predicate& pred )
{
    return make_split_range( rng, boost::token_finder( pred, boost::algorithm::token_compress_on ) );
}

int main()
{
    std::string str = "stack \tover\r\n  flow";

    for( const auto& substr : make_token_range( str, boost::is_any_of( " \t\r\n" ) ) )
    {
        std::cout << "word: '" << substr << "'\n";
    }

    return 0;
}

Demo:

http://coliru.stacked-crooked.com/a/2f4b3d34086cc6ec

If you want it to be as fast as possible, you need to fall back to the good old C function strtok() (or its thread-safe POSIX companion strtok_r()):

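const char* kWhiteSpace = " \t\v\n\r";    // whatever you call white space

// std::string::data() returns a writable char* only since C++17;
// with older standards, use &myString[0] instead.
char* token = std::strtok(myString.data(), kWhiteSpace);
while (token) {
    // do things with token
    token = std::strtok(nullptr, kWhiteSpace);
}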

Beware that this will clobber the contents of myString: it works by replacing the first delimiter character after each token with a terminating null byte, and returning a pointer to the start of each token in turn. This is a legacy C function after all.

However, that weakness is also its strength: it does not perform any copy, nor does it allocate any dynamic memory (which is likely the most time-consuming thing in your example code). As such, you are unlikely to find a native C++ method that beats strtok()'s speed.
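To make the clobbering concrete, here is a minimal sketch. The helper name tokenize_in_place is hypothetical and not part of the answer above. It runs std::strtok over a std::string and returns pointers into that same string, so the caller can see that the delimiters have been overwritten.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (an illustration, not from the original answer).
// Tokenizes `buffer` in place and returns pointers to the start of each
// token. `buffer` is modified: the first delimiter after each token is
// overwritten with '\0'. The pointers stay valid only while `buffer`
// is alive and not reallocated.
std::vector<const char*> tokenize_in_place(std::string& buffer,
                                           const char* delimiters = " \t\v\n\r") {
    assert(delimiters != nullptr);
    std::vector<const char*> tokens;
    // &buffer[0] yields a writable pointer in C++11 and later.
    char* token = std::strtok(&buffer[0], delimiters);
    while (token) {
        tokens.push_back(token);
        // Passing nullptr continues scanning the same buffer.
        token = std::strtok(nullptr, delimiters);
    }
    return tokens;
}
```

After calling this on a string holding "lazy dog", the string still has size 8, but the space at index 4 has become a null byte, so c_str() reads as just "lazy". Note that std::strtok keeps its position in hidden static state, so two tokenizations cannot be interleaved and the function is not safe to use from several threads at once.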

What about splitting the string? You can check this post for more information.

That post contains a detailed answer about how to split a string into tokens. In particular, you could look at the second approach in that answer, which uses iterators and the copy algorithm.
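Since the linked post is not reproduced here, the following is only a guess at what "iterators and the copy algorithm" refers to: the common idiom of copying from a std::istream_iterator into a container. The names split_with_copy and check_split_with_copy are hypothetical. Note that this still copies the string into the stream and copies every word, so it is a matter of convenience and does not address the efficiency concern in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (an assumption about the technique the linked post uses).
// Splits `text` on whitespace by copying from a stream iterator into a vector.
std::vector<std::string> split_with_copy(const std::string& text) {
    std::istringstream stream(text);  // copies `text` into the stream
    std::vector<std::string> tokens;
    // A default-constructed istream_iterator acts as the end-of-stream marker.
    std::copy(std::istream_iterator<std::string>(stream),
              std::istream_iterator<std::string>(),
              std::back_inserter(tokens));
    return tokens;
}

// Small self-check showing the result for a sample input.
void check_split_with_copy() {
    const std::vector<std::string> tokens = split_with_copy("stack \tover\r\n  flow");
    assert(tokens.size() == 3);
    assert(tokens[0] == "stack" && tokens[1] == "over" && tokens[2] == "flow");
}
```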
