
Tokenize a C++ string with a regex, keeping special characters

I am trying to find the tokens in a string, which has words, numbers, and special chars. I tried the following code:

#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    regex reg("[\\s,.!\"]+");
    // Submatch -1 selects the text between matches, i.e. it splits on the regex.
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (const auto& a : vec) {
        cout << a << ":";
    }
    cout << endl;
}

And got the following output:

The:quick:brown:fox:99:named:quick_joe:

But I wanted the output:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

What regex should I use for that? I would like to stick to standard C++ if possible, i.e. I would prefer a solution that does not use Boost.

(See 43594465 for a Java version of this question, but now I am looking for a C++ solution. So essentially, the question is how to map Java's Matcher and Pattern to C++.)

You're asking to interleave non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:

sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;

This yields:

The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
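
To try the interleaving in isolation, here is a minimal sketch that wraps the iterator in a function. The name split_keep_delims is made up for illustration and is not part of the original answer; the regex is the one from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for this sketch only).
// Returns the text between delimiter runs (submatch -1) interleaved with
// the delimiter runs themselves (submatch 0), in order of appearance.
std::vector<std::string> split_keep_delims(const std::string& text) {
    static const std::regex delims("[\\s,.!\"]+");
    std::sregex_token_iterator it(text.begin(), text.end(), delims, {-1, 0}), end;
    return std::vector<std::string>(it, end);
}
```

Called with the string from the question, this should return the same sequence as printed above, whitespace included.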

Since you're looking to just drop whitespace, change the regex to consume surrounding whitespace, and add a capture group for the non-whitespace chars. Then, just specify submatch 1 in the iterator, instead of submatch 0:

regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;

Yields:

The:,:quick brown:.:fox:":99:":named quick_joe:!:

Splitting the spaces between adjoining words requires splitting on 'just spaces' too:

regex reg("\\s*\\s|([,.!\"]+)\\s*");

However, you'll end up with empty submatches:

The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:

Easy enough to drop those:

// copy_if needs <algorithm>; back_inserter needs <iterator>.
regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
// Keep only the non-empty tokens.
copy_if(iter, end, back_inserter(vec), [](const string& x) { return !x.empty(); });

Finally:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
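
Putting the steps of this answer together, a self-contained sketch might look like the following. The function name tokenize_by_split is hypothetical, and the sketch writes \s+ where the answer writes \s*\s; the two match exactly the same text.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for this sketch only).
std::vector<std::string> tokenize_by_split(const std::string& text) {
    // Whitespace runs are pure separators. Punctuation runs are captured in
    // group 1 so that they can be emitted as tokens of their own.
    static const std::regex sep("\\s+|([,.!\"]+)\\s*");
    // -1 yields the text between matches, 1 yields the captured punctuation
    // (an empty string whenever the whitespace alternative matched).
    std::sregex_token_iterator it(text.begin(), text.end(), sep, {-1, 1}), end;
    std::vector<std::string> tokens;
    std::copy_if(it, end, std::back_inserter(tokens),
                 [](const std::string& s) { return !s.empty(); });
    return tokens;
}
```

Note that [,.!\"]+ treats a run of adjacent punctuation characters as a single token, so an input such as "99". would produce ". as one token rather than two.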

If you want to use the approach from the related Java question, you can match the tokens directly here, too, instead of splitting on the delimiters.

regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);

Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

Note that this won't match Unicode letters, because \w (like \d and \s) is not Unicode-aware in std::regex.

Pattern details:

  • \d+ - 1 or more digits
  • | - or
  • [^\W\d]+ - 1 or more ASCII letters or _
  • | - or
  • [^\w\s] - 1 char other than an ASCII letter/digit, _ and whitespace.
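
As a self-contained sketch of this matching approach (the function name tokenize_by_match is hypothetical, chosen only for this example):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for this sketch only).
// Each regex match is one token: a digit run, a run of ASCII letters and
// underscores, or a single non-word, non-whitespace character.
std::vector<std::string> tokenize_by_match(const std::string& text) {
    static const std::regex token(R"(\d+|[^\W\d]+|[^\w\s])");
    // With no submatch argument the iterator yields submatch 0, i.e. the
    // whole match, so whitespace between tokens is simply skipped.
    std::sregex_token_iterator it(text.begin(), text.end(), token), end;
    return std::vector<std::string>(it, end);
}
```

Unlike the splitting approach, each punctuation character becomes its own token here, because [^\w\s] matches exactly one character.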
