Tokenizing a C++ string using a regular expression with special characters

Question

I am trying to find the tokens in a string that contains words, numbers, and special characters. I tried the following code:

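#include <iostream>
#include <regex>
#include <string>
#include <vector>  // needed for std::vector
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    // Delimiters: runs of whitespace, comma, period, exclamation mark, or double quote
    regex reg("[\\s,.!\"]+");
    // Submatch -1 selects the text between delimiter matches
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (const auto& a : vec) {
        cout << a << ":";
    }
    cout << endl;
}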

And got the following output:

The:quick:brown:fox:99:named:quick_joe:

But I wanted this output:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

What regex should I use for that? I would like to stick to standard C++ if possible, i.e. I would prefer a solution that does not use Boost.

(See 43594465 for a Java version of this question, but now I am looking for a C++ solution. So essentially, the question is how to map Java's Matcher and Pattern to C++.)

Answer 1

You're asking to interleave the non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different:

sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;

This yields:

The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
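As a self-contained sketch of that interleaving, the snippet above can be wrapped in a function. The name `split_keep_delims` is made up for illustration; the regex and the `{-1, 0}` submatch list are the ones from the question and this answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original answer): returns the text
// between delimiter runs (submatch -1) interleaved with the delimiter runs
// themselves (submatch 0).
std::vector<std::string> split_keep_delims(const std::string& text) {
    // The regex must outlive the iterator, so keep it as a named object.
    static const std::regex delims("[\\s,.!\"]+");
    std::sregex_token_iterator first(text.begin(), text.end(), delims, {-1, 0});
    std::sregex_token_iterator last;  // default-constructed end-of-sequence iterator
    return std::vector<std::string>(first, last);
}
```

Called on the string from the question, this should return the same tokens as the output above, with the whitespace still attached to the punctuation tokens.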

Since you just want to drop the whitespace, change the regex to consume the surrounding whitespace and add a capture group for the non-whitespace characters. Then specify submatch 1 in the iterator instead of submatch 0:

regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;

This yields:

The:,:quick brown:.:fox:":99:":named quick_joe:!:
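The same step as a self-contained sketch. The function name `split_keep_punct` is an assumption for illustration; the pattern and the `{-1, 1}` submatch list come from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: the regex swallows whitespace around punctuation,
// and capture group 1 holds only the punctuation. Asking for {-1, 1}
// interleaves the unmatched text with that group.
std::vector<std::string> split_keep_punct(const std::string& text) {
    static const std::regex sep("\\s*([,.!\"]+)\\s*");
    std::sregex_token_iterator first(text.begin(), text.end(), sep, {-1, 1});
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Whitespace that is not next to punctuation is never matched by this pattern, which is why "quick brown" and "named quick_joe" stay together as single tokens.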

Splitting adjoining words that are separated only by whitespace requires the regex to match plain whitespace as well:

regex reg("\\s*\\s|([,.!\"]+)\\s*");

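However, you'll end up with empty submatches, because group 1 is empty whenever the whitespace-only alternative matches, and the text between two adjacent matches is empty too: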

The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:

It is easy enough to drop those:

regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
// copy_if needs <algorithm>; back_inserter needs <iterator>
copy_if(iter, end, back_inserter(vec), [](const string& x) { return !x.empty(); });

Finally:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
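Putting the pieces together, a minimal self-contained sketch might look like this. The function name `tokenize` is hypothetical; the pattern, the `{-1, 1}` submatch list, and the `copy_if` filter are taken from the steps above.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits on whitespace and on punctuation, keeps the
// punctuation (capture group 1) as tokens, and drops the empty strings
// produced by the whitespace-only alternative and by adjacent matches.
std::vector<std::string> tokenize(const std::string& text) {
    static const std::regex sep("\\s*\\s|([,.!\"]+)\\s*");
    std::sregex_token_iterator first(text.begin(), text.end(), sep, {-1, 1});
    std::sregex_token_iterator last;
    std::vector<std::string> tokens;
    std::copy_if(first, last, std::back_inserter(tokens),
                 [](const std::string& t) { return !t.empty(); });
    return tokens;
}
```

Calling `tokenize` on the string from the question should give the twelve tokens shown in the final output above. Note that `[,.!\"]+` groups consecutive punctuation into one token, so an input such as `"!!"` would come back as a single `!!` token.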

Answer 2

If you want to use the approach from the related Java question, just use a matching approach here, too:

regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);

Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

Note that this won't match Unicode letters, because \w (and \d and \s, too) is not Unicode-aware in std::regex.

Pattern details:

\d+ - 1 or more digits
| - or
[^\W\d]+ - 1 or more ASCII letters or _
| - or
[^\w\s] - 1 char other than an ASCII letter/digit, _, and whitespace.
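Since the question asks how Java's Matcher and Pattern map to C++, here is a rough sketch of the same matching approach written as an explicit loop. The function name `find_all_tokens` is made up for illustration; `std::regex` plays the role of `Pattern`, and stepping a `std::sregex_iterator` is roughly analogous to calling `matcher.find()` repeatedly.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of the token pattern,
// similar in spirit to Java's `while (matcher.find()) { matcher.group(); }`.
std::vector<std::string> find_all_tokens(const std::string& text) {
    static const std::regex token(R"(\d+|[^\W\d]+|[^\w\s])");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), token), last;
         it != last; ++it) {
        out.push_back(it->str());  // whole match, like matcher.group()
    }
    return out;
}
```

Unlike the splitting approach, each punctuation character becomes its own token here, because `[^\w\s]` matches exactly one character.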

Answer 1: 3, accepted, 2017-04-26 07:47:43
Answer 2: 1, 2017-04-26 07:47:13