
C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a lexer/tokenizing engine that uses regex as a backend. The lexer accepts rules that define the token types/IDs, e.g.

<identifier> = "\\b\\w+\\b"

As I envision it, to do regex-based tokenizing, each rule's regex is enclosed in its own capturing group, and all the groups are joined with ORs.
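The combination step described above could be sketched as follows. The helper name combine_rules is hypothetical, and the sketch assumes that no rule contains capturing groups of its own (a rule would need non-capturing (?:...) groups internally), since nested captures would shift the group numbers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: wraps every rule pattern in its own capturing group
// and joins the groups with '|', so group N+1 corresponds to rule N.
// Assumes the rule patterns contain no capturing groups themselves.
std::string combine_rules(const std::vector<std::string>& patterns) {
    std::string combined;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (i != 0) {
            combined += '|';
        }
        combined += '(' + patterns[i] + ')';
    }
    return combined;
}
```

The resulting string can be passed directly to the std::regex constructor.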

When matching runs, every match we produce must carry the index of the capturing group it matched. We use these IDs to map the matches to token types.
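As an illustration of that mapping, a zero-based group index can simply index a table of token types kept in the same order as the rules. The TokenType enum and the function name below are hypothetical, not part of any library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token types, listed in the same order as the rules.
enum class TokenType { Identifier, Number, Unknown };

// Maps a zero-based capture group index to its token type.
// Returns Unknown when the index does not correspond to any rule.
TokenType token_type_for_group(std::size_t group_index,
                               const std::vector<TokenType>& rule_types) {
    return group_index < rule_types.size() ? rule_types[group_index]
                                           : TokenType::Unknown;
}
```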

So the question is: how do I get the ID of the group?

There is a similar question here, but it does not provide a solution to my specific problem.

Exactly my problem is described here, but it's in JS, and I need a C/C++ solution.

So let's say I've got a regex made up of capturing groups separated by an OR:

(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)

which matches whole numbers or alphabetic words.

My problem requires knowing the index of the capture group that each regex submatch matched, e.g. when matching the string

foo bar 123

3 iterations will be done. The group indexes of the matches would be 0 0 1, because the first two matches matched the first capturing group and the last match matched the second capturing group.

I know that with the standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).

I don't have much knowledge of the boost::regex or PCRE regex libraries.

What is the best way to accomplish this task? Which library and method should I use?

You may use std::sregex_iterator to get all matches. For each match, inspect the std::match_results object and take the index of the one group that is not empty, minus 1 (only the group that actually matched will be non-empty):

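#include <cstddef>
#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Two alternatives, each in its own capturing group: words | whole numbers.
    std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
    std::string s = "foo bar 123";
    for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
                             i != std::sregex_iterator();
                             ++i)
    {
        std::smatch m = *i;
        std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';

        // Group 0 is the whole match, so scan from 1 for the first non-empty group.
        for(std::size_t index = 1; index < m.size(); ++index ){
            if (!m[index].str().empty()) {
                std::cout << "Capture group ID: " << index-1 << '\n';
                break;
            }
        }
    }
}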

See the C++ demo. Output:

Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1

Note that R"(...)" is a raw string literal, so there is no need to double the backslashes inside it.

Also, index starts at 1 in the for loop because group 0 is the whole match; since you want the group IDs to be zero-based, 1 is subtracted when printing.
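One caveat: the emptiness test in the loop above cannot tell a group that did not participate in the match from a group that matched an empty string. If a rule could ever match the empty string, checking the matched flag of std::sub_match is a safer test. Below is a minimal sketch of that variant, wrapped in a reusable function; the Token struct and the tokenize name are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token record: zero-based rule (capture group) index plus text.
struct Token {
    std::size_t rule;
    std::string text;
};

// Collects every match of `re` in `input` together with the zero-based index
// of the first capture group that participated in the match.
std::vector<Token> tokenize(const std::string& input, const std::regex& re) {
    std::vector<Token> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Group 0 is the whole match, so the rule groups start at 1.
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                tokens.push_back({g - 1, m[g].str()});
                break;
            }
        }
    }
    return tokens;
}
```

Called with the regex from the answer and the input "foo bar 123", this yields three tokens with rule indexes 0, 0 and 1.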
