
How to capture repeated patterns using regex tokenization in C++

My problem is with respect to regex tokenization in C++.

I have the following regex pattern:

const regex PRx("^ TxD = <(@&[^(@&)]{1,32}@&){2,}>;");

And the following string objects which I'm reading from a file (there can be many of those strings):

 TxD = <@&Mag@&@&Hael@&@&Io12n@&>;

 TxD = <@&Atx@&@&Depoc@&@&Lsadiz@&@&gfhg@&@&kdkdj@&>;

Note that a space exists at the beginning of each string (as shown in the regex PRx after the anchor ^).

The following code is responsible for parsing the strings above:

vector<DFG> IP;   // DFG is a class type

vector<int> MIS;  // submatch indices to report: capture group 1 only
MIS.push_back(1);

const sregex_token_iterator Endx;  // default-constructed end-of-sequence iterator

for (sregex_token_iterator IPF(DOC.begin(), DOC.end(), PRx, MIS); IPF != Endx; ++IPF)
{
    const string SIN = IPF->str();
    IP.push_back(DFG(SIN));  /* The constructor of DFG is responsible for pushing SIN to a
                                vector data member object of string type */
}

As shown in the regex pattern PRx, it attempts to capture all patterns that are enclosed between the delimiter "@&"; however, the problem is that it is capturing only the last matched pattern. For example, in the first string, it would report only "@&Io12n@&", and in the second string, it reports only "@&kdkdj@&".

The expected output from the first string is (for illustration purpose):

@&Mag@&
@&Hael@&
@&Io12n@&

And from the second string is (for illustration purpose):

@&Atx@&
@&Depoc@&
@&Lsadiz@&
@&gfhg@&
@&kdkdj@&

(Note that the output shown above is not meant to be displayed; rather, each pattern found should be stored separately, as its own SIN string.)

It only works if I remove the patterns "^ TxD = <" and ">;" and the range check "{2,}" from PRx, and I don't want to do that. I'm not sure why it's failing to capture all the patterns. Could you please share your thoughts on the matter?

Thank you!

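If your regex engine supports \G and subroutine calls such as (?1), as PCRE does, you could use this pattern (note that std::regex with its default ECMAScript grammar supports neither):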

(?:^\sTxD\s=\s<(?=(?1){2,}>;$)|\G)(@&[^@&]{1,32}@&)(?=(?1)|>;$)



Based on your comment below, I think you need to use two different regex patterns. The first one filters your data based on the criteria "^ TxD", "{2,}", "{1,32}" and ">;", like so:

^\sTxD\s=\s(?=<((@&[^@&]{1,32}@&){2,})>;$)

Then run a second, simpler pattern on capture group 1 of each match to pull out the individual tokens.
