Tokenize a String and Keep Delimiters Using Regular Expression in C++

Question

I would like to modify the given regular expression to produce the following list of matches. I am having a hard time describing the problem in words.

I want to use a regular expression to match a set of 'tokens'. Specifically, I want &&, ||, ;, ( and ) to be matched as tokens, and any run of text that does not contain those delimiters should also be a single match. The problem I am having is distinguishing between one pipe and two pipes: a single | should stay inside the surrounding token, while || should be a token of its own. How can I produce the desired matches? Thank you for your help!

The expression:

((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+)

Test String

a < b | c | d > e >> f && ((g) || h) ; i

Expected Matches

a < b | c | d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

Actual Matches

a < b 
|
 c 
|
 d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

I am trying to implement a custom tokenizer for a program in C++.

Example Code

std::vector<std::string> Parser::tokenizeInput(std::string s) {
    std::vector<std::string> returnTokens;

    //tokenize correctly using this regex
    std::regex rgx(R"S(((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+))S");

    std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), rgx );
    std::regex_iterator<std::string::iterator> rend;

    while (rit!=rend) {

        std::string tokenStr = rit->str();

        // trim first so that whitespace-only matches become empty
        boost::algorithm::trim(tokenStr);

        if (!tokenStr.empty()) {
            // push only non-blank tokens
            returnTokens.push_back(tokenStr);
        }

        ++rit;
    }

    return returnTokens;
}

Example Driver Code

//in main
std::vector<std::string> testVec = Parser::tokenizeInput(inputWithNoComments);
std::cout << "input string: " << inputWithNoComments << std::endl;
std::cout << "tokenized string[";
for(unsigned int i = 0; i < testVec.size(); i++){
    std::cout << testVec[i];
    if ( i + 1 < testVec.size() ) { std::cout << ", "; }
}
std::cout << "]" << std::endl;

Produced Output

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l, grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f, g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo, bar, ||, foo, ||, bar, foo, bar]

What I Want the Output to be

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l | grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f | g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo | bar, ||, foo, ||, bar | foo | bar]

Answer 1

I suggest a splitting approach: pass {-1, 0} to std::sregex_token_iterator to collect both the non-matched substrings (-1) and the matched delimiters (0), and use a much simpler regex such as &&|\|\||[();]. Discard the empty substrings, which appear whenever two matches are adjacent or a match sits at the very start of the string:

std::regex rx(R"(&&|\|\||[();])");
std::string exp = "a < b | c | d > e >> f && ((g) || h) ; i";
std::sregex_token_iterator srti(exp.begin(), exp.end(), rx, {-1, 0});
std::vector<std::string> tokens;
std::remove_copy_if(srti, std::sregex_token_iterator(), 
                std::back_inserter(tokens),
                [](std::string const &s) { return s.empty(); });
for( auto & p : tokens ) std::cout <<"'"<< p <<"'"<< std::endl;

Output of the C++ demo:

'a < b | c | d > e >> f '
'&&'
' '
'('
'('
'g'
')'
' '
'||'
' h'
')'
' '
';'
' i'

Special credit for the empty string removal code goes to Jerry Coffin.
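To connect this back to the question's tokenizeInput, here is a minimal sketch of the same splitting approach wrapped in a function that also trims each token and drops blank ones. The free function and the trimCopy helper are hypothetical names introduced for illustration (the question uses a Parser member and boost::algorithm::trim); only standard library calls are used.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns s without leading/trailing spaces or tabs.
static std::string trimCopy(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Splits on &&, ||, ;, ( and ) while keeping those delimiters as tokens.
// A single | is not a delimiter, so it stays inside the surrounding token.
std::vector<std::string> tokenizeInput(const std::string& s) {
    static const std::regex rx(R"(&&|\|\||[();])");
    std::vector<std::string> tokens;
    // -1 yields the text between matches, 0 yields the matched delimiter.
    std::sregex_token_iterator it(s.begin(), s.end(), rx, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        std::string tok = trimCopy(it->str());
        if (!tok.empty()) tokens.push_back(tok);  // skip blank pieces
    }
    return tokens;
}
```

Calling tokenizeInput on the three input strings from the question should give the token lists shown under "What I Want the Output to be".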

Answer 2

You haven't specified which language you're using, but most application languages whose regex engine supports lookbehind can split a string on this regex:

" *((?=&&|\|\||[;()])|(?<=&&|\|\|)|(?<=[;()])) *"

The regex matches at a lookahead or lookbehind for your delimiters. Because lookarounds consume no input, the delimiters themselves are not removed by the split and end up in the result array; only the surrounding spaces are consumed.

If you're using Python, things are much simpler; split on this regex:

" *(&&|\|\||[;()]) *"

Whatever the delimiter group captures becomes part of the output array.
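Since the question is about C++, note that the default ECMAScript grammar of std::regex has no lookbehind, so the first pattern cannot be used there as written. The capture-group idea can still be sketched in C++ by asking std::sregex_token_iterator for {-1, 1}: the text between matches plus capture group 1. The function name splitKeepingDelimiters is made up for this example, and the sketch does not trim spaces at the very start or end of the input.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits s on the delimiters and keeps whatever group 1 captured.
std::vector<std::string> splitKeepingDelimiters(const std::string& s) {
    // Spaces around a delimiter are matched (and so removed) but not captured.
    static const std::regex rx(R"( *(&&|\|\||[;()]) *)");
    std::vector<std::string> parts;
    // -1 yields the text between matches, 1 yields the captured delimiter.
    std::sregex_token_iterator it(s.begin(), s.end(), rx, {-1, 1});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() > 0) parts.push_back(it->str());  // drop empty pieces
    }
    return parts;
}
```

Because the optional spaces are part of the match, the pieces between delimiters come out already stripped of the spaces next to each delimiter.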

Answer 3

I have prepared the following regex and tested it; it produces exactly the output you described for your input string:

(?<=&&)[^;()]*|\(|\)|(?<=\|\|)[^;()]*|;|&&|\|\||([^|;()&]+(\|[^|;()&]+)*)*

or this one:

\(|\)|;|&&|\|\||([^|;()&]+(&[^|;()&]+|\|[^|;()&]+)*)

Let me know if it works as expected!

Matches:

a < b | c | d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

and tested on:

(cat file > outFile) || ( ls -l | grep -i )
(cat file >> outFile) && ls -l | grep -i
((file < file) || ls -l ; ls)
cat < InputFile | tr a-z A-Z | tee out1 > out2 >> out3 | asd aasdasd  | asd | asd || asd | asd
a | b || c | d && a || b && d ; g && 
a && b || c > d >> e < f | g
a < b | c | d > e >> f && ((g) || h) ; i
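The second regex uses no lookbehind, so it should be usable with std::regex directly. Below is a minimal sketch of how it might be plugged into a std::sregex_iterator loop like the one in the question; the function name matchTokens and the inline space trimming are assumptions made for this example. One caveat: characters the pattern cannot match (for instance a lone | right before a delimiter) are silently skipped by the iterator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every match of the second regex, trimmed of surrounding spaces.
std::vector<std::string> matchTokens(const std::string& s) {
    // Delimiters first; otherwise a run of ordinary text that may contain
    // a single & or | as long as more ordinary text follows it.
    static const std::regex rx(
        R"rx(\(|\)|;|&&|\|\||([^|;()&]+(&[^|;()&]+|\|[^|;()&]+)*))rx");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
        const std::string tok = it->str();
        const auto first = tok.find_first_not_of(' ');
        if (first == std::string::npos) continue;  // whitespace-only match
        const auto last = tok.find_last_not_of(' ');
        tokens.push_back(tok.substr(first, last - first + 1));
    }
    return tokens;
}
```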

3 answers

solution1
2 ACCEPTED 2017-12-05 08:16:01

solution2
1 2017-12-05 06:07:48

solution3
0 2017-12-05 06:42:27
