Regular expression group matching using Boost::regex

Question

I have strings of format:

7XXXX 8YYYY 9ZZZZ 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL ,

where the 7XXXX 8YYYY 9ZZZZ 0LLLL group can repeat any number of times;
X, Y, Z, L are digits;
the items starting with 7, 8, 9, 0 always appear in that order;
items can be missing, as in 7XXXX 0LLLL 8YYYY 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL

I am trying to accomplish this using the Boost.Regex library.

I want to split the string into these groups and store them in an array or vector. For now I am just printing them with cout.

This is my attempt so far, but I can only get the full string match or the last match for each of the 7, 8, 9, 0 items, not substrings like 7XXXX 8YYYY 9ZZZZ 0LLLL:

 const char* pat = "(([[:space:]]+7[0-9]{4}){0,1}([[:space:]]+8[0-9]{4}){0,1}([[:space:]]+9[0-9]{4}){0,1}([[:space:]]+0[0-9]{4}){0,1})+";
 boost::regex reg(pat);
 boost::smatch match;
 string example= "71122 85451 75415 01102 75555 82133 91341 02134";

 const int subgroups[] = {0,1,2,3,4,5,6};
 boost::sregex_token_iterator i(example.begin(), example.end(), reg, subgroups);
 boost::sregex_token_iterator j;

 while (i != j)
 {
   cout << "Match: " << *i++ << endl;
 }

Sample output:

Match: 71122 85451 75415 01102 75555 82133 91341 02134
<A bunch of empty "Match:" rows>
Match: 75555
Match: 82133
Match: 91341
Match: 02134
<A bunch of empty "Match:" rows>

But I want to get it like this:

71122 85451 
75415 01102 
75555 82133 91341 02134

I know I am doing it wrong, but I can't come up with a regex that does what I want. Why can't I get all the repeated matches using parentheses?

Answer 1

EDIT: Since I completely misunderstood the first time around, I'll just replace the whole answer. I'm thinking along these lines:

const char* pat = "[[:space:]]+((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;

//                    v-- extra space here to make the match easier.
std::string example= " 71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  std::cout << "Match: " << *i++ << std::endl;
}

If the string cannot be modified, a workaround for the problem of empty matches is:

const char* pat = "((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;
std::string example= "71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  if(i->length() != 0) {
    std::cout << "Match: " << *i << std::endl;
  }

  ++i;
}

Although in that case it'd arguably be nicer to use regex_iterator instead of regex_token_iterator:

// No need for the outer parentheses anymore
const char* pat = "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?";
boost::regex reg(pat);

boost::sregex_iterator i(example.begin(), example.end(), reg);
boost::sregex_iterator j;

// The loop stays the same, except that *i is now a boost::smatch,
// so print i->str() instead of *i.
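
For reference, the regex_iterator idea can be written out in full. The following is a minimal sketch, not the answer's exact code: it assumes std::regex as a stand-in for Boost.Regex so that it builds without Boost, and the function name split_groups is hypothetical. A capture group inside a repeated group only keeps its last iteration, which is why the pattern in the question reported just 75555, 82133, 91341 and 02134; here each regex match covers exactly one 7/8/9/0 run instead. The whitespace is consumed after each item rather than before it, an assumption made so that a run may start with any of the four lead digits. The input is assumed to be well-formed (five-character items separated by whitespace).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns each run of 7/8/9/0 items (in that order,
// any of them optional) as one space-joined string.
std::vector<std::string> split_groups(const std::string& text) {
    // One regex match corresponds to one group. Every item is optional,
    // so the pattern can also match the empty string; those matches are
    // filtered out below.
    static const std::regex group_re(
        R"((?:(7[0-9]{4})\s*)?(?:(8[0-9]{4})\s*)?(?:(9[0-9]{4})\s*)?(?:(0[0-9]{4})\s*)?)");

    std::vector<std::string> groups;
    for (std::sregex_iterator it(text.begin(), text.end(), group_re), end;
         it != end; ++it) {
        // Rebuild the group from the sub-matches that took part in the
        // match, so no leading or trailing whitespace ends up in the result.
        std::string group;
        for (std::size_t k = 1; k < it->size(); ++k) {
            if (!(*it)[k].matched) continue;
            if (!group.empty()) group += ' ';
            group += (*it)[k].str();
        }
        if (!group.empty()) groups.push_back(group);  // skip empty matches
    }
    return groups;
}
```

Calling split_groups("71122 85451 75415 01102 75555 82133 91341 02134") should yield the three strings from the question's desired output: "71122 85451", "75415 01102" and "75555 82133 91341 02134".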

Answer 2

I think I'd hand-roll a parser here. In the interest of agility, how about parsing with Spirit?

It parses directly into sequence vectors.
There is no problem dealing with whitespace.
The grammar is described declaratively, in a syntax that somewhat resembles regular expressions but is much more tightly integrated with the C++ language.
It expresses intent quite clearly: a sequence is any combination of items in the expected order, as long as the result has at least one item:
```
 seq_ = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0'); 
```
where item_ parses any integer that starts with the indicated digit:
```
 item_ = &char_(_r1) >> uint_; 
```
In the parser we parse any number of sequences with *seq_, which is why we added a check that each matched sequence is not empty (otherwise we could get an infinite loop matching empty sequences at the same input location):
```
 eps(phx::size(_val) > 0) // require 1 element at least 
```
Note how debugging is built in (enable it by uncommenting the first line).
Note how it would be trivial to exclude the leading digits from the result by omitting the lead character:
```
 item_ = omit[char_(_r1)] >> uint_; 
```

Test program output:

Parsing: 71122 85451 75415 01102 75555 82133 91341 02134
Parsed: 3 sequences

seq:    71122 85451 
seq:    75415 1102 
seq:    75555 82133 91341 2134

Full test program:

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

using data = std::vector<std::vector<unsigned> >;

template <typename It, typename Skipper = qi::space_type> 
struct grammar : qi::grammar<It, data(), Skipper> {
    grammar() : grammar::base_type(start) {
        using namespace qi;

        start = *seq_;

        seq_  = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0')
              >> eps(phx::size(_val) > 0)
              ;

        item_ = &char_(_r1) >> uint_;

        BOOST_SPIRIT_DEBUG_NODES((start)(item_)(seq_))
    }

  private:
    qi::rule<It, unsigned(char), Skipper> item_;
    qi::rule<It, std::vector<unsigned>(), Skipper> seq_;
    qi::rule<It, data(), Skipper> start;
};

int main() { 

    for (std::string const input : {
            "71122 85451 75415 01102 75555 82133 91341 02134"
            })
    {
        using It = std::string::const_iterator;
        grammar<It> p;
        auto f(input.begin()), l(input.end());

        data parsed;
        bool ok = qi::phrase_parse(f,l,p,qi::space,parsed);

        std::cout << "Parsing: " << input << "\n";
        if (ok) {
            std::cout << "Parsed: " << parsed.size() << " sequences\n";
            for(auto& seq : parsed)
                std::copy(seq.begin(), seq.end(), std::ostream_iterator<unsigned>(std::cout << "\nseq:\t", " "));
            std::cout << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}
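
For comparison, a hand-rolled version without Spirit could look like the sketch below. It assumes whitespace-separated items and keeps them as strings, so the leading zero of items such as 01102 survives (uint_ prints them as 1102 in the output above). The names rank_of and group_items are hypothetical, and silently skipping items with an unexpected lead digit is an assumption, not something the question specifies.

```cpp
#include <sstream>
#include <string>
#include <vector>

using groups_t = std::vector<std::vector<std::string>>;

// Position of a lead digit in the expected 7, 8, 9, 0 order; -1 if unexpected.
int rank_of(char lead) {
    switch (lead) {
        case '7': return 0;
        case '8': return 1;
        case '9': return 2;
        case '0': return 3;
        default:  return -1;
    }
}

// Splits the input into groups: a new group starts whenever the lead digit
// does not come later in the 7, 8, 9, 0 order than the previous item's.
groups_t group_items(const std::string& text) {
    groups_t groups;
    int last_rank = 4;  // larger than any real rank, so the first item opens a group
    std::istringstream in(text);
    std::string item;
    while (in >> item) {
        const int rank = rank_of(item[0]);
        if (rank < 0) continue;  // assumption: ignore items with other lead digits
        if (rank <= last_rank) groups.emplace_back();
        groups.back().push_back(item);
        last_rank = rank;
    }
    return groups;
}
```

For the sample input this produces three groups, with the same boundaries as the Spirit output, and each inner vector can be printed or post-processed as needed.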
