
Regular expression group matching using Boost::regex

I have strings of the format:

7XXXX 8YYYY 9ZZZZ 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL ,

  • the 7XXXX 8YYYY 9ZZZZ 0LLLL group can repeat any number of times;
  • X, Y, Z, L are digits;
  • within a group, the items starting with 7, 8, 9, 0 always appear in that order;
  • items can be missing, as in 7XXXX 0LLLL 8YYYY 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL.

I am trying to accomplish this with the Boost.Regex library.

I want to split the string into these groups and store them in an array or vector. For now I am just printing them with cout.

This is my attempt, but I only get either the full string match or the last match for each of the 7, 8, 9, 0 subexpressions, not group strings like 7XXXX 8YYYY 9ZZZZ 0LLLL:

 const char* pat = "(([[:space:]]+7[0-9]{4}){0,1}([[:space:]]+8[0-9]{4}){0,1}([[:space:]]+9[0-9]{4}){0,1}([[:space:]]+0[0-9]{4}){0,1})+";
 boost::regex reg(pat);
 boost::smatch match;
 string example= "71122 85451 75415 01102 75555 82133 91341 02134";

 const int subgroups[] = {0,1,2,3,4,5,6};
 boost::sregex_token_iterator i(example.begin(), example.end(), reg, subgroups);
 boost::sregex_token_iterator j;

 while (i != j)
 {
   cout << "Match: " << *i++ << endl;
 }

Sample output:

Match: 71122 85451 75415 01102 75555 82133 91341 02134
<A bunch of empty "Match:" rows>
Match: 75555
Match: 82133
Match: 91341
Match: 02134
<A bunch of empty "Match:" rows>

But I want to get it like this:

71122 85451 
75415 01102 
75555 82133 91341 02134

I know I am doing it wrong, but I can't come up with a regex that does what I want :( Why can't I get all the recursive matches using parentheses?

EDIT: Since I completely misunderstood the first time around, I'll just replace the whole answer. I'm thinking along these lines:

const char* pat = "[[:space:]]+((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;

//                    v-- extra space here to make the match easier.
std::string example= " 71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  std::cout << "Match: " << *i++ << std::endl;
}

If the string cannot be modified, one way to work around the empty matches is to skip them:

const char* pat = "((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;
std::string example= "71122 85451 75415 01102 75555 82133 91341 02134";

boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;

while (i != j)
{
  if(i->length() != 0) {
    std::cout << "Match: " << *i << std::endl;
  }

  ++i;
}

Although in that case it would arguably be nicer to use regex_iterator instead of regex_token_iterator:

// No need for the outer capture group anymore
const char* pat = "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?";
boost::regex reg(pat);

boost::sregex_iterator i(example.begin(), example.end(), reg);
boost::sregex_iterator j;

// The loop is the same as above: skip matches where i->length() is 0.

I think I'd hand-roll a parser here. In the interest of agility, how about parsing with Spirit?

  • It parses directly into sequence vectors.
  • There is no problem dealing with whitespace.
  • The grammar is described declaratively, in a syntax that somewhat resembles regular expressions but is much more tightly integrated with the C++ language.
  • It expresses intent quite clearly: a sequence is any combination of items in the expected order - as long as the result has at least one item

     seq_ = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0'); 

    where item_ parses any integer that starts with the indicated digit:

     item_ = &char_(_r1) >> uint_; 

    In the parser we parse any number of sequences with *seq_, which is why we added a check that each matched sequence is not empty (otherwise we could get an infinite loop matching empty sequences at the same input location):

     eps(phx::size(_val) > 0) // require 1 element at least 
  • Note how debugging is built in (enable it by uncommenting the first line).

  • Note how it would be trivial to exclude the leading digits from the result by omitting the lead character:

     item_ = omit[char_(_r1)] >> uint_; 

Test program output:

Parsing: 71122 85451 75415 01102 75555 82133 91341 02134
Parsed: 3 sequences

seq:    71122 85451 
seq:    75415 1102 
seq:    75555 82133 91341 2134

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

using data = std::vector<std::vector<unsigned> >;

template <typename It, typename Skipper = qi::space_type> 
struct grammar : qi::grammar<It, data(), Skipper> {
    grammar() : grammar::base_type(start) {
        using namespace qi;

        start = *seq_;

        seq_  = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0')
              >> eps(phx::size(_val) > 0)
              ;

        item_ = &char_(_r1) >> uint_;

        BOOST_SPIRIT_DEBUG_NODES((start)(item_)(seq_))
    }

  private:
    qi::rule<It, unsigned(char), Skipper> item_;
    qi::rule<It, std::vector<unsigned>(), Skipper> seq_;
    qi::rule<It, data(), Skipper> start;
};

int main() { 

    for (std::string const input : {
            "71122 85451 75415 01102 75555 82133 91341 02134"
            })
    {
        using It = std::string::const_iterator;
        grammar<It> p;
        auto f(input.begin()), l(input.end());

        data parsed;
        bool ok = qi::phrase_parse(f,l,p,qi::space,parsed);

        std::cout << "Parsing: " << input << "\n";
        if (ok) {
            std::cout << "Parsed: " << parsed.size() << " sequences\n";
            for(auto& seq : parsed)
                std::copy(seq.begin(), seq.end(), std::ostream_iterator<unsigned>(std::cout << "\nseq:\t", " "));
            std::cout << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}
