
Boost regex_search reports a match that std::regex_search does not

When I use the boost::regex_search function like this:

std::string sTest = "18";
std::string sRegex = "^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$";
std::string::const_iterator iterStart = sTest.begin();
std::string::const_iterator iterEnd = sTest.end();
boost::match_results<std::string::const_iterator> RegexResults;

while (boost::regex_search(iterStart, iterEnd, RegexResults, boost::regex(sRegex)))
{
    int a = 1;
    break; 
}

However, the value of sTest is reported as a match. When I use std::regex_search instead, it behaves as expected.

Assuming the question is serious:

The regex matches, in order:

  1. ^ (start of input)

  2. What you apparently intended as one or more characters from the "CJK Unified Ideographs" block (though only the range from the Unicode 1.0.1 standard).

    However, this is not how it is parsed. Instead, it is parsed as regular hex escapes, which do match 1.

    The docs tell me that you might have wanted \\x{dddd}, but that requires Unicode support.

    Digging further into the docs tells me that:

    There are two ways to use Boost.Regex with Unicode strings:

    Rely on wchar_t

    (lists a bunch of limitations and conditions)

    Use a Unicode Aware Regular Expression Type.

    May I suggest the latter.

  3. one or more numerical digits (according to the locale's character classification)

  4. (#?) might have been intended as a comment ( (?#) ) but as spelled optionally matches a single # character

  5. followed by $ (end of input)

Your input has no such ideographs before the digits, so it shouldn't match.
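To illustrate what the pattern was presumably meant to do, here is a minimal sketch that sidesteps Boost and uses std::wregex from the standard library. It assumes wide strings are acceptable in your code. In the ECMAScript grammar, \uXXXX is a real code-point escape, and with wchar_t each end of the range fits in a single character. The helper name matches_cjk_then_digits is made up for this illustration and is not part of any library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from Boost or the standard library):
// returns true when `input` consists of one or more characters in the
// range U+4E00..U+9FA5, then one or more digits, then an optional '#'.
bool matches_cjk_then_digits(std::wstring const& input) {
    // With wchar_t, \u4E00 and \u9FA5 each denote one character, so the
    // bracket expression is a genuine range over the ideograph block.
    static std::wregex const re(L"^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$");
    return std::regex_match(input, re);
}
```

With this, I'd expect matches_cjk_then_digits(L"18") to be false, because nothing precedes the digits, while an input such as L"\u4E2D\u6587" L"18#" (two ideographs, then 18#) should be accepted.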

Off Topic?

Besides, since this is a fully anchored pattern ( ^$ ) it would only make sense with regex_match , not regex_search .

The while loop is a weird idea, because the input never changes, so neither will the search result. If there's a match, the loop always breaks. Your while amounts to a more confusing if statement.
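A minimal sketch of that simplification, written against std::regex so it compiles without Boost. The ideograph group is left out to keep the example ASCII-only, and parse_number_tag is a hypothetical name chosen for illustration.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: splits "<digits>[#]" into the digits and a flag
// that tells whether the optional trailing '#' was present.
std::optional<std::pair<std::string, bool>>
parse_number_tag(std::string const& input) {
    // regex_match only succeeds when the whole input matches, so the
    // pattern needs neither ^ nor $.
    static std::regex const re(R"((\d+)(#?))");
    std::smatch m;
    // One test, one branch: a plain `if` instead of a `while` that
    // always breaks on the first iteration.
    if (std::regex_match(input, m, re)) {
        return std::make_pair(m[1].str(), m[2].length() > 0);
    }
    return std::nullopt;
}
```

Note that (#?) here is an ordinary capture group: it captures a single # when present and an empty string otherwise, which is why the flag is derived from the length of the second group.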

The Fix

Here's a program that codes three switchable approaches with three different regexes.

The approaches are BOOST_SIMPLE, BOOST_UNICODE and STANDARD_LIB.

The first regex is yours, the second uses \\x{XXXX} escapes instead, and the third uses the named character class \\p{InCJK_Unified_Ideographs} .

The result is:

  • Boost + Unicode approach works correctly (except with your faulty regex)
  • Standard library (on my GCC 10 install) as well as Boost Simple appear to accept the named character class. I'm doubtful that it actually matches correctly, so I guess it will just fail to match.

Listing

Live On Coliru

#include <exception>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>

#ifdef STANDARD_LIB
  #include <regex>
  using match = std::smatch;
  using regex = std::regex;
  #define ctor regex
#elif defined(BOOST_SIMPLE)
  #include <boost/regex.hpp>
  using match = boost::smatch;
  using regex = boost::regex;
  #define ctor regex
#elif defined(BOOST_UNICODE)
  #include <boost/regex/icu.hpp>
  using match = boost::smatch;
  using regex = boost::u32regex;
  #define ctor        boost::make_u32regex
  #define regex_match boost::u32regex_match
#else
  #error "Need to pick a flavour"
#endif

int main() {
    std::string const sTest = "18";

    struct testcase { std::string_view label, re; };

    for (auto current : { testcase
        { "wrong unicode character class",
            R"(^([\u4E00-\u9FA5]+)(\d+)(#?)$)" },
        { "correct unicode character class",
            R"(^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$)" },
        { "named character class",
            R"(^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$)" },
    }) {
        std::cout
            << std::string(current.label.length(), '=') << "\n"
            << current.label << " (" << std::quoted(current.re) << ")\n";

        try {
            match results;
            bool is_match = regex_match(sTest, results, ctor(current.re.data()));

            std::cout << "is_match: " << std::boolalpha << is_match << "\n";
            if (is_match) {
                std::cout << "$1: " << std::quoted(results[1].str()) << "\n";
                std::cout << "$2: " << std::quoted(results[2].str()) << "\n";
                std::cout << "$3: " << std::quoted(results[3].str()) << "\n";
            }
        } catch(std::exception const& e) {
            std::cerr << "failure: " << e.what() << "\n";
        }
    }
}

Compiling with

g++ -DBOOST_SIMPLE  -O2 -std=c++17 main.cpp -lboost_regex -o boost_simple
g++ -DBOOST_UNICODE -O2 -std=c++17 main.cpp -lboost_regex -o boost_unicode -licuuc
g++ -DSTANDARD_LIB  -O2 -std=c++17 main.cpp -o standard_lib

Has outputs:

  • File boost_simple.log

     =============================
     wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
     is_match: true
     $1: "1"
     $2: "8"
     $3: ""
     ===============================
     correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
     failure: Hexadecimal escape sequence was invalid. The error occurred while parsing the regular expression fragment: '^([>>>HERE>>>\x{4E00}-\'.
     =====================
     named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
     is_match: false

  • File boost_unicode.log

     =============================
     wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
     is_match: true
     $1: "1"
     $2: "8"
     $3: ""
     ===============================
     correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
     is_match: false
     =====================
     named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
     is_match: false

  • File standard_lib.log

     =============================
     wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
     failure: Invalid range in bracket expression.
     ===============================
     correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
     failure: Unexpected end of regex when ascii character.
     =====================
     named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
     is_match: false
