Regular expression: Boost library regex_search matches incorrectly

Question

When I use the boost::regex_search function to do something like this:

std::string sTest = "18";
std::string sRegex = "^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$";
std::string::const_iterator iterStart = sTest.begin();
std::string::const_iterator iterEnd = sTest.end();
boost::match_results<std::string::const_iterator> RegexResults;

while (boost::regex_search(iterStart, iterEnd, RegexResults, boost::regex(sRegex)))
{
    int a = 1;
    break; 
}

the value of sTest is reported as a match, even though it should not be. When I use std::regex_search instead, it behaves correctly.

Answer 1

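Assuming the question is serious:

The regex matches

^ (start of input)
[\u4E00-\u9FA5]+ which looks like you intended it as one or more characters from the "CJK Unified Ideographs" block (although only as of the Unicode 1.0.1 standard).
However, that is not what gets parsed. (Instead, it does parse as matching 1 regular hex escape.)
The documentation tells me you probably wanted \x{dddd}, but that requires Unicode support.
Digging through more of the documentation tells me:

There are two ways to use Boost.Regex with Unicode strings:

Rely on wchar_t

(a bunch of limitations and conditions are listed)

Use a Unicode-aware regular expression type.

May I suggest the latter?
\d+ one or more digits (according to the locale's character classification)
(#?) was probably intended as a comment ( (?#) ), but as spelled it optionally matches a single # character
followed by $ (end of input)

This is not in your input, so it should not match.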
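To make the intended structure concrete, here is a minimal sketch that uses the standard library's std::wregex rather than Boost. That is my assumption for illustration, not the approach recommended above, and the helper name matches_intended_pattern is hypothetical. The \u escapes are C++ universal character names, so the compiler puts the real characters into the wide pattern and the regex engine never has to interpret a \u escape itself.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `input` consists of one or more characters
// in the range U+4E00..U+9FA5, then one or more digits, then an optional '#'.
// The \u sequences below are resolved by the compiler (universal character
// names), so the pattern contains the actual characters as range endpoints.
bool matches_intended_pattern(const std::wstring& input) {
    static const std::wregex re(L"^([\u4E00-\u9FA5]+)(\\d+)(#?)$");
    return std::regex_match(input, re);
}
```

Under these assumptions, L"18" is rejected because it has no leading ideographs, while an input such as L"\u4E2D\u658718#" (two ideographs, then 18#) is accepted.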

Off-topic?

Also, since this is a fully anchored pattern ( ^$ ), it only makes sense with regex_match, not regex_search.
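A small sketch of that distinction, using std::regex and an ASCII-only pattern to keep Unicode out of the picture. Both function names are made up for this illustration.

```cpp
#include <regex>
#include <string>

// regex_match succeeds only when the pattern consumes the entire input,
// so explicit ^ and $ anchors would add nothing here.
bool is_number_with_optional_hash(const std::string& s) {
    static const std::regex re(R"((\d+)(#?))");
    return std::regex_match(s, re);
}

// regex_search looks for the pattern anywhere inside the input.
bool contains_number(const std::string& s) {
    static const std::regex re(R"(\d+)");
    return std::regex_search(s, re);
}
```

With these definitions, "18" and "18#" pass the whole-input check, while "id 18" only passes the search.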

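The while loop is an odd idea: the input never changes, so the search result never changes either. And if there is a match, the loop always breaks. Your while is equivalent to a more confusing if statement.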
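If the goal had been to visit several matches, the loop would need to advance the start position after each hit. A minimal sketch with std::regex follows; the function name find_all_numbers is hypothetical, and the digit pattern is chosen only for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every run of digits in `text`. The start iterator is moved past
// each match, so the loop ends once nothing further can be found.
std::vector<std::string> find_all_numbers(const std::string& text) {
    static const std::regex re(R"(\d+)");
    std::vector<std::string> found;
    std::smatch m;
    auto start = text.cbegin();
    while (std::regex_search(start, text.cend(), m, re)) {
        found.push_back(m[0].str());
        start = m[0].second;  // continue right after the current match
    }
    return found;
}
```

For a single yes/no check, a plain if around one regex_match call says the same thing as the original loop with less noise.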

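Fix

Here is a program that implements three switchable approaches, using three different regular expressions.

The approaches are BOOST_SIMPLE, BOOST_UNICODE and STANDARD_LIB.

The first regex is yours, the second uses \x{XXXX} escapes, and the third uses the named character class \p{InCJK_Unified_Ideographs}.

The results are:

The Boost + Unicode approach works correctly (apart from your erroneous regex).
The standard library (in my GCC 10 installation) and Boost Simple both appear to accept the named character class. I doubt that it is actually matched correctly, so I guess it simply fails to match.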

Listing

Live On Coliru

#include <iostream>
#include <iomanip>
#include <string>
#include <string_view>

#ifdef STANDARD_LIB
  #include <regex>
  using match = std::smatch;
  using regex = std::regex;
  #define ctor regex
#elif defined(BOOST_SIMPLE)
  #include <boost/regex.hpp>
  using match = boost::smatch;
  using regex = boost::regex;
  #define ctor regex
#elif defined(BOOST_UNICODE)
  #include <boost/regex/icu.hpp>
  using match = boost::smatch;
  using regex = boost::u32regex;
  #define ctor        boost::make_u32regex
  #define regex_match boost::u32regex_match
#else
  #error "Need to pick a flavour"
#endif

int main() {
    std::string const sTest = "18";

    struct testcase { std::string_view label, re; };

    for (auto current : { testcase
        { "wrong unicode character class",
            R"(^([\u4E00-\u9FA5]+)(\d+)(#?)$)" },
        { "correct unicode character class",
            R"(^([\x{4E00}-\x{9FA5}]+)(\d+)(#?)$)" },
        { "named character class",
            R"(^([\p{InCJK_Unified_Ideographs}]+)(\d+)(#?)$)" },
    }) {
        std::cout
            << std::string(current.label.length(), '=') << "\n"
            << current.label << " (" << std::quoted(current.re) << ")\n";

        try {
            match results;
            bool is_match = regex_match(sTest, results, ctor(current.re.data()));

            std::cout << "is_match: " << std::boolalpha << is_match << "\n";
            if (is_match) {
                std::cout << "$1: " << std::quoted(results[1].str()) << "\n";
                std::cout << "$2: " << std::quoted(results[2].str()) << "\n";
                std::cout << "$3: " << std::quoted(results[3].str()) << "\n";
            }
        } catch(std::exception const& e) {
            std::cerr << "failure: " << e.what() << "\n";
        }
    }
}

Compile

g++ -DBOOST_SIMPLE  -O2 -std=c++17 main.cpp -lboost_regex -o boost_simple
g++ -DBOOST_UNICODE -O2 -std=c++17 main.cpp -lboost_regex -o boost_unicode -licuuc
g++ -DSTANDARD_LIB  -O2 -std=c++17 main.cpp -o Standard_lib

With output:

File boost_simple.log

=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
is_match: true
$1: "1"
$2: "8"
$3: ""
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
failure: Hexadecimal escape sequence was invalid. The error occurred while parsing the regular expression fragment: '^([>>>HERE>>>\x{4E00}-\'.
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false

File boost_unicode.log

=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
is_match: true
$1: "1"
$2: "8"
$3: ""
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
is_match: false
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false

File standard_lib.log

=============================
wrong unicode character class ("^([\\u4E00-\\u9FA5]+)(\\d+)(#?)$")
failure: Invalid range in bracket expression.
===============================
correct unicode character class ("^([\\x{4E00}-\\x{9FA5}]+)(\\d+)(#?)$")
failure: Unexpected end of regex when ascii character.
=====================
named character class ("^([\\p{InCJK_Unified_Ideographs}]+)(\\d+)(#?)$")
is_match: false

Answer 1: 2 (accepted), 2020-12-11 04:02:52