Why does std::regex_iterator cause a stack overflow with this data?

Question

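I've been using std::regex_iterator to parse log files. My program has been working well for several weeks and has parsed millions of log lines, until today, when I ran it against a log file and got a stack overflow. It turned out that a single log entry in the file was causing the problem. Does anyone know why my regex causes such massive recursion? Here is a small self-contained program that shows the issue (my compiler is VC2012):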

#include <string>
#include <regex>
#include <iostream>

using namespace std;

std::wstring test = L"L3  T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
                L"  Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
                L"  Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  Operation: 3\n"
                L"  Hive: HKEY_CURRENT_USER\n"
                L"  Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  ValueName: SortOrder\n"
                L"  ValueType: REG_DWORD\n"
                L"  ValueData: 0\n"
                L"L4  T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";

int wmain(int argc, wchar_t* argv[])
{
    static wregex rgx_log_lines(
        L"^L(\\d+)\\s+"             // Level
        L"T(\\d+)\\s+"              // TID
        L"(\\d+)\\s+"               // Timestamp
        L"\\[((?:\\w|\\:)+)\\]"     // Function name
        L"((?:"                     // Complex pattern
          L"(?!"                    // Stop matching when...
            L"^L\\d"                // New log statement at the beginning of a line
          L")"                      
          L"[^]"                    // Matching all until then
        L")*)"                      // 
        );

    try
    {
        for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
        {
            wcout << (*it)[1] << endl;
            wcout << (*it)[2] << endl;
            wcout << (*it)[3] << endl;
            wcout << (*it)[4] << endl;
            wcout << (*it)[5] << endl;
        }
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}

Answer 1

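A negative lookahead that is tested on every character seems like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (small bug; see below). (Another edit: these are regexes; if you want to write them as string literals, you need to change \ to \\.)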

 .*\n(?:(?:[^L]|L\D).*\n)*
 |   |  |
 +-1 |  +---------------3
     +---------------------2

In ECMAScript mode, . should not match \n, but you could always replace the two . in that expression with [^\n].

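Edited to add: I realize that this may not work if there is a blank line before the end of the log entry, but the following should cover that case; I also changed . to [^\n] for extra precision: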

 [^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
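As a rough illustration of how that last expression could be combined with the header fields from the question, here is a minimal sketch. The names `LogEntry` and `parse_log` are made up for this example. The sketch uses narrow strings instead of `std::wstring` for brevity, assumes every line ends with `\n`, and anchors each search with `std::regex_constants::match_continuous` rather than `^`, since how `^` behaves after a newline differs between standard library implementations. It only shows the line-based structure of the pattern and makes no promise about stack usage on any particular implementation.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// One parsed log entry: the four header fields plus the free-form remainder.
// (Hypothetical type, introduced only for this sketch.)
struct LogEntry {
    std::string level, tid, timestamp, function, body;
};

// Splits `log` into entries. Assumes `log` begins at the start of an entry
// and that every line, including the last one, ends with '\n'.
std::vector<LogEntry> parse_log(const std::string& log)
{
    static const std::regex entry_rgx(
        "L(\\d+)\\s+"                        // Level
        "T(\\d+)\\s+"                        // TID
        "(\\d+)\\s+"                         // Timestamp
        "\\[([\\w:]+)\\]"                    // Function name
        "("
          "[^\\n]*\\n"                       // (1) rest of the header line
          "(?:"                              // (2) any number of following lines...
            "(?:(?:[^L\\n]|L\\D)[^\\n]*)?"   // (3) ...empty, or not starting with L<digit>
            "\\n"
          ")*"
        ")");

    std::vector<LogEntry> entries;
    std::smatch m;
    auto pos = log.cbegin();
    // match_continuous forces each match to start exactly at `pos`.
    while (pos != log.cend() &&
           std::regex_search(pos, log.cend(), m, entry_rgx,
                             std::regex_constants::match_continuous)) {
        assert(m.length(0) > 0);  // the header is mandatory, so the loop always advances
        entries.push_back({m[1].str(), m[2].str(), m[3].str(), m[4].str(), m[5].str()});
        pos = m[0].second;        // continue right after the matched entry
    }
    return entries;
}
```

Calling `parse_log` on a string holding two entries returns two `LogEntry` values. Continuation lines, blank lines, and lines that begin with `L` followed by a non-digit all end up in the `body` of the first entry.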

Answer 2

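The regex appears to be OK; at least there is nothing in it that could cause catastrophic backtracking.

I see one small opportunity to optimize the regex and cut down on stack use (the function-name group becomes a single character class):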

static wregex rgx_log_lines(
    L"^L(\\d+)\\s+"             // Level
    L"T(\\d+)\\s+"              // TID
    L"(\\d+)\\s+"               // Timestamp
    L"\\[([\\w:]+)\\]"          // Function name
    L"((?:"                     // Complex pattern
      L"(?!"                    // Stop matching when...
        L"^L\\d"                // New log statement at the beginning of a line
      L")"                      
      L"[^]"                    // Matching all until then
    L")*)"                      // 
    );
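To make the one change in that pattern concrete, the sketch below compares the original function-name group, `(?:\w|\:)+`, with the character class `[\w:]+`. The helper `capture_name` and the constants `kAlternation` and `kCharClass` are hypothetical names used only here, and narrow strings are used for brevity. Both forms accept the same names. The character class simply avoids an alternation inside the repeated group, which is presumably where the reduced stack use would come from in a backtracking matcher.

```cpp
#include <regex>
#include <string>

// Original form: an alternation inside a repeated group.
const char* const kAlternation = "\\[((?:\\w|\\:)+)\\]";
// Rewritten form: a single character class.
const char* const kCharClass   = "\\[([\\w:]+)\\]";

// Returns the text captured by group 1 when `pattern` matches all of `name`,
// or an empty string when it does not match.
std::string capture_name(const std::string& pattern, const std::string& name)
{
    const std::regex rgx(pattern);
    std::smatch m;
    return std::regex_match(name, m, rgx) ? m[1].str() : std::string();
}
```

With either pattern, `capture_name(pattern, "[CEMRegistryValueAction::ClearRevertData]")` yields the name without the brackets.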

Did you set the ECMAScript option? Otherwise, I suspect the regex library defaults to POSIX regexes, and those don't support lookahead assertions.
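For reference, the C++ standard makes ECMAScript the default grammar of `std::basic_regex` when no flag is passed, and that grammar supports lookahead. A minimal sketch for checking this on a given toolchain follows. The helper names `is_continuation_line` and `compiles` are invented for this example, and narrow strings are used for brevity. What a POSIX grammar such as `std::regex::extended` does with `(?!` is left for the reader to try, since the sketch makes no assumption about it.

```cpp
#include <regex>
#include <string>

// True when `line` does NOT begin a new log statement. Uses a negative
// lookahead, compiled with the grammar passed in `grammar`.
bool is_continuation_line(const std::string& line,
                          std::regex::flag_type grammar = std::regex::ECMAScript)
{
    const std::regex rgx("(?!L\\d)[^\\n]*", grammar);
    return std::regex_match(line, rgx);
}

// True when `pattern` compiles under `grammar`; false when the library
// reports a syntax problem by throwing std::regex_error.
bool compiles(const std::string& pattern, std::regex::flag_type grammar)
{
    try {
        std::regex rgx(pattern, grammar);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Calling `compiles("(?!L\\d)", std::regex::extended)` shows how a particular library treats the lookahead syntax under a POSIX grammar.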


Answer 1: score 4, accepted, 2012-10-10 22:10:32
Answer 2: score 1, 2012-10-10 21:48:48
