Why does std::regex_iterator cause a stack overflow with this data?

I've been using std::regex_iterator to parse log files. My program has been working quite nicely for some weeks and has parsed millions of log lines, until today, when I ran it against a log file and got a stack overflow. It turned out that just one log line in the log file was causing the problem. Does anyone know why my regex is causing such massive recursion? Here's a small self-contained program which shows the issue (my compiler is VC2012):

#include <string>
#include <regex>
#include <iostream>

using namespace std;

std::wstring test = L"L3  T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
                L"  Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
                L"  Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  Operation: 3\n"
                L"  Hive: HKEY_CURRENT_USER\n"
                L"  Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  ValueName: SortOrder\n"
                L"  ValueType: REG_DWORD\n"
                L"  ValueData: 0\n"
                L"L4  T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";

int wmain(int argc, wchar_t* argv[])
{
    static wregex rgx_log_lines(
        L"^L(\\d+)\\s+"             // Level
        L"T(\\d+)\\s+"              // TID
        L"(\\d+)\\s+"               // Timestamp
        L"\\[((?:\\w|\\:)+)\\]"     // Function name
        L"((?:"                     // Complex pattern
          L"(?!"                    // Stop matching when...
            L"^L\\d"                // New log statement at the beginning of a line
          L")"                      
          L"[^]"                    // Matching all until then
        L")*)"                      // 
        );

    try
    {
        for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
        {
            wcout << (*it)[1] << endl;
            wcout << (*it)[2] << endl;
            wcout << (*it)[3] << endl;
            wcout << (*it)[4] << endl;
            wcout << (*it)[5] << endl;
        }
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}

Negative lookahead patterns which are tested on every character just seem like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (small bug; see below): (another edit: these are regexes; if you want to write them as string literals, you need to change \ to \\.)

 .*\n(?:(?:[^L]|L\D).*\n)*
 |   |  |
 +-1 |  +---------------3
     +---------------------2

In ECMAScript mode, . should not match \n, but you could always replace the two .s in that expression with [^\n].

Edited to add: I realize that this may not work if there is a blank line just before the end of the log entry, but this should cover that case; I changed . to [^\n] for extra precision:

 [^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
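As a rough sketch of how the line-oriented expression above could replace the per-character lookahead, the helper below combines the header fields from the question with that tail, with backslashes doubled for the string literal. The function name parse_first_entry is hypothetical, and narrow strings are assumed for brevity where the question uses std::wstring and wregex. It only extracts the first entry, and it has not been tested against VC2012.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): parses the first log entry
// in `text` and returns its five fields (level, TID, timestamp, function
// name, message), or an empty vector when no entry is found.
std::vector<std::string> parse_first_entry(const std::string& text)
{
    static const std::regex rgx_log_entry(
        "^L(\\d+)\\s+"                     // Level
        "T(\\d+)\\s+"                      // TID
        "(\\d+)\\s+"                       // Timestamp
        "\\[([\\w:]+)\\]"                  // Function name
        "("                                // Message:
          "[^\\n]*\\n"                     //   rest of the first line
          "(?:"                            //   then any number of lines that
            "(?:(?:[^L\\n]|L\\D)[^\\n]*)?" //   are blank or do not start with L<digit>
            "\\n"
          ")*"
        ")");

    std::vector<std::string> fields;
    std::smatch m;
    if (std::regex_search(text, m, rgx_log_entry))
    {
        // Capture 0 is the whole match; captures 1..5 are the fields.
        for (std::size_t i = 1; i < m.size(); ++i)
            fields.push_back(m[i].str());
    }
    return fields;
}
```

Each repetition of the outer group now consumes a whole line rather than a single character, so the number of iterations grows with the number of lines in an entry instead of its character count.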

The regex appears to be OK; at least there is nothing in it that could cause catastrophic backtracking.

I see a small possibility to optimize the regex, cutting down on stack use:

static wregex rgx_log_lines(
    L"^L(\\d+)\\s+"             // Level
    L"T(\\d+)\\s+"              // TID
    L"(\\d+)\\s+"               // Timestamp
    L"\\[([\\w:]+)\\]"          // Function name
    L"((?:"                     // Complex pattern
      L"(?!"                    // Stop matching when...
        L"^L\\d"                // New log statement at the beginning of a line
      L")"                      
      L"[^]"                    // Matching all until then
    L")*)"                      // 
    );

Did you set the ECMAScript option? Otherwise, I suspect the regex library defaults to POSIX regexes, and those don't support lookahead assertions.
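For anyone who wants to rule the grammar out, here is a minimal sketch of naming it explicitly and of checking what a regex built without flags reports. Both function names are hypothetical. As far as the C++ standard specifies, the basic_regex constructor's default flag is std::regex_constants::ECMAScript, so the explicit flag should be redundant on a conforming library.

```cpp
#include <regex>
#include <string>

// Hypothetical check: reports whether a regex constructed without any
// flags has the ECMAScript grammar bit set.
bool default_grammar_is_ecmascript()
{
    const std::regex rgx("L\\d");  // no flags passed
    return (rgx.flags() & std::regex_constants::ECMAScript)
           == std::regex_constants::ECMAScript;
}

// Hypothetical helper: true when `line` does not begin a new log statement.
// The grammar is named explicitly so the negative lookahead does not depend
// on the library's default.
bool is_continuation_line(const std::string& line)
{
    static const std::regex not_new_entry(
        "(?!L\\d)[^\\n]*",                  // must not start with L<digit>
        std::regex_constants::ECMAScript);
    return std::regex_match(line, not_new_entry);
}
```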
