Why does std::regex_iterator cause a stack overflow with this data?

Question

I've been using std::regex_iterator to parse log files. My program has been working quite nicely for some weeks and has parsed millions of log lines, until today, when I ran it against a log file and got a stack overflow. It turned out that just one log entry in the file was causing the problem. Does anyone know why my regex is causing such massive recursion? Here's a small self-contained program which shows the issue (my compiler is VC2012):

#include <string>
#include <regex>
#include <iostream>

using namespace std;

std::wstring test = L"L3  T15356 79726859 [CreateRegistryAction] Creating REGISTRY Action:\n"
                L"  Identity: 272A4FE2-A7EE-49B7-ABAF-7C57BEA0E081\n"
                L"  Description: Set Registry Value: \"SortOrder\" in Key HKEY_CURRENT_USER\\Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  Operation: 3\n"
                L"  Hive: HKEY_CURRENT_USER\n"
                L"  Key: Software\\Hummingbird\\PowerDOCS\\Core\\Plugins\\Fusion\\Settings\\DetailColumns\\LONEDOCS1\\Search Unsaved\\$AUTHOR.FULL_NAME;DOCSADM.PEOPLE.SYSTEM_ID\n"
                L"  ValueName: SortOrder\n"
                L"  ValueType: REG_DWORD\n"
                L"  ValueData: 0\n"
                L"L4  T15356 79726859 [CEMRegistryValueAction::ClearRevertData] [ENTER]\n";

int wmain(int argc, wchar_t* argv[])
{
    static wregex rgx_log_lines(
        L"^L(\\d+)\\s+"             // Level
        L"T(\\d+)\\s+"              // TID
        L"(\\d+)\\s+"               // Timestamp
        L"\\[((?:\\w|\\:)+)\\]"     // Function name
        L"((?:"                     // Complex pattern
          L"(?!"                    // Stop matching when...
            L"^L\\d"                // New log statement at the beginning of a line
          L")"                      
          L"[^]"                    // Matching all until then
        L")*)"                      // 
        );

    try
    {
        for (std::wsregex_iterator it(test.begin(), test.end(), rgx_log_lines), end; it != end; ++it)
        {
            wcout << (*it)[1] << endl;
            wcout << (*it)[2] << endl;
            wcout << (*it)[3] << endl;
            wcout << (*it)[4] << endl;
            wcout << (*it)[5] << endl;
        }
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
    }

    return 0;
}

Answer 1

Negative lookahead patterns which are tested on every character just seem like a bad idea to me, and what you're trying to do is not complicated. You want to match (1) the rest of the line and then (2) any number of following (3) lines which start with something other than L\d (small bug; see below). These are regexes; if you want to write them as string literals, you need to change \ to \\:

 .*\n(?:(?:[^L]|L\D).*\n)*
 |   |  |
 +-1 |  +---------------3
     +---------------------2

In ECMAScript mode, . should not match \n, but you could always replace the two .s in that expression with [^\n].
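
To check how . behaves on a given standard library, a small test along these lines may help. This is only a sketch: the two function names are made up for illustration, and the ECMAScript flag is passed explicitly so the grammar is not in question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if ".*" can span the whole of `text`
// when compiled with the ECMAScript grammar.
bool dot_star_matches_all(const std::wstring& text)
{
    static const std::wregex rgx(L".*", std::regex_constants::ECMAScript);
    return std::regex_match(text, rgx);
}

// Hypothetical helper: the same check with the explicit character class
// suggested as a replacement for ".".
bool non_newline_star_matches_all(const std::wstring& text)
{
    static const std::wregex rgx(L"[^\\n]*", std::regex_constants::ECMAScript);
    return std::regex_match(text, rgx);
}
```

If both functions reject a string containing \n and accept a single line, the two spellings are interchangeable for this purpose.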

Edited to add: I realize that this may not work if there is a blank line just before the end of the log entry, but this should cover that case; I changed . to [^\n] for extra precision:

 [^\n]*\n(?:(?:(?:[^L\n]|L\D)[^\n]*)?\n)*
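
Plugged into the header part of the pattern from the question, the expression above might be used roughly as follows. Treat this as a sketch under assumptions rather than a drop-in fix: the LogEntry struct and parse_log function are hypothetical names, the leading ^ anchor has been dropped because standard libraries differ on whether ^ matches after a newline in the middle of a search, and the input is assumed to end with \n.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical container for one parsed log entry.
struct LogEntry {
    std::wstring level, tid, timestamp, function, body;
};

// Hypothetical helper: splits a log buffer into entries. The body is
// matched line by line instead of testing a lookahead on every character.
std::vector<LogEntry> parse_log(const std::wstring& text)
{
    static const std::wregex rgx(
        L"L(\\d+)\\s+"                       // Level
        L"T(\\d+)\\s+"                       // TID
        L"(\\d+)\\s+"                        // Timestamp
        L"\\[([\\w:]+)\\]"                   // Function name
        L"("                                 // Body:
          L"[^\\n]*\\n"                      //   rest of the header line
          L"(?:"                             //   then any number of lines that
            L"(?:(?:[^L\\n]|L\\D)[^\\n]*)?"  //   are empty or do not start with L<digit>
            L"\\n"
          L")*"
        L")",
        std::regex_constants::ECMAScript);

    std::vector<LogEntry> entries;
    for (std::wsregex_iterator it(text.begin(), text.end(), rgx), end; it != end; ++it)
    {
        const std::wsmatch& m = *it;
        LogEntry e;
        e.level     = m[1].str();
        e.tid       = m[2].str();
        e.timestamp = m[3].str();
        e.function  = m[4].str();
        e.body      = m[5].str();
        entries.push_back(e);
    }
    return entries;
}
```

Because each entry consumes whole lines, the next search starts at the beginning of a line, which is what the ^ anchor was there to guarantee. The star now iterates once per line rather than once per character, which should reduce the recursion depth in implementations that recurse per iteration.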

Answer 2

The regex appears to be OK; at least there is nothing in it that could cause catastrophic backtracking.

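I see a small possibility to optimize the regex, cutting down on stack use: the function-name group can be a single character class instead of an alternation inside a repeated group: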

static wregex rgx_log_lines(
    L"^L(\\d+)\\s+"             // Level
    L"T(\\d+)\\s+"              // TID
    L"(\\d+)\\s+"               // Timestamp
    L"\\[([\\w:]+)\\]"          // Function name
    L"((?:"                     // Complex pattern
      L"(?!"                    // Stop matching when...
        L"^L\\d"                // New log statement at the beginning of a line
      L")"                      
      L"[^]"                    // Matching all until then
    L")*)"                      // 
    );

Did you set the ECMAScript option? If the regex library were defaulting to POSIX regexes, lookahead assertions would not be supported.
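
For reference, the grammar can be selected explicitly through the flags argument of the regex constructor. As far as I know the standard already makes ECMAScript the default for std::basic_regex, so this mostly serves to rule the grammar out as a cause. A minimal sketch, where is_continuation_line is a hypothetical name and [\s\S] is used as a portable "any character" class:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `line` does NOT begin a new log statement,
// i.e. it does not start with L followed by a digit.
bool is_continuation_line(const std::wstring& line)
{
    // The ECMAScript flag is passed explicitly; lookahead is part of
    // that grammar and is not available in the POSIX grammars.
    static const std::wregex rgx(L"(?!L\\d)[\\s\\S]*",
                                 std::regex_constants::ECMAScript);
    return std::regex_match(line, rgx);
}
```

If the library did not understand the lookahead under the selected grammar, constructing the regex would be expected to throw std::regex_error rather than silently mis-match.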

2 answers

solution1
4 ACCEPTED 2012-10-10 22:10:32

solution2
1 2012-10-10 21:48:48
