Regular Expression causing Stack Overflow

Further to my previous question, "ECMAScript Regex for a multilined string", I have implemented the following loading procedure:

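#include <fstream>
#include <future>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

void Load( const std::string& szFileName )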
{
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:.|\\n)*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
     static const std::regex regexData( "<([^>]+)>:([^<]*)\\n", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::ifstream inFile( szFileName );
     inFile.exceptions( std::ifstream::badbit );

     std::string szFileData( (std::istreambuf_iterator<char>(inFile)), (std::istreambuf_iterator<char>()) );

     inFile.close();

     std::vector<std::future<void>> vecFutures;

     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          if( (*itObject)[1] == "OBJECT1" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ), end; itData != end; ++itData )
                    {
                         // Do Stuff
                    }
               }, (*itObject)[2].str() ) );
          }
          else if( (*itObject)[1] == "OBJECT2" )
          {
               vecFutures.emplace_back( std::async( []( std::string szDataString ) {
                    for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ), end; itData != end; ++itData )
                    {
                         // Do Stuff
                    }
               }, (*itObject)[2].str() ) );
          }
     }

     for( auto& future : vecFutures )
     {
          future.get();
     }
}

However, loading this file results in a stack overflow (parameters: 0x00000001, 0x00332FE4):

=== OBJECT2 ===
<Name>:Test Manufacturer
<Supplier>:Test Supplier
<Address>:Test Multiline
Contact
Address
<Email>:test@test.co.uk
<Telephone Number>:0123456789
=== END OBJECT2 ===
=== OBJECT1 ===
<Number>:1
<Name>:Test
<Location>:Here
<Manufacturer>:
<Model Number>:12345
<Serial Number>:54321
<Owner>:Me
<IP Address>:0.0.0.0
=== END OBJECT1 ===

I have been unable to find the source of the stack overflow, but it looks like the outer std::sregex_iterator loop is responsible.

Thanks in advance!

Holy catastrophic backtracking. The culprit is (?:.|\\n)* . Whenever you see a construct like this, you know you're asking for trouble.

Why? Because you're telling the engine to match any character (except newline) OR newline, as many times as possible, or not at all. Let me walk you through it.

The engine will start as expected and match the === OBJECT2 === part without any major issues, a newline will be consumed, and then hell begins. The engine consumes EVERYTHING, all the way down to === END OBJECT1 === , and backtracks its way from there to a suitable match. Backtracking basically means going back one step and applying the regex again to see if it works, in effect trying all possible permutations with your string. In your case this will result in a few hundred thousand attempts. That's probably why you are running into problems.

I don't know if your code is any better or if it has any errors in it, but (?:.|\\n)* is the same as writing .* with the single-line modifier (dot matches newlines), or [\\S\\s]* . If you replace that construct with one of the two I have recommended, you will hopefully no longer see a stack overflow error.
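As a rough sketch of that replacement in the question's C++: the ECMAScript grammar used by std::regex has no single-line (dot matches newlines) flag, so [\\s\\S]* is the variant that applies there. The helper name CountObjects is made up for illustration, and the quantifier is still greedy, so this only removes the alternation; very large inputs may still recurse deeply in some implementations.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the "=== NAME === ... === END NAME ===" blocks in szFileData.
std::size_t CountObjects( const std::string& szFileData )
{
     // Same pattern as the question, with (?:.|\n)* replaced by [\s\S]*.
     static const std::regex regexObject( "=== ([^=]+) ===\\n([\\s\\S]*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::sregex_iterator itBegin( szFileData.cbegin(), szFileData.cend(), regexObject );
     std::sregex_iterator itEnd;

     // Each step of the iterator is one complete object block.
     return static_cast<std::size_t>( std::distance( itBegin, itEnd ) );
}
```

Calling CountObjects on the contents of the file from the question should find both objects, provided the implementation's regex engine copes with the input size.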

Edit: Check out the other solutions too. I did not really have time to go in depth and provide a solid solution to your problem besides explaining why it's so bad.

Here's another attempt:

=== ([^=]+) ===\n((?:(?!===)[^\n]+\n)+)=== END \1 ===

In your C++ it would obviously be written as:

=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===

It's made for minimal backtracking (at least when matching), although I'm a bit Mr. Tired-Face at the moment, so I probably missed quite a few ways to improve it.

It makes two assumptions, which are used to avoid a lot of backtracking (which possibly causes the stack overflow, as others have said):

  1. That there's never a === at the start of a line, except for the start/end marker lines.
  2. That C++ supports these regex features, specifically the negative lookahead (?!...). It should, considering it's the ECMAScript dialect.
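The second assumption is easy to check in isolation. Here is a minimal sketch, with a made-up helper name IsDataLine, that applies the lookahead to a single line using the default ECMAScript grammar of std::regex:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the line is non-empty and does NOT begin with "===".
bool IsDataLine( const std::string& szLine )
{
     // (?!===) is a negative lookahead: it consumes nothing and fails if "===" comes next.
     static const std::regex regexDataLine( "(?!===)[^\\n]+" );
     return std::regex_match( szLine, regexDataLine );
}
```

With this, a line such as <Name>:Test is accepted, while a marker line such as === END OBJECT2 === is rejected.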

Explained:

=== ([^=]+) ===\n

Match and capture the object start marker. The [^=] is one way to avoid a relatively small amount of backtracking here, same as yours. We're not using [^ ] , because I do not know if there may be spaces in the OBJECT id.

((?:

Start a capturing group for the data. Inside it is a non-capturing group, because we're going to match each line individually.

   (?!===)

Negative lookahead: we don't want === at the start of our captured line.

   [^\n]+\n

Matches one line individually.

)+)

Match at least one line between the start and end markers, then capture ALL the lines in a single group.

=== END \1 ===

Match the end marker.
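Putting the pieces together, the whole expression could be dropped into the outer loop roughly like this. It is only a sketch: the helper name FindObjects and its return type are invented for illustration, and it relies on the first assumption above (no data line starts with ===) as well as on every data line being non-empty.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (object name, raw data lines) for every block found.
std::vector<std::pair<std::string, std::string>> FindObjects( const std::string& szFileData )
{
     static const std::regex regexObject( "=== ([^=]+) ===\\n((?:(?!===)[^\\n]+\\n)+)=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );

     std::vector<std::pair<std::string, std::string>> vecObjects;
     for( std::sregex_iterator itObject( szFileData.cbegin(), szFileData.cend(), regexObject ), end; itObject != end; ++itObject )
     {
          // Group 1 is the object name; group 2 holds all data lines, each ending in '\n'.
          vecObjects.emplace_back( (*itObject)[1].str(), (*itObject)[2].str() );
     }
     return vecObjects;
}
```

Unlike the original pattern, group 2 here keeps the trailing newline of the last data line, because each line is matched together with its line break.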

Comparison (using RegexBuddy):

Original version:

  • First match: 1277 steps
  • Failed match: 1 step (this is due to the line break between the objects)
  • Second match: 396 steps

Every added object will cause the number of steps to grow for the previous ones. E.g., adding one more object (a copy of object 2, renamed to 3) will result in: 2203 steps, 1322 steps, 425 steps.

This version:

  • First match: 67 steps
  • Failed match: 1 step (once again due to the line break between the objects)
  • Second match: 72 steps
  • Failed match: 1 step
  • Third match: 67 steps

Your expressions appear to be causing a lot of backtracking. I would change your expressions to:

First: ^===\\s+(.*?)\\s+===[\\r\\n]+^(.*?)[\\r\\n]+^===\\s+END\\s+\\1\\s+===

Second: ^<([^>]+)>:([^<]*)

Both of these expressions work with the Multiline and DotMatchesAll options. Including the start-of-line anchor ^ limits the backtracking to at most one line or one group.
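For illustration, here is a sketch of how the second expression could be used in the inner loop. Note the adaptation: std::regex has no DotMatchesAll option and its multiline flag was only added in C++17, so this version drops the leading ^ rather than rely on either. The helper name ParseFields is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: splits one object's data lines into (field name, value) pairs.
std::vector<std::pair<std::string, std::string>> ParseFields( const std::string& szDataString )
{
     // The second expression, minus the leading ^ (an adaptation for std::regex).
     static const std::regex regexData( "<([^>]+)>:([^<]*)" );

     std::vector<std::pair<std::string, std::string>> vecFields;
     for( std::sregex_iterator itData( szDataString.cbegin(), szDataString.cend(), regexData ), end; itData != end; ++itData )
     {
          // [^<]* runs up to the next '<', so a value keeps its trailing line break
          // and may span several lines.
          vecFields.emplace_back( (*itData)[1].str(), (*itData)[2].str() );
     }
     return vecFields;
}
```

Because the value group stops only at the next <, a multi-line field such as <Address> comes back as a single value including its embedded line breaks.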

Please try this pattern:

static const std::regex regexObject( "=== (\\S+) ===\\n((?:[^\\n]+|\\n(?!=== END \\1 ===))*)\\n=== END \\1 ===", std::regex_constants::ECMAScript | std::regex_constants::optimize );
