
Catastrophic backtracking issue with regular expression

I am new to regular expressions and am currently facing a problem with one.

I am trying to build a regular expression that matches strings in the following format:

OptionalStaticText{OptionalStaticText %(Placeholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText

Each section or subsection is denoted by {...}. Each placeholder is denoted by %(...). Each section or subsection can contain an arbitrary permutation of OptionalStaticText, %(Placeholder), and OptionalSubSection.

For this, I have created the regular expression below (it can also be seen here).

/^(?:(?:(?:[\s\w])*(?:({(?:(?:[\s\w])*[%\(\w\)]+(?:[\s\w])*)+(?:{(?:(?:[\s\w])*[%\(\w\)]+(?:[\s\w])*)+})*})+)(?:[\s\w])*)+)$/g

This expression perfectly matches valid strings (for example: abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd), as can be tested at the link given.

However, it causes a timeout whenever the input string is invalid (for example: abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d, where - is not a valid character as per the [\s\w] character group).

Such an invalid string causes a timeout via catastrophic backtracking, which can also be tested at the link above.

I must have made some rookie mistake, but I am not sure what. Is there a change I should make to avoid this?

Thank You.

If you have a timeout issue, it's probably because of this: [%\(\w\)]+
which is a class of the characters contained in the form you're looking for.

Use the form itself instead.

^(?:(?:[\s\w]*(?:({(?:[\s\w]*%\(\w*\)[\s\w]*)+(?:{(?:[\s\w]*%\(\w*\)[\s\w]*)+})*})+)[\s\w]*)+)$

Formatted and tested:

 ^ 
 (?:
      (?:
           [\s\w]* 
           (?:
                (                             # (1 start)
                     {
                     (?:
                          [\s\w]* 
                          % \( \w* \) 
                          [\s\w]* 
                     )+
                     (?:
                          {
                          (?:
                               [\s\w]* 
                               % \( \w* \) 
                               [\s\w]* 
                          )+
                          }
                     )*
                     }
                )+                            # (1 end)
           )
           [\s\w]* 
      )+
 )
 $
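To see why the character class is looser than the literal form, here is a minimal sketch that compares the two in isolation. The variable names and sample strings are made up for illustration. Because `\w` appears both in the class and in the surrounding `[\s\w]*`, a run of word characters can be divided between them in many ways, and that ambiguity is what the engine explores when the overall match fails.

```javascript
// The class from the question: any run of '%', '(', ')' or word characters.
const placeholderClass = /^[%\(\w\)]+$/;
// The literal form suggested above: '%(' + word characters + ')'.
const placeholderForm = /^%\(\w*\)$/;

// Sample inputs chosen for illustration only.
const samples = ["%(ph1)", "ph1", ")(%%", "%(ph1"];

for (const s of samples) {
  console.log(JSON.stringify(s), placeholderClass.test(s), placeholderForm.test(s));
}
// "%(ph1)" true true
// "ph1"    true false
// ")(%%"   true false
// "%(ph1"  true false
```

Only the literal form rejects plain text and malformed placeholders, so it no longer competes with `[\s\w]*` for the same characters.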

Trying to match the line exactly from the start ^ to the end $ with all these nested repetition operators (* or +) causes the catastrophic backtracking.

Remove the end anchor $ and simply check the length of the input string against the length of the match.
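As a rough sketch of that length check, the helper below wraps it in a function. The name `matchesWholeString` and the small `words` pattern are assumptions made for illustration and are not taken from the question. With `$` appended, a pattern shaped like `words` is a classic backtracking trap; without it, the first greedy attempt succeeds and the engine stops.

```javascript
// Returns true only when `re` matches at index 0 and the match spans all of `str`.
// Assumes `re` starts with ^ and has no $ anchor and no g or y flag.
function matchesWholeString(re, str) {
  const m = re.exec(str);
  return m !== null && m.index === 0 && m[0].length === str.length;
}

// Illustrative pattern (hypothetical): words separated by optional whitespace.
const words = /^(?:\w+\s*)+/;

console.log(matchesWholeString(words, "abc def"));              // true
console.log(matchesWholeString(words, "abc d-f"));              // false, match stops at "abc d"
console.log(matchesWholeString(words, "a".repeat(5000) + "-")); // false, and returns quickly
```

One caveat: the engine returns the first successful match rather than the longest possible one, so for some patterns this check can reject a string that an anchored pattern would accept. It is worth running it against known valid inputs.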

I've rewritten the regex to also work in cases where the optional sections are removed:

^(?:[\w \t]*(?:{(?:[\w \t]*|%\(\w+\)|{(?:[\w \t]*|%\(\w+\))+})+})?)+

Online Demo

Legend

^                              # Start of the line
(?:                            # OPEN NMG1 - non-capturing group 1
  [\w \t]*                     # Word character, space, or tab (zero or more)
  (?:                          # OPEN NMG2
    {                          # A literal '{'
    (?:                        # OPEN NMG3 with alternation between:
      [\w \t]*|                # 1. Word character, space, or tab (zero or more)
      %\(\w+\)|                # 2. A literal '%(' followed by word characters and a literal ')'
      {(?:[\w \t]*|%\(\w+\))+} # 3. A nested subsection: '{', alternatives 1 or 2 repeated, then '}'
    )+                         # CLOSE NMG3 - repeat one or more times
    }                          # A literal '}'
  )?                           # CLOSE NMG2 - repeat zero or one time
)+                             # CLOSE NMG1 - repeat one or more times
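The legend can also be mirrored in code by assembling the pattern from named fragments, which may be easier to maintain than one long literal. This is only a sketch: the fragment names are invented here, and the braces are escaped as `\{` and `\}`, which matches the same characters but also stays valid if the `u` flag is ever added.

```javascript
// Fragment names are hypothetical; each one corresponds to a line of the legend.
const text = String.raw`[\w \t]*`;                                          // alternative 1
const placeholder = String.raw`%\(\w+\)`;                                   // alternative 2
const subSection = String.raw`\{(?:${text}|${placeholder})+\}`;             // alternative 3
const section = String.raw`\{(?:${text}|${placeholder}|${subSection})+\}`;  // NMG2
const template = new RegExp(`^(?:${text}(?:${section})?)+`);                // NMG1, no end anchor

const ok = "OptionalStaticText{%(Placeholder)} OptionalStaticText";
const bad = "abc {st1 %(ph1) st11} c-d";

// As in the demo, a string is valid only if the match covers all of it.
console.log(template.exec(ok)[0].length === ok.length);   // true
console.log(template.exec(bad)[0].length === bad.length); // false
```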

Regex Schema

[Image: regular expression visualization]

Js Demo

 var re = /^(?:[\w \t]*(?:{(?:[\w \t]*|%\(\w+\)|{(?:[\w \t]*|%\(\w+\))+})+})?)+/;
 var tests = [
   'OptionalStaticText{OptionalStaticText %(Placeholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText',
   '{%(Placeholder) OptionalStaticText {OptionalSubSection}}',
   'OptionalStaticText{%(Placeholder)} OptionalStaticText',
   'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd',
   'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(!ph3!) st33 {st31 %([ph4]) st332}} cd',
   'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d',
   'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd'
 ];
 var out = document.getElementById("r");
 var t;
 while ((t = tests.pop())) {
   // The regex has no end anchor, so a string is valid only if the match covers all of it.
   var valid = t.match(re)[0].length == t.length;
   out.innerHTML += '"' + t + '"<br/>';
   out.innerHTML += 'Valid string? ' + (valid ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
 }
 <div id="r"></div>

You could write a parser to parse such structured strings, and the parser itself would allow you to check the validity of the strings. For example (not complete):

var sample = "OptionalStaticText{OptionalStaticText %(Placholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText";

function parse(str){

  return parseSection(str);

  function parseSection(str) {
    var section = new Section();
    var pointer = 0;

    while(!endOfSection()){

      if (placeHolderAhead()){
        section.push(parsePlaceHolder());
      } else if (sectionAhead()){
        section.push(parseInnerSection());
      } else {
        section.push(parseText());
      }
    }

    return section;

    function eat(token){
      if(str.substr(pointer, token.length) === token) {
        pointer += token.length;
        section.textLength += token.length;
      } else {
        throw new Error("Error: expected " + token + " but found " + str.charAt(pointer));
      }
    }

    function parseInnerSection(){
      eat("{");
      var innerSection = parseSection(str.substr(pointer));
      pointer += innerSection.textLength;
      eat("}");
      return innerSection;
    }

    function endOfSection(){
      return (pointer >= str.length)
            || (str.charAt(pointer) === "}");
    }

    function placeHolderAhead(){
      return str.substr(pointer, 2) === "%(";
    }

    function sectionAhead(){
      return str.charAt(pointer) === "{";
    }

    function parsePlaceHolder(){
      var phText = "";
      eat("%(");
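      // Stop at the end of input too, so an unterminated placeholder fails in eat(")") instead of looping forever.
      while(pointer < str.length && str.charAt(pointer) !== ")") {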
        phText += str.charAt(pointer);
        pointer++;
      }
      eat(")");
      return new PlaceHolder(phText);
    }

    function parseText(){
      var text = "";

      while(!endOfSection() && !placeHolderAhead() && !sectionAhead()){
        text += str.charAt(pointer);
        pointer++;
      }
      return text;
    }
  }
}

function Section(){
  this.isSection = true;
  this.contents = [];
  this.textLength = 0;

  this.push = function(elem){
    this.contents.push(elem);
    if(typeof elem === "string"){
      this.textLength += elem.length;
    } else if(elem.isSection || elem.isPlaceHolder) {
      this.textLength += elem.textLength;
    }
  }

  this.toString = function(indent){
    indent = indent || 0;
    var result = "";
    this.contents.forEach(function(elem){
      if(elem.isSection){
        result += elem.toString(indent+1);
      } else {
        result += Array((indent*8)+1).join(" ") + elem + "\n";
      }
    });
    return result;
  }
}

function PlaceHolder(text){
  this.isPlaceHolder = true;
  this.text = text;
  this.textLength = text.length;

  this.toString = function(){
    return "PlaceHolder: \"" + this.text + "\"";
  }
}


console.log(parse(sample).toString());

/* Prints:
OptionalStaticText
        OptionalStaticText 
        PlaceHolder: "Placholder"
         OptionalStaticText 
                OptionalSubSection
         OptionalStaticText
 OptionalStaticText
*/
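If only a yes/no answer is needed, the same recursive-descent idea can be reduced to a validator that builds no tree and reads each character once, so there is nothing to backtrack over. The sketch below rests on assumptions rather than on the code above: static text is limited to the question's `[\s\w]` class, placeholder names must be one or more `\w` characters, and nesting is capped by a hypothetical `maxDepth` parameter (2, for section plus subsection).

```javascript
// Sketch of a single-pass validator; isValidTemplate and maxDepth are illustrative names.
function isValidTemplate(str, maxDepth = 2) {
  const isTextChar = (ch) => /[\s\w]/.test(ch); // assumed static-text alphabet
  const isNameChar = (ch) => /\w/.test(ch);     // assumed placeholder-name alphabet
  let pos = 0;

  // Consumes "%(" name ")"; returns false if the name is empty or unterminated.
  function placeholder() {
    pos += 2; // skip "%("
    const start = pos;
    while (pos < str.length && isNameChar(str[pos])) pos++;
    if (pos === start || str[pos] !== ")") return false;
    pos++;
    return true;
  }

  // Consumes text, placeholders and nested sections until "}" or end of input.
  function content(depth) {
    while (pos < str.length && str[pos] !== "}") {
      if (str[pos] === "{") {
        if (depth === maxDepth) return false; // nested too deeply
        pos++;
        if (!content(depth + 1) || str[pos] !== "}") return false;
        pos++;
      } else if (str.startsWith("%(", pos)) {
        if (!placeholder()) return false;
      } else if (isTextChar(str[pos])) {
        pos++;
      } else {
        return false; // character outside the allowed alphabet
      }
    }
    return true;
  }

  // Valid only if the top level consumed the entire input (no stray "}").
  return content(0) && pos === str.length;
}

console.log(isValidTemplate("abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd"));  // true
console.log(isValidTemplate("abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d")); // false
console.log(isValidTemplate("abc {st1 %(ph1"));                                                                       // false
```

Unlike the question's regex, this sketch does not require every section to contain a placeholder. That rule could be added by having `content` track whether it saw one.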

