Catastrophic backtracking issue with regular expression
I am new to regular expressions and am currently facing a problem with them.
I am trying to build a regular expression that matches strings in the format below:
OptionalStaticText{OptionalStaticText %(Placholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText
Each Section or Subsection is denoted by {...}. Each Placeholder is denoted by %(...). Each Section or Subsection can contain an arbitrary permutation of OptionalStaticText, %(Placholder), and OptionalSubSection.
For this, I have created the regular expression below (it can also be seen at the link in the original post).
/^(?:(?:(?:[\s\w])*(?:({(?:(?:[\s\w])*[%\(\w\)]+(?:[\s\w])*)+(?:{(?:(?:[\s\w])*[%\(\w\)]+(?:[\s\w])*)+})*})+)(?:[\s\w])*)+)$/g
This expression matches valid strings perfectly (for example: abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd), as can be tested at the link given.
However, it causes a timeout whenever the input string is invalid (for example: abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d, where - is not a valid character as per the [\s\w] character group).
Such an invalid string causes a timeout via catastrophic backtracking, which can also be tested at the above link.
I must have made some rookie mistake, but I am not sure what. Is there a change I should make to avoid this?
Thank you.
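If you have a timeout issue, it is probably because of [%\(\w\)]+, which is a class of the characters contained in the form you're looking for. Because that class also matches word characters, it overlaps with the [\s\w]* on either side of it, so the engine can divide the same text between them in many different ways.
Use the form itself instead.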
^(?:(?:[\s\w]*(?:({(?:[\s\w]*%\(\w*\)[\s\w]*)+(?:{(?:[\s\w]*%\(\w*\)[\s\w]*)+})*})+)[\s\w]*)+)$

Expanded:

^
(?:
    (?:
        [\s\w]*
        (?:
            (                          # (1 start)
                {
                (?:
                    [\s\w]*
                    % \( \w* \)
                    [\s\w]*
                )+
                (?:
                    {
                    (?:
                        [\s\w]*
                        % \( \w* \)
                        [\s\w]*
                    )+
                    }
                )*
                }
            )+                         # (1 end)
        )
        [\s\w]*
    )+
)
$
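As a quick sanity check, the revised pattern can be run against the two sample strings from the question. This is only a sketch: the variable names are made up for illustration, and the invalid sample is assumed to be the valid one with a - inserted near the end.

```javascript
// Revised pattern: the placeholder is spelled out as %\(\w*\)
// instead of the character class [%\(\w\)]+.
var fixed = /^(?:(?:[\s\w]*(?:({(?:[\s\w]*%\(\w*\)[\s\w]*)+(?:{(?:[\s\w]*%\(\w*\)[\s\w]*)+})*})+)[\s\w]*)+)$/;

// Sample strings (the invalid one contains "-", which [\s\w] does not match).
var valid = 'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd';
var invalid = 'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d';

console.log(fixed.test(valid));   // true
console.log(fixed.test(invalid)); // false
```

This only exercises the two samples; it does not show that the pattern is free of heavy backtracking on every possible input.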
Trying to match the line exactly from the start ^ to the end $ with all these nested repetition operators (* or +) causes the catastrophic backtracking.
Remove the end anchor $ and simply check the length of the input string against the length of the match.
I've also rewritten the regex so that it still works when the optional sections are removed:
^(?:[\w \t]*(?:{(?:[\w \t]*|%\(\w+\)|{(?:[\w \t]*|%\(\w+\))+})+})?)+
Legend
^                               # Start of the line
(?:                             # OPEN NMG1 - non-matching (non-capturing) group 1
  [\w \t]*                      # regex word char or space or tab (zero or more)
  (?:                           # OPEN NMG2
    {                           # A literal '{'
    (?:                         # OPEN NMG3 with alternation between:
      [\w \t]*|                 # 1. regex word char or space or tab (zero or more)
      %\(\w+\)|                 # 2. A literal '%(' followed by regex word chars and a literal ')'
      {(?:[\w \t]*|%\(\w+\))+}  # 3. A nested sub-section: '{', alternatives 1 or 2 repeated, '}'
    )+                          # CLOSE NMG3 - Repeat one or more times
    }                           # A literal '}'
  )?                            # CLOSE NMG2 - Repeat zero or one time
)+                              # CLOSE NMG1 - Repeat one or more times
JS Demo
var re = /^(?:[\w \t]*(?:{(?:[\w \t]*|%\(\w+\)|{(?:[\w \t]*|%\(\w+\))+})+})?)+/;
var tests = [
  'OptionalStaticText{OptionalStaticText %(Placeholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText',
  '{%(Placeholder) OptionalStaticText {OptionalSubSection}}',
  'OptionalStaticText{%(Placeholder)} OptionalStaticText',
  'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd',
  'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(!ph3!) st33 {st31 %([ph4]) st332}} cd',
  'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} c-d',
  'abc {st1 %(ph1) st11} int {st2 %(ph2) st22}{st3 %(ph3) st33 {st31 %(ph4) st332}} cd'
];
var out = document.getElementById("r");
var t;
while ((t = tests.pop())) {
  // The pattern has no end anchor, so a string is valid only if the match covers all of it.
  var ok = t.match(re)[0].length === t.length;
  out.innerHTML += '"' + t + '"<br/>';
  out.innerHTML += 'Valid string? ' + (ok ? '<font color="green">YES</font>' : '<font color="red">NO</font>') + '<br/><br/>';
}
<div id="r"></div>
You could write a parser to parse such structured strings, and the parser itself would allow you to check the validity of the strings. For example (not complete):
var sample = "OptionalStaticText{OptionalStaticText %(Placholder) OptionalStaticText {OptionalSubSection} OptionalStaticText} OptionalStaticText";

function parse(str){
    return parseSection(str);

    function parseSection(str) {
        var section = new Section();
        var pointer = 0;

        while(!endOfSection()){
            if (placeHolderAhead()){
                section.push(parsePlaceHolder());
            } else if (sectionAhead()){
                section.push(parseInnerSection());
            } else {
                section.push(parseText());
            }
        }
        return section;

        // Consume the expected token at the current position, or throw.
        function eat(token){
            if(str.substr(pointer, token.length) === token) {
                pointer += token.length;
                section.textLength += token.length;
            } else {
                throw ("Error: expected " + token + " but found " + str.charAt(pointer));
            }
        }

        // Parse "{ ... }" recursively and skip past the text it consumed.
        function parseInnerSection(){
            eat("{");
            var innerSection = parseSection(str.substr(pointer));
            pointer += innerSection.textLength;
            eat("}");
            return innerSection;
        }

        function endOfSection(){
            return (pointer >= str.length)
                || (str.charAt(pointer) === "}");
        }

        function placeHolderAhead(){
            return str.substr(pointer, 2) === "%(";
        }

        function sectionAhead(){
            return str.charAt(pointer) === "{";
        }

        function parsePlaceHolder(){
            var phText = "";
            eat("%(");
            // Stop at the end of input too, so an unterminated placeholder
            // makes eat(")") throw instead of looping forever.
            while(pointer < str.length && str.charAt(pointer) !== ")") {
                phText += str.charAt(pointer);
                pointer++;
            }
            eat(")");
            return new PlaceHolder(phText);
        }

        function parseText(){
            var text = "";
            while(!endOfSection() && !placeHolderAhead() && !sectionAhead()){
                text += str.charAt(pointer);
                pointer++;
            }
            return text;
        }
    }
}

function Section(){
    this.isSection = true;
    this.contents = [];
    this.textLength = 0;

    // Add an element and keep track of how much source text it covers.
    this.push = function(elem){
        this.contents.push(elem);
        if(typeof elem === "string"){
            this.textLength += elem.length;
        } else if(elem.isSection || elem.isPlaceHolder) {
            this.textLength += elem.textLength;
        }
    }

    this.toString = function(indent){
        indent = indent || 0;
        var result = "";
        this.contents.forEach(function(elem){
            if(elem.isSection){
                result += elem.toString(indent+1);
            } else {
                result += Array((indent*8)+1).join(" ") + elem + "\n";
            }
        });
        return result;
    }
}

function PlaceHolder(text){
    this.isPlaceHolder = true;
    this.text = text;
    this.textLength = text.length;
    this.toString = function(){
        return "PlaceHolder: \"" + this.text + "\"";
    }
}

console.log(parse(sample).toString());
/* Prints:
OptionalStaticText
        OptionalStaticText
        PlaceHolder: "Placholder"
        OptionalStaticText
                OptionalSubSection
        OptionalStaticText
OptionalStaticText
*/
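If only a yes/no validity answer is needed, the same idea can be reduced to a single left-to-right scan, which never backtracks. The sketch below is not part of the parser above: isValidTemplate and maxDepth are hypothetical names, and it assumes that static text consists of [\s\w] characters, that placeholders have the form %(\w+) and appear only inside a section, that empty sections are allowed, and that sections nest at most two levels deep.

```javascript
// Hypothetical validator: true when str follows the assumed template grammar.
function isValidTemplate(str, maxDepth) {
  maxDepth = maxDepth || 2;           // assumption: section + one level of sub-section
  var placeholder = /%\(\w+\)/y;      // sticky: must match exactly at lastIndex
  var depth = 0;
  var pos = 0;
  while (pos < str.length) {
    var ch = str.charAt(pos);
    if (ch === "{") {
      if (++depth > maxDepth) return false;   // nested too deeply
      pos++;
    } else if (ch === "}") {
      if (--depth < 0) return false;          // closing brace without an opening one
      pos++;
    } else if (ch === "%") {
      if (depth === 0) return false;          // assumption: placeholders live inside sections
      placeholder.lastIndex = pos;
      if (!placeholder.exec(str)) return false;
      pos = placeholder.lastIndex;            // continue right after ")"
    } else if (/[\s\w]/.test(ch)) {
      pos++;                                  // ordinary static text
    } else {
      return false;                           // any other character is invalid
    }
  }
  return depth === 0;                         // every section must be closed
}

console.log(isValidTemplate("abc {st1 %(ph1) st11} int {st3 %(ph3) {st31 %(ph4)}} cd")); // true
console.log(isValidTemplate("abc {st1 %(ph1) st11} c-d"));                               // false
```

Each character is visited once, so the running time grows linearly with the input length whether the string is valid or not.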