
Catastrophic backtracking in a regular expression

I am using the regular expression below:

[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*

It reports catastrophic backtracking when I try to match it against this input string:

/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg

The expected output array of matches is:

[ 'w_100',
  'h_500',
  'e_saturation:50,e_tint:red:blue',
  'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]

I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the matches.

Is there any way to fix this backtracking in the regular expression, or can you suggest a better regex for my input string?

The problem

You use a lot of alternations where a character class would be more effective. You are also getting catastrophic backtracking because of the following quantifier:

[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
                                                                                                                           ^

The engine tries each of your alternations in turn, but it keeps backtracking and never gets past them all (the effect can resemble an infinite loop). In your case, the regex is so inefficient that it times out. I removed half of your pattern and it still took about half a second and almost 200K steps to complete.
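To get a feel for how quickly this kind of backtracking grows, here is a minimal JavaScript sketch. The pattern is a simplified stand-in of my own (two overlapping alternatives inside a repeated group), not the OP's regex, and the helper name `timeMatch` is hypothetical. Timings will vary by machine and engine, so none are shown.

```javascript
// Simplified stand-in for the problematic structure: two overlapping
// alternatives inside a group that is itself repeated with +.
const overlapping = /^(?:[A-Za-z]+|[A-Za-z0-9]+)+$/;

// Hypothetical helper: run one test and report the result and elapsed time.
function timeMatch(re, input) {
  const start = performance.now();
  const matched = re.test(input);
  return { matched, elapsedMs: performance.now() - start };
}

for (const n of [8, 11, 14]) {
  // n letters followed by a character that forces the overall match to fail,
  // so the engine has to try every way of splitting the letters up.
  const input = 'a'.repeat(n) + '/';
  const { matched, elapsedMs } = timeMatch(overlapping, input);
  console.log(`n=${n} matched=${matched} time=${elapsedMs.toFixed(2)}ms`);
}
```

Each extra letter multiplies the number of ways the engine can divide the run between the two alternatives, which is why the input lengths are kept small here.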


Original Answer

How can it be fixed?

The first step is to fix the quantifier and stop it from continuously backtracking. This is quite easy: make it possessive, so + becomes ++. Changing the quantifier to possessive yields a pattern that takes about 56 ms and approximately 9K steps to complete (on my computer).
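Possessive quantifiers are available in PCRE and Java, but JavaScript's RegExp does not accept the ++ syntax. If you need the same effect there, a common workaround is to capture inside a lookahead and then consume the capture with a backreference. The sketch below uses a simplified stand-in pattern of my own rather than the OP's regex, and the names `backtracking`, `possessiveLike`, and `timeTest` are hypothetical.

```javascript
// JavaScript has no `++`; writing /(?:a|b)++/ throws a SyntaxError.
// A lookahead is atomic once it succeeds, so capturing inside it and then
// consuming the capture with \1 behaves like a possessive group.
const backtracking = /^(?:[A-Za-z]+|[A-Za-z0-9]+)+$/;
const possessiveLike = /^(?=((?:[A-Za-z]+|[A-Za-z0-9]+)+))\1$/;

// Letters followed by a character that makes the overall match fail.
const failing = 'a'.repeat(13) + '/';

// Hypothetical helper: run one test and report the result and elapsed time.
function timeTest(re, input) {
  const start = performance.now();
  const matched = re.test(input);
  return { matched, ms: performance.now() - start };
}

console.log('plain          ', timeTest(backtracking, failing));
console.log('possessive-like', timeTest(possessiveLike, failing));
console.log('still matches valid input:', possessiveLike.test('abc123'));
```

Note that this trick adds a capturing group, so any existing group numbers shift by one. Like a true possessive quantifier, it can also reject inputs that the backtracking version would have matched.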

The second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.

(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?

It's much shorter, more concise, and less prone to errors.
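As a rough check on this kind of rewrite, the JavaScript sketch below compares the single-character alternatives with an equivalent character class. It also shows one thing to watch for when building the class: an unescaped hyphen between two characters is read as a range, so writing `:-=` inside the brackets would also accept `;` and `<`. The sample list and variable names are my own.

```javascript
// Single-character alternatives from the original group, and a class meant to replace them.
const alternation = /^(?:_|:|=|\s|\?|%|\.|!|#|\*)$/;
const charClass = /^[_:=\s?%.!#*]$/;
// Same class with a hyphen between ':' and '=': this is the range U+003A..U+003D.
const accidentalRange = /^[_:-=\s?%.!#*]$/;

const samples = ['_', ':', '=', ' ', '?', '%', '.', '!', '#', '*', ';', '<', '-', 'a'];
for (const ch of samples) {
  // Columns: character, alternation result, class result, accidental-range result.
  console.log(JSON.stringify(ch), alternation.test(ch), charClass.test(ch), accidentalRange.test(ch));
}
```

The first two columns should agree on every sample. The third differs for `;` and `<`, and it still rejects a literal `-`.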

The new pattern


This pattern takes only 271 steps and less than one millisecond to complete (measured with the PCRE engine; it works in Java too).

(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:=\s?%.!#*]|[-.][a-zA-Z]+)++

I also changed your positive lookahead to a positive lookbehind, (?<=[,\/]), to improve performance.


Additionally, if you don't need all of that specific logic, you can simply use the following regex (it takes just under half as many steps as my regex above):


(?<=[,\/])[A-Za-z]+_[^,\/]+

Results

This results in the following array:

P.S. I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; this needs clarification.

w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc
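For anyone who wants to reproduce this list, here is a minimal JavaScript sketch using the shortened pattern. It assumes an engine with lookbehind support (ES2018 or later, for example a recent Node.js); the variable names are my own.

```javascript
const input = '/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,' +
  'l_text:Neucha_26_bold:Loremipsum./' +
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/' +
  '1488800313_DSC_0334__3_.JPG_mweubp.jpg';

// Shortened pattern: a segment starts right after ',' or '/',
// has letters followed by '_', then runs up to the next ',' or '/'.
const shortPattern = /(?<=[,\/])[A-Za-z]+_[^,\/]+/g;

const parts = input.match(shortPattern);
console.log(parts);
```

The image name is skipped because it starts with digits, so `[A-Za-z]+_` cannot match immediately after the final /.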

Edit #1

The OP clarified the expected results. I added , to the character class in the fourth option of the non-capturing group:


(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:=\s?%.!#*,]|[-.][a-zA-Z]+)++

And in its shortened form:


(?<=\/)[A-Za-z]+_[^\/]+

Results

This results in the following array:

w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc

Edit #2

The OP presented another input and identified issues with the Edit #1 pattern on that input. I added logic to force a failure on the last item in the string.

New test string:

/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg


(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+

Same results as in Edit #1.
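To see what the added negative lookahead changes, the JavaScript sketch below runs the shortened Edit #1 pattern and the Edit #2 pattern against the new test string. It assumes an engine with lookbehind support (ES2018 or later); the variable names are my own.

```javascript
const newInput = '/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,' +
  'l_text:Neucha_26_bold:Loremipsum./' +
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/' +
  'sample_url_image.jpg';

// Edit #1 (shortened form): any segment after '/' that starts with letters and '_'.
const edit1 = /(?<=\/)[A-Za-z]+_[^\/]+/g;
// Edit #2: same, but refuse a segment that runs to the end of the string.
const edit2 = /(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g;

const withEdit1 = newInput.match(edit1);
const withEdit2 = newInput.match(edit2);

// Edit #1 also picks up the file name, because it starts with letters and '_'.
console.log(withEdit1.length, withEdit1[withEdit1.length - 1]);
// Edit #2 drops it and keeps the five transformation segments.
console.log(withEdit2);
```

Without the m flag, $ only matches at the very end of the input, so only the final segment is rejected.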


PCRE version (if anyone is looking for it), which is more efficient than the method above:


(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?

Assuming your example has a typo, i.e. that the last / should be split on too:

You can simply split on / and then filter out the .jpg items:

function splitWithFilter(line, filter) {
    var filterRe = filter ? new RegExp(filter, 'i') : null;
    return line
    .replace(/^\//, '') // remove leading /
    .split(/\//)
    //.filter(Boolean)    // filter out empty items (alternative to above replace())
    .filter(function(item) {
        return !filterRe || !item.match(filterRe);
    });
}

var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));

Expected output:

[
 "w_100",
 "h_500",
 "e_saturation:50,e_tint:red:blue",
 "c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
 "l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]
