Catastrophic backtracking in regular expression

I am using the regular expression below:

[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*

and it reports catastrophic backtracking when I try to match it against this input string:

/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg

The expected array of matches is:

[ 'w_100',
  'h_500',
  'e_saturation:50,e_tint:red:blue',
  'c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.',
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc' ]

I don't want the image name 1488800313_DSC_0334__3_.JPG_mweubp.jpg to be included in the matches.

Is there a way to avoid this backtracking, or can you suggest a better regex for my input string?

The problem

You use a lot of alternations when a character class would be more effective. Also, you're getting the catastrophic backtracking due to the following quantifier:

[^\/]*([A-Za-z]+_([A-Za-z]+|-?[A-Z0-9]+(\.[A_Z0-9]+)?|(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?)+(?=(,|\/)))+|,[^\/]*
                                                                                                                           ^

The engine tries each of your alternations in turn, but keeps backtracking and never makes it past them (the effect is comparable to an infinite loop). In your case, the regex is so inefficient that it times out. I removed half of your pattern, and it still takes half a second and almost 200K steps to complete.
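
To get a feel for why this blows up, consider how many ways a repeated group such as (?:[A-Za-z]+)+ can carve up a run of n letters: each gap between two letters either is or is not a boundary between iterations, which gives 2^(n-1) possible splits. A backtracking engine may have to try every one of them before it can report a failed match. The sketch below only counts those splits; countSplits is a hypothetical helper written for illustration, not part of any regex engine.

```javascript
// Count the ways a run of n characters can be divided into
// consecutive non-empty pieces (one piece per iteration of the group).
function countSplits(n) {
  if (n === 0) return 1; // nothing left to split: one way to finish
  let total = 0;
  // The first piece takes k characters; the remainder is split recursively.
  for (let k = 1; k <= n; k++) {
    total += countSplits(n - k);
  }
  return total;
}

for (const n of [1, 5, 10, 20]) {
  console.log(n, countSplits(n));
}
```

The count doubles with every extra character, and the alternation inside your group multiplies it further.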


Original Answer

How can it be fixed?

The first step is to fix the quantifier and stop it from backtracking continuously. This is actually quite easy: just make it possessive, so + becomes ++. Changing the quantifier to possessive yields a pattern that takes about 56 ms and approximately 9K steps to complete (on my computer).
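
Possessive quantifiers are available in PCRE and Java, but JavaScript's RegExp does not accept the ++ syntax. If you need the same behaviour there, one commonly used emulation is a lookahead that captures the run, followed by a backreference that consumes it. The following is a minimal sketch on a toy pattern (the names greedy and possessiveLike are only for illustration):

```javascript
// Plain greedy quantifier: a+ can give back one "a" so the trailing "a" matches.
const greedy = /^a+a$/;

// Emulated possessive a++: the lookahead captures the longest run of "a",
// and \1 consumes that run as a unit that cannot be shortened afterwards.
const possessiveLike = /^(?=(a+))\1a$/;

console.log(greedy.test('aaa'));         // true
console.log(possessiveLike.test('aaa')); // false
```

The second pattern fails because the captured run has already taken every "a", which is exactly what a possessive quantifier does.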

Second step is to improve the efficiency of the pattern. Change your alternations to character classes where possible.

(?:_|:|:-|-[a-zA-Z]+|\.[a-zA-Z]+|[A-Z0-9a-z]+|=|\s|\?|\%|\.|!|#|\*)?
# should instead be
(?::-|[_:=\s?%.!#*]|[-.][a-zA-Z]+|[A-Z0-9a-z]+)?

It's much shorter and less prone to errors.
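
A quick way to check a rewrite like this is to compare both forms character by character. The sketch below does that in JavaScript for the single-character options only, over the ASCII range. One thing to watch for when building the class: a hyphen between two characters forms a range, so writing :-= inside the brackets would also admit ; and <.

```javascript
// Single-character options written as an alternation and as a class.
const alternation = /^(?:_|:|=|\s|\?|%|\.|!|#|\*)$/;
const charClass = /^[_:=\s?%.!#*]$/;
// Same class with a hyphen between ":" and "=", which turns it into a range.
const withRange = /^[_:-=\s?%.!#*]$/;

// Every ASCII character, one string each.
const ascii = Array.from({ length: 128 }, (_, code) => String.fromCharCode(code));

// Characters on which the alternation and the class disagree.
const disagreements = ascii.filter((ch) => alternation.test(ch) !== charClass.test(ch));
// Characters accepted only because of the accidental range.
const extraInRange = ascii.filter((ch) => withRange.test(ch) && !charClass.test(ch));

console.log(disagreements.length); // 0
console.log(extraInRange);         // [ ';', '<' ]
```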

The new pattern

This pattern takes only 271 steps and less than one millisecond to complete (using the PCRE engine; it works in Java too):

(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:=\s?%.!#*]|[-.][a-zA-Z]+)++

I also changed your positive lookahead to a positive lookbehind (?<=[,\/]) to improve performance.


Additionally, if you don't need all the specific logic, you can quite simply use the following regex (just under half as many steps as my regex above):

(?<=[,\/])[A-Za-z]+_[^,\/]+
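
If you are working in JavaScript, the shortened pattern can be applied with String.prototype.match and the g flag. This is only a sketch: it assumes an engine with lookbehind support (ES2018 or later), and the variable names are illustrative.

```javascript
const input =
  '/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,' +
  'l_text:Neucha_26_bold:Loremipsum./' +
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/' +
  '1488800313_DSC_0334__3_.JPG_mweubp.jpg';

// A match must start right after "," or "/", begin with letters and "_",
// and then runs up to (but not including) the next "," or "/".
const shortPattern = /(?<=[,\/])[A-Za-z]+_[^,\/]+/g;

// match() returns null when nothing matches, so fall back to an empty array.
const parts = input.match(shortPattern) || [];
console.log(parts);
```

The image name is skipped because it starts with a digit, which [A-Za-z]+ cannot match.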

Results

This results in the following array:

PS: I'm assuming there's a typo in your expected output and that the / between l_text and l_fetch should also be split on; this needs clarification.

w_100
h_500
e_saturation:50
e_tint:red:blue
c_crop
a_100
l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc

Edit #1

The OP clarified the expected results. I added , to the character class in the fourth option of the non-capture group:

(?<=[,\/])[A-Za-z]+_(?:[A-Z0-9a-z]+|-?[A-Z0-9]+(?:\.[A-Z0-9]+)?|:-|[_:=\s?%.!#*,]|[-.][a-zA-Z]+)++

And in its shortened form:

(?<=\/)[A-Za-z]+_[^\/]+

Results

This results in the following array:

w_100
h_500
e_saturation:50,e_tint:red:blue
c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.
l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc

Edit #2

The OP presented another input and identified issues with Edit #1 related to that input. I added logic to force a fail on the last item in a string.

New test string:

/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/sample_url_image.jpg

(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+
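
As a rough JavaScript check of this pattern against the new test string (again assuming ES2018+ lookbehind support; the variable names are illustrative):

```javascript
const inputWithNamedImage =
  '/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,' +
  'l_text:Neucha_26_bold:Loremipsum./' +
  'l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/' +
  'sample_url_image.jpg';

// The negative lookahead rejects a segment that would run to the end of
// the string, so the trailing image name is never returned.
const lastExcluded = /(?<=\/)(?![A-Za-z]+_[^\/]+$)[A-Za-z]+_[^\/]+/g;

// match() returns null when nothing matches, so fall back to an empty array.
const segments = inputWithNamedImage.match(lastExcluded) || [];
console.log(segments);
```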

Same results as in Edit #1.


PCRE version (if anyone is looking for it), which is more efficient than the method above:

(?<=\/)[A-Za-z]+_[^\/]+(?:$(*SKIP)(*FAIL))?

Assuming your example has a typo, i.e. the last / should be split on too:

You can simply split on / and then filter out the .jpg items:

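function splitWithFilter(line, filter) {
    // Build a case-insensitive regex from the filter string, if one was given.
    var filterRe = filter ? new RegExp(filter, 'i') : null;
    return line
        .replace(/^\//, '')   // remove the leading / so split() yields no empty first item
        .split('/')
        //.filter(Boolean)    // alternative to the replace() above: drop empty items
        .filter(function (item) {
            // Keep the item unless it matches the filter.
            return !filterRe || !filterRe.test(item);
        });
}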

var str = "/w_100/h_500/e_saturation:50,e_tint:red:blue/c_crop,a_100,l_text:Neucha_26_bold:Loremipsum./l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc/1488800313_DSC_0334__3_.JPG_mweubp.jpg";
console.log(JSON.stringify(splitWithFilter(str, '\\.jpg$'), null, ' '));

Expected output:

[
 "w_100",
 "h_500",
 "e_saturation:50,e_tint:red:blue",
 "c_crop,a_100,l_text:Neucha_26_bold:Loremipsum.",
 "l_fetch:aHR0cDovL2Nsb3VkaW5hcnkuY29tL2ltYWdlcy9vbGRfbG9nby5wbmc"
]
