
How to avoid Catastrophic Backtracking in RegExp?

I am trying to write a regular expression to test a string.

Basically, what I want is something-something.

'a' ===> TRUE
'abc' ===> TRUE
'a-b' ===> TRUE
'-' ===> FALSE
'a-' ===> FALSE
'-b' ===> FALSE

So the first version of this regexp is born.

/^[\w]+[-\s]?[\w]+$/

It works fine, but it fails if the string is only one letter.

'a', failed

So I modified the pattern

^[\w]+([-\s]?[\w]+)*$

It works, but the browser hangs if the tested string is long (20+ letters). Yes, I know what's going on there: catastrophic backtracking.

So in this scenario, how can I improve it?

UPDATE:

I think I missed one scenario: it should also support repeated groups.

aaa aaa aaa aaa ===> TRUE
aaa-aaa aaa-aaa ===> TRUE

That's why I made the group with brackets.

This works for me, incorporating feedback from @VLAZ. Specifying the start ^, the end $, and the optional group (-\w+)? were the key components.

EDIT: Supporting the space meant changing (-\w+)? to ([-\s]\w+)*, which matches zero or more repetitions of a hyphen or whitespace character followed by at least one word character.

    const pattern = /^\w+([-\s]\w+)*$/;

    const tests = [
      'a',               // ===> TRUE
      'abc',             // ===> TRUE
      'a-b',             // ===> TRUE
      'aaa aaa aaa aaa', // ===> TRUE
      'aaa-aaa aaa-aaa', // ===> TRUE
      '-',               // ===> FALSE
      'a-',              // ===> FALSE
      '-b',              // ===> FALSE
    ];
    console.log(tests.map(test => pattern.test(test)));

    // performance: time only the call to pattern.test()
    const perf = `${'a'.repeat(1000)}-${'a'.repeat(1000)} ${'b'.repeat(1000)}-${'b'.repeat(1000)}`;
    const start = performance.now();
    const result = pattern.test(perf);
    const elapsed = performance.now() - start;
    console.log(`${perf.length} char string took ${elapsed}ms. Got result: ${result}`);

The issue is the nested repetition in the pattern ([-\s]?[\w]+)*: the group matches an optional space or dash followed by one or more \w, and the group itself is repeated zero or more times. Because [-\s] is optional, there are many ways to match the same input, and that leads to catastrophic backtracking. For example, abc can be matched as (\w\w\w), (\w\w)(\w), (\w)(\w\w), or (\w)(\w)(\w), and the regex engine will try all of these possibilities because the pattern ([-\s]?[\w]+)* (or, to make it more obvious by dropping the optional separator, ([\w]+)*) allows it. A run of n word characters can be split in 2^(n-1) ways, so the work doubles with every extra character.

All of these possibilities are tried when the rest of the pattern cannot match. For example, with the input "aaa-", the trailing - makes the match fail, but the regex engine keeps backtracking and checks every possible split before giving up.
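
To make the growth concrete, here is a minimal sketch that enumerates every way a run of word characters can be cut into consecutive non-empty chunks. The helper name splits is made up for this illustration; it only models the choices that ([\w]+)* leaves open, not how a real engine is implemented.

```javascript
// Hypothetical helper: list every way to cut `run` into consecutive,
// non-empty chunks. Each result is one way ([\w]+)* could carve up the text.
function splits(run) {
  if (run.length === 0) return [[]]; // exactly one way to split nothing
  const result = [];
  for (let i = 1; i <= run.length; i++) {
    const head = run.slice(0, i);
    for (const rest of splits(run.slice(i))) {
      result.push([head, ...rest]);
    }
  }
  return result;
}

// The four splits of 'abc' listed above
console.log(splits('abc'));

// The count doubles with each extra character: 2^(n-1) splits for n characters
for (const n of [3, 5, 10]) {
  console.log(n, splits('a'.repeat(n)).length);
}
```

At 20 characters that is already more than half a million splits, which is consistent with the hang reported in the question.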

Instead, you can simplify your regex to this

/^\w+(?:[-\s]\w+)*$/
  1. You don't need a character class for [\w], since it contains only one item. Removing the square brackets changes nothing but makes the pattern easier to read.
  2. If you don't need the latter half of the pattern to be extracted, you can use a non-capturing group, (?:).
  3. Make the separator [-\s] mandatory inside the group and let the entire latter half of the regex repeat zero or more times. This means that you match either \w+ alone (one or more word characters) or \w+ followed by any number of [-\s]\w+ repetitions. The engine is not compelled to re-try failing matches.
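
As a small illustration of point 2, the sketch below compares what String.prototype.match returns for the capturing and non-capturing forms. The variable names and the sample input are chosen for this example only.

```javascript
const capturing = /^\w+([-\s]\w+)*$/;
const nonCapturing = /^\w+(?:[-\s]\w+)*$/;

const input = 'aaa-bbb ccc';
const withGroup = input.match(capturing);
const withoutGroup = input.match(nonCapturing);

// With a capturing group the result has an extra slot, which holds the text
// matched by the LAST repetition of the group.
console.log(withGroup.length, JSON.stringify(withGroup[1]));

// With a non-capturing group only the full match is returned.
console.log(withoutGroup.length);
```

Both forms accept and reject the same strings; the non-capturing group only avoids storing a value nobody reads.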

The final step is the solution to the problem; the others are just slight cleanup. The important thing is that the pattern is restricted and does not allow multiple ways to match the same input: [-\s] is mandatory, as is \w+ (at least one), so repeating the group (?:[-\s]\w+)* cannot produce overlapping matches. If we manually expand it to ([-\s]\w\w\w), ([-\s]\w\w)([-\s]\w), and ([-\s]\w)([-\s]\w\w), it becomes easy to see that these do not match the same inputs.
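
One way to convince yourself that the simplification does not change which strings are accepted is to compare both patterns exhaustively on short inputs, where the original pattern is still fast. This is a sketch under assumptions: the three-character alphabet, the length limit of 6, and the generator name allStrings are arbitrary choices made for this example.

```javascript
const original = /^[\w]+([-\s]?[\w]+)*$/;
const simplified = /^\w+(?:[-\s]\w+)*$/;

// Hypothetical helper: yield every string over `alphabet` with length 1..maxLen.
function* allStrings(alphabet, maxLen) {
  let level = [''];
  for (let len = 1; len <= maxLen; len++) {
    const next = [];
    for (const prefix of level) {
      for (const ch of alphabet) {
        next.push(prefix + ch);
      }
    }
    yield* next;
    level = next;
  }
}

let checked = 0;
const disagreements = [];
for (const s of allStrings(['a', '-', ' '], 6)) {
  checked++;
  // Keep inputs short: the original pattern slows down sharply as inputs grow.
  if (original.test(s) !== simplified.test(s)) disagreements.push(s);
}

console.log(`checked ${checked} strings, ${disagreements.length} disagreements`);
```

An empty disagreements list on this sample is not a proof, but it supports the argument above that only the amount of backtracking differs, not the result.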

    const regex = /^\w+(?:[-\s]\w+)*$/;

    const valid = [
      'a',
      'abc',
      'a-b',
      'aaa aaa aaa aaa',
      'aaa-aaa aaa-aaa',
      'a'.repeat(100),
      `a-${'a'.repeat(100)}`,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}`,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}`,
      `a ${'a'.repeat(100)} ${'a'.repeat(100)}`,
      `a ${'a '.repeat(100)}a`,
    ];

    const invalid = [
      '-',
      'a-',
      '-b',
      'aaa aaa aaa aaa-',
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}-`,
      `a ${'a'.repeat(100)} ${'a'.repeat(100)} `,
      `a-${'-'.repeat(100)}`,
      `a ${' '.repeat(100)}`,
      `a-${'-'.repeat(100)}a`,
      `a ${'a '.repeat(100)}`,
      `-${'a'.repeat(100)}`,
      ` ${'a'.repeat(100)}`,
      `${'a'.repeat(100)}-`,
      `${'a'.repeat(100)} `,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}-`,
      `a-${'-'.repeat(100)}`,
      `a-${'a-'.repeat(100)}`,
      `-${'a'.repeat(100)}`,
      `${'a'.repeat(100)}-`,
    ];

    console.log('---- VALID ----');
    for (const s of valid) test(s);

    console.log('---- INVALID ----');
    for (const s of invalid) test(s);

    function test(str) {
      console.log(`${str} ===> ${regex.test(str)}`);
    }

This works and avoids catastrophic backtracking. The key is the mandatory separator in front of each repeated \w+; the non-capturing group only skips the unused capture.

^\w+(?:[-\s]\w+)*$
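
A detail worth checking when writing the separator class: inside square brackets, | is a literal pipe rather than alternation, so a class written as [-|\s] also accepts a|b. The short sketch below compares the two spellings; the variable names are chosen for this example only.

```javascript
const withPipe = /^\w+(?:[-|\s]\w+)*$/;
const withoutPipe = /^\w+(?:[-\s]\w+)*$/;

// The pipe inside the class is matched literally.
console.log(withPipe.test('a|b'));
console.log(withoutPipe.test('a|b'));

// Both spellings still accept the intended separators.
console.log(withPipe.test('a-b'), withoutPipe.test('a-b'));
console.log(withPipe.test('a b'), withoutPipe.test('a b'));
```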
