How to avoid catastrophic backtracking in RegExp?

Question

I am trying to write a regular expression to test a string.

Basically, what I want is something-something.

'a' ===> TRUE
'abc' ===> TRUE
'a-b' ===> TRUE
'-' ===> FALSE
'a-' ===> FALSE
'-b' ===> FALSE

So the first version of the regexp was born:

/^[\w]+[-\s]?[\w]+$/

It works fine, but it fails if the string is only one letter.

'a', failed

So I modified the pattern:

^[\w]+([-\s]?[\w]+)*$

It works, but the browser hangs if the tested string is long (20+ letters, say). And yes, I know what's going on there: catastrophic backtracking.
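A small-scale sketch of the slowdown, assuming the hang shows up on inputs that ultimately fail to match (here a run of letters followed by a trailing hyphen). The helper name, the input shape, and the lengths are illustrative assumptions and are not taken from the question. Timings vary by engine and machine, but they should grow roughly exponentially with the input length.

```javascript
// Pattern from the question that exhibits catastrophic backtracking.
const slowPattern = /^[\w]+([-\s]?[\w]+)*$/;

// Hypothetical helper: time one test of a failing input (n letters plus '-').
function timeFailingInput(n) {
  const input = 'a'.repeat(n) + '-'; // the trailing '-' makes the match fail
  const start = Date.now();
  const matched = slowPattern.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Lengths are kept small on purpose; larger values can freeze the page.
const timings = [10, 14, 18, 20].map(timeFailingInput);
for (const { n, matched, ms } of timings) {
  console.log(`${n} letters + '-': matched=${matched}, ${ms}ms`);
}
```

Inputs that do match return quickly, because the greedy first attempt succeeds; the cost only appears when the engine has to exhaust every alternative before giving up.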

So in this scenario, how can I improve it?

UPDATE:

I think I missed one scenario: it should also support repeated groups.

aaa aaa aaa aaa ===> TRUE
aaa-aaa aaa-aaa ===> TRUE

That's why I made the group with brackets.

Answer 1

This works for me, and incorporates feedback from @VLAZ. Specifying the start ^, the end $, and the optional character grouping (-\w+)? were the key components.

EDIT: Supporting the space involved changing (-\w+)? to ([-\s]\w+)*, which matches zero or more repetitions of a space or hyphen followed by at least one word character.

    const pattern = /^\w+([-\s]\w+)*$/;
    const tests = [
      'a',               // ===> TRUE
      'abc',             // ===> TRUE
      'a-b',             // ===> TRUE
      'aaa aaa aaa aaa', // ===> TRUE
      'aaa-aaa aaa-aaa', // ===> TRUE
      '-',               // ===> FALSE
      'a-',              // ===> FALSE
      '-b',              // ===> FALSE
    ];
    console.log(tests.map(test => pattern.test(test)));

    // performance: run the test between the two timestamps so that the
    // elapsed time actually covers the match
    const perf = `${'a'.repeat(1000)}-${'a'.repeat(1000)} ${'b'.repeat(1000)}-${'b'.repeat(1000)}`;
    const start = performance.now();
    const result = pattern.test(perf);
    const elapsed = performance.now() - start;
    console.log(`${perf.length} char string took ${elapsed}ms. Got result: ${result}`);

Answer 2

The issue you have is the double repetition in the pattern ([-\s]?[\w]+)*: inside the group you allow an optional space or dash followed by one or more \w, and the group itself is repeated zero or more times. That leads to catastrophic backtracking, because the optional [-\s] means there are many ways to match the same input. For example, abc can be matched as (\w\w\w), (\w\w)(\w), (\w)(\w\w), or (\w)(\w)(\w), and the regex engine will try all of these possibilities because the pattern ([-\s]?[\w]+)* allows it. This is more obvious once the optional separator is removed, which leaves ([\w]+)*.
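The number of ways to split the same run of characters can be made concrete with a short sketch. The function name splits is hypothetical, and the code only enumerates the partitions listed above; it is not how a regex engine is implemented.

```javascript
// Enumerate every way to cut a string into non-empty contiguous chunks.
// Each result is one way a nested quantifier such as (\w+)* could
// distribute the same characters across iterations of the group.
function splits(str) {
  if (str.length === 0) return [[]];
  const results = [];
  for (let i = 1; i <= str.length; i++) {
    const head = str.slice(0, i);
    for (const rest of splits(str.slice(i))) {
      results.push([head, ...rest]);
    }
  }
  return results;
}

// Four splits for 'abc': a|b|c, a|bc, ab|c, abc.
console.log(splits('abc'));

// The count doubles with every extra character: 2^(n-1) for n characters.
for (const n of [3, 5, 10, 20]) {
  console.log(`${n} characters: ${2 ** (n - 1)} ways`);
}
```

At 20 characters that is already more than half a million candidate splits, which is consistent with the hang reported in the question.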

All of these possibilities are tried when the end of the pattern cannot be matched. For example, with the input "aaa-", the final - fails to match, but the regex engine keeps backtracking and checks every way of splitting the preceding characters.

Instead, you can simplify your regex to this:

/^\w+(?:[-\s]\w+)*$/

1. You don't need a character class for [\w] if it contains only one item. This doesn't change anything, but removing the square brackets makes the pattern easier to read.
2. If you don't need the latter half of the pattern to be extracted, you can use a non-capturing group: (?:).
3. Make the separator mandatory inside the group and let the entire latter half of the regex be optional (repeated zero or more times). This means that you match either \w+ alone (one or more word characters) or \w+ followed by one or more [-\s]\w+ groups. The engine will not be compelled to re-try failing matches.
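The difference between a capturing and a non-capturing group mentioned above shows up in what match() returns. This is a minimal sketch; the sample input is an assumption chosen for illustration.

```javascript
const capturing = /^\w+([-\s]\w+)*$/;
const nonCapturing = /^\w+(?:[-\s]\w+)*$/;

const input = 'aaa-bbb ccc';
const withGroup = input.match(capturing);
const withoutGroup = input.match(nonCapturing);

// With a capturing group the result has two entries: the full match and
// the text captured by the LAST repetition of the group (' ccc').
console.log(withGroup.length, JSON.stringify(withGroup[1]));

// With a non-capturing group only the full match is reported.
console.log(withoutGroup.length, JSON.stringify(withoutGroup[0]));
```

Both patterns accept and reject exactly the same strings; the group type only affects what is extracted.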

The final step is the actual solution to the problem; the others are just minor cleanup. The important thing is that the pattern is restricted and does not allow multiple ways to match a wrong input: [-\s] is mandatory, as is \w+ (at least one word character), so repeating the group (?:[-\s]\w+)* cannot produce overlapping matches. If we manually expand it to ([-\s]\w\w\w), ([-\s]\w\w)([-\s]\w), and ([-\s]\w)([-\s]\w\w), it becomes easy to see that these do not match the same inputs.
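One way to see why the mandatory separator removes the ambiguity: every separator marks exactly one cut point, so the input can only be divided in one way. The sketch below compares the regex with a hypothetical helper, isValidBySplitting, that cuts at each separator and checks every piece. The helper and the extra samples ('a--b' and the empty string) are illustrative assumptions.

```javascript
const fixedPattern = /^\w+(?:[-\s]\w+)*$/;

// Hypothetical equivalent check without a repeated group: cut at every
// separator character and require each piece to be a non-empty word.
function isValidBySplitting(str) {
  return str.split(/[-\s]/).every((piece) => /^\w+$/.test(piece));
}

const samples = [
  'a', 'abc', 'a-b', 'aaa aaa aaa aaa', 'aaa-aaa aaa-aaa',
  '-', 'a-', '-b', 'a--b', '',
];
const agreement = samples.map(
  (s) => fixedPattern.test(s) === isValidBySplitting(s)
);
console.log(agreement.every(Boolean)); // true

// A long failing input is rejected without the exponential blow-up.
const longFailing = 'a'.repeat(5000) + '-';
console.log(fixedPattern.test(longFailing)); // false
```

Because each character can belong to only one piece, a failed match leaves the engine with nothing to rearrange, and the work stays proportional to the input length.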

    const regex = /^\w+(?:[-\s]\w+)*$/;

    const valid = [
      'a',
      'abc',
      'a-b',
      'aaa aaa aaa aaa',
      'aaa-aaa aaa-aaa',
      'a'.repeat(100),
      `a-${'a'.repeat(100)}`,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}`,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}`,
      `a ${'a'.repeat(100)} ${'a'.repeat(100)}`,
      `a ${'a '.repeat(100)}a`,
    ];

    const invalid = [
      '-',
      'a-',
      '-b',
      'aaa aaa aaa aaa-',
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}-`,
      `a ${'a'.repeat(100)} ${'a'.repeat(100)} `,
      `a-${'-'.repeat(100)}`,
      `a ${' '.repeat(100)}`,
      `a-${'-'.repeat(100)}a`,
      `a ${'a '.repeat(100)}`,
      `-${'a'.repeat(100)}`,
      ` ${'a'.repeat(100)}`,
      `${'a'.repeat(100)}-`,
      `${'a'.repeat(100)} `,
      `a-${'a'.repeat(100)}-${'a'.repeat(100)}-`,
      `a-${'-'.repeat(100)}`,
      `a-${'a-'.repeat(100)}`,
      `-${'a'.repeat(100)}`,
      `${'a'.repeat(100)}-`,
    ];

    console.log('---- VALID ----');
    for (const s of valid) test(s);

    console.log('---- INVALID ----');
    for (const s of invalid) test(s);

    function test(str) {
      console.log(`${str} ===> ${regex.test(str)}`);
    }

Answer 3

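This works and avoids catastrophic backtracking. The separator is mandatory inside the repeated group, and the group is non-capturing:

^\w+(?:[-|\s]\w+)*$

Note that | inside a character class is a literal pipe, not alternation, so this pattern also accepts a|b. Use [-\s] if only hyphens and whitespace should separate words.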
