Regular Expression to find words separated with space, backtracking
I have to find words separated by a space. What is the best practice for doing this with the least backtracking?
I found this solution:
Regex: \d+\s([a-zA-Z]+\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence
So, "this is words" has to be matched by the regex ([a-zA-Z]+\s{0,1}){1,}, and "in a sentence" has to be matched by the constant words "in a sentence" in the regex.
But in this case the regex101.com debugger reports 4156 steps, and this is catastrophic backtracking. Is there any way to avoid it?
I have another, more complicated example, where it takes 86000 steps and does not validate.
The main problem is that I have to find all words separated by a space, but at the same time the regex contains words separated by a space (constants). This is where I get catastrophic backtracking.
I have to do this using Java.
You could try splitting the String into a String array, then count the members of the array that match your definition of a word, skipping any that do not (e.g. whitespace or punctuation).
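String[] mySplitString = myOriginalString.split(" ");
int words = 0; // running count of tokens that look like a word
for (int x = 0; x < mySplitString.length; x++) {
    // "\\w.*" is a placeholder: put your own regex for a word here
    if (mySplitString[x].matches("\\w.*")) words++;
}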
mySplitString is an array of Strings that have been split from the original string. The space separators are removed, and the substrings that were before, after, or between them are placed into the new String array. The for-loop runs through the split String array, checks that each array member contains a word (letters or digits at least once), and adds it to the total word count.
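The split-and-count approach described above can be sketched as a small self-contained class. The class name WordCounter, the method countWords, and the choice of [a-zA-Z]+ as the definition of a word are assumptions made for illustration; the split here is on runs of whitespace (\\s+) rather than a single space, so repeated spaces do not produce empty tokens.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class WordCounter {
    // Assumed definition of a "word": one or more ASCII letters.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static int countWords(String text) {
        int words = 0;
        // Split on runs of whitespace; no regex backtracking over the whole sentence is involved.
        for (String token : text.trim().split("\\s+")) {
            if (WORD.matcher(token).matches()) {
                words++;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "1234" is not a word under the assumed definition, the other six tokens are.
        System.out.println(countWords("1234 this is words in a sentence")); // prints 6
    }
}
```

Because each token is checked on its own, the cost grows linearly with the input length, which is why this sidesteps the backtracking problem entirely.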
You want to find words separated by a space, so you should require one or more spaces after each word (\s+ instead of \s{0,1}). You can use this instead, which takes just 37 steps.
\d+\s([a-zA-Z]+\s+)+in a sentence
See demo.
https://regex101.com/r/tD0dU9/4
For Java, double every backslash in the string literal, i.e. \d becomes \\d.
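As a rough sketch of how that pattern might look in Java, the class below uses the same regex with the backslashes doubled. The class name SentenceMatcher, the method variableWords, and the extra outer capturing group (added so the variable words can be read back in one piece) are assumptions for illustration, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class SentenceMatcher {
    // \d+\s([a-zA-Z]+\s+)+in a sentence, with every backslash doubled for the Java literal.
    // Assumption: the repeated group is made non-capturing and wrapped in one capturing
    // group so that group(1) holds all the variable words.
    private static final Pattern PATTERN =
            Pattern.compile("\\d+\\s((?:[a-zA-Z]+\\s+)+)in a sentence");

    // Returns the variable words before "in a sentence", or null if the input does not match.
    static String variableWords(String input) {
        Matcher m = PATTERN.matcher(input);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(variableWords("1234 this is words in a sentence")); // prints: this is words
        System.out.println(variableWords("1234 this is words"));               // prints: null
    }
}
```

Requiring at least one whitespace character after each word means a run of letters can only be consumed as a single word, so the engine no longer has many ways to split the same text between iterations of the group.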
If I understood it right, you want to match any word separated by a space, plus the phrase "in a sentence".
You can try the following solution:
(in a sentence)|(\S+)
As seen in this example on regex101, the regex matches in 61 steps. You might have problems with punctuation after the "in a sentence" phrase. Make some tests.
I hope I was helpful.
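For Java, the alternation above could be applied with Matcher.find() in a loop, roughly as sketched here. The class name TokenScanner and the method tokens are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TokenScanner {
    // The constant phrase is tried first; otherwise a run of non-whitespace is taken.
    private static final Pattern TOKEN = Pattern.compile("(in a sentence)|(\\S+)");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [1234, this, is, words, in a sentence]
        System.out.println(tokens("1234 this is words in a sentence"));
    }
}
```

Note that \S+ also accepts digits and punctuation, so tokens such as "1234" are returned too; filter them afterwards if only letters count as words.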