
Regular expression to find words separated by spaces, with minimal backtracking

I have to find words separated by spaces. What is the best practice for doing this with the least backtracking?

I found this solution:

Regex: \d+\s([a-zA-Z]+\\s{0,1}){1,} in a sentence
Input: 1234 this is words in a sentence

So the part "this is words" has to be checked by the regex fragment ([a-zA-Z]+\\s{0,1}){1,}, and the part "in a sentence" has to be checked by the constant words in a sentence that appear in the regex.

But in this case the regex101.com debugger reports 4156 steps, and this is catastrophic backtracking. Is there any way to avoid it?

I have another, more complicated example that takes 86000 steps and does not validate.

The main problem is that I have to find all words separated by spaces, but at the same time the regex itself contains words separated by spaces (constants). This is where I get catastrophic backtracking.

I have to do this in Java.

You could try splitting the String into a String array, then finding the size of the array after eliminating any members that do not match your definition of a word (e.g. whitespace or punctuation).

String[] mySplitString = myOriginalString.split(" ");
int words = 0; // running count of array members that look like words
for (int x = 0; x < mySplitString.length; x++) {
    // Replace "\\w.*" with your own regex for a word
    if (mySplitString[x].matches("\\w.*")) words++;
}

mySplitString is an array of Strings that have been split from the original string. The separating spaces are removed, and the substrings that were before, after, or in between them are placed into the new String array. The for-loop runs through the split String array, checks that each array member looks like a word (here: it starts with a word character, meaning a letter, digit, or underscore), and adds it to the total word count.
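A self-contained variant of the loop above might look like the following sketch. The class name WordCounter and method name countWords are hypothetical, and two assumptions differ from the snippet above: a word is taken to be letters only ([a-zA-Z]+, as in the question's regex), and the input is split on runs of whitespace so that consecutive spaces do not produce empty array members.

```java
final class WordCounter {
    // Hypothetical helper: counts the tokens that consist only of ASCII letters.
    static int countWords(String text) {
        int words = 0;
        // trim() avoids a leading empty token; "\\s+" treats a run of whitespace as one separator.
        for (String part : text.trim().split("\\s+")) {
            // Assumed definition of a word: one or more ASCII letters.
            if (part.matches("[a-zA-Z]+")) {
                words++;
            }
        }
        return words;
    }
}
```

Calling WordCounter.countWords("1234 this is words in a sentence") counts the six alphabetic tokens and skips 1234. Because each token is matched separately against a simple pattern, there is no nested quantifier for the regex engine to backtrack through.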

You want to find words separated by spaces, so you should require one or more spaces after each word. You can use this instead, which takes just 37 steps.

\d+\s([a-zA-Z]+\s+)+in a sentence

See demo.

https://regex101.com/r/tD0dU9/4

For Java, double every backslash in the string literal, i.e. \d becomes \\d.
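As a minimal sketch of how the regex from this answer could be used in Java, with every backslash doubled for the string literal (the class name SentenceValidator and method name isValid are hypothetical, chosen only for illustration):

```java
import java.util.regex.Pattern;

final class SentenceValidator {
    // \d+\s([a-zA-Z]+\s+)+in a sentence, with each backslash doubled for Java.
    private static final Pattern PATTERN =
            Pattern.compile("\\d+\\s([a-zA-Z]+\\s+)+in a sentence");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean isValid(String input) {
        return PATTERN.matcher(input).matches();
    }
}
```

With the question's input, SentenceValidator.isValid("1234 this is words in a sentence") returns true. Because \s+ makes the space after each word mandatory, the engine has only one way to divide a run of letters into words, which is what keeps the step count low.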

If I understood it right, you want to match any word separated by spaces, plus the phrase "in a sentence".

You can try the following solution:

(in a sentence)|(\S+)

As seen in this example on regex101: Example

The regex matches in 61 steps. You might have problems with punctuation after the phrase "in a sentence". Run some tests.

I hope this was helpful.
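One way to try this alternation in Java is to collect every match with Matcher.find(). This is only a sketch; the class name TokenScanner and method name tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenScanner {
    // The constant phrase is listed first so it wins over \S+ where both could match.
    private static final Pattern TOKEN = Pattern.compile("(in a sentence)|(\\S+)");

    // Hypothetical helper: returns each match in order of appearance.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

For the question's input, the constant phrase comes back as a single element and every other run of non-whitespace characters as its own element. Punctuation directly after the phrase, such as a trailing period, is returned as a separate \S+ match, which is the case worth testing.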
