
Use regular expressions to find any number of repeated words using Java

This is not homework. I'm just trying to learn/get better at regular expressions.

I'm trying to find 1 or more repeated words in a string. Actually, I'm trying to find 1 or more repeated words in a string and remove the repeats. I've looked at link1 and link2 and tried using their pattern(s), but they don't seem to work for me.

Here is what I have:

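import java.util.regex.Matcher;
import java.util.regex.Pattern;

String pattern = "\\b(\\w+)\\b\\s+\\1\\b";
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
// This is actually read from the console
String input = "Goodbye bye bye world world world";
Matcher m = p.matcher(input);
while (m.find())
{
    // Print the whole match, its position, and the captured word
    System.out.println("group: " + m.group() + " start: " + m.start() + " end: " + m.end() + " group(1): " + m.group(1));
    // Replace the matched pair with a single copy of the word
    input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);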

And this is my output:
group: bye bye start: 8 end: 15 group(1): bye
group: world world start: 16 end: 27 group(1): world
Goodbye bye world world

What I'm expecting for the 2nd line of output is "group: world world world start: 16 end: 32".

So, to me, it seems like this is matching only the first repeated word. My understanding of the pattern is: \\b is a word boundary; \\w+ is one or more of the word (I'm not sure if it's the word repeated WITHOUT a space, i.e. 'wordword', or one or more of the word repeated WITH a space, i.e. 'word word'); then \\b\\s+ is followed by any white space; \\1 is the grouped word; and finally \\b is white space again.

Can someone explain to me what's going on and what it should be?

Thanks!

You are mostly right in your understanding of the regex, except that the regex only checks for two words in a row, not two or more words in a row.

To check for two or more words, group the second part of your regex and put a plus after it, so the word can be repeated more than twice, like this:

\\b(\\w+)\\b(\\s+\\1\\b)+
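
As a rough illustration, here is a minimal sketch of how that pattern might be used on the sample input from the question. The class name RepeatedWords and the helper removeRepeats are hypothetical names chosen for this example, and replacing each match with "$1" is just one possible way to drop the repeats, not necessarily what the answer had in mind. Note that \\w+ matches one or more word characters (a single word), and \\b is a zero-width word boundary rather than white space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWords {
    // Pattern from the answer: a word followed by one or more
    // whitespace-separated repeats of that same word.
    private static final Pattern REPEATS =
            Pattern.compile("\\b(\\w+)\\b(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collapses each run of repeats to the first word.
    static String removeRepeats(String input) {
        // "$1" refers to capture group 1, the first occurrence of the word.
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Goodbye bye bye world world world";
        Matcher m = REPEATS.matcher(input);
        while (m.find()) {
            System.out.println("group: " + m.group()
                    + " start: " + m.start() + " end: " + m.end());
        }
        // Do the replacement in one pass instead of mutating the
        // string while the matcher is still iterating over it.
        System.out.println(removeRepeats(input));
    }
}
```

With this input, the second match should cover all three copies of "world", starting at 16 and ending at 33 (the end index is exclusive and the string is 33 characters long), and the final line printed should be "Goodbye bye world". "Goodbye" is left alone because it is a different word from "bye".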
