Pattern, Matcher in Java, regex help

Question

I'm trying to remove duplicate consecutive words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> that holds each line read in from the txt file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see whether I could create a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

The text file contains: This is an example example test test test.

Why is my output "This is"?

Edit:

Suppose I have an ArrayList lineOfWords that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    s = s.replaceAll( code from Kobi here);
    newString.add(s);
}

But then it doesn't give me the new s, only the original s. Is it because of shallow vs. deep copy?

Answer 1

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stronger than yours: it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \\b(\\w+)\\b , and then attempts to match spaces and repetitions of that word: (\\s+\\1)+ . The final \\b is to avoid partial matching of \\1 , as in "for formatting" .
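
As a minimal sketch of how this replacement behaves on the sample line from the question, the following applies the regex to single lines and to a list of lines. The class name DuplicateWords and the method names are illustrative only and do not come from the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DuplicateWords {
    // Regex from the answer: a whole word followed by one or more repetitions of itself.
    private static final String DUPLICATES = "\\b(\\w+)\\b(\\s+\\1)+\\b";

    // Collapses each run of repeated words into a single occurrence.
    // replaceAll returns a new String; the argument itself is never modified.
    static String removeDuplicateWords(String line) {
        return line.replaceAll(DUPLICATES, "$1");
    }

    // Applies the replacement to every line and collects the results in a new list.
    static List<String> cleanLines(List<String> lines) {
        List<String> cleaned = new ArrayList<>();
        for (String line : lines) {
            cleaned.add(removeDuplicateWords(line));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Sample line from the question.
        System.out.println(removeDuplicateWords("This is an example example test test test."));
        // prints: This is an example test.

        // The trailing \b keeps "for formatting" from being treated as a repetition.
        System.out.println(removeDuplicateWords("for formatting"));
        // prints: for formatting

        System.out.println(cleanLines(Arrays.asList("a a b", "c d d d")));
        // prints: [a b, c d]
    }
}
```

Note that this match is case-sensitive, so "The the" is left alone. If that matters, prefixing the pattern with (?i) is one option, since Java then compares the backreference case-insensitively as well.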

Answer 2

The first match is "Th IS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
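
To make the offsets concrete, here is a small sketch that runs the question's pattern against the sample line and prints where the first match falls. The class name FirstMatchDemo and the helper firstMatch are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchDemo {
    // Pattern from the question: a run of word characters, a space, then the same run again.
    static final Pattern QUESTION_PATTERN = Pattern.compile("(\\w+) \\1");

    // Returns a Matcher positioned on the first match in the upper-cased line,
    // or null when nothing matches.
    static Matcher firstMatch(String line) {
        Matcher m = QUESTION_PATTERN.matcher(line.toUpperCase());
        return m.find() ? m : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        Matcher m = firstMatch(s);
        if (m != null) {
            // The match is the "is" inside "This" plus the following word "is".
            System.out.println(m.start() + " " + m.end());        // prints: 2 7
            // Using 0 as the start index reproduces the output seen in the question.
            System.out.println(s.substring(0, m.end()));          // prints: This is
            // Using m.start() shows only the matched text.
            System.out.println(s.substring(m.start(), m.end()));  // prints: is is
        }
    }
}
```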

To improve your regex, use \\b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b) . Otherwise, as you're seeing, you'll get matches inside of words.
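
The effect of the word boundaries can be checked with a short comparison. This is only a sketch: the class name BoundaryDemo and the helper firstMatchText are hypothetical, and the input is a shortened version of the sample line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatchText(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example";

        // Without boundaries, the tail of "This" pairs up with the word "is".
        System.out.println(firstMatchText("(\\w+) \\1", s));
        // prints: is is

        // With boundaries around the captured word, only whole words can start a match.
        System.out.println(firstMatchText("(\\b\\w+\\b) \\1", s));
        // prints: example example
    }
}
```

The boundaries here only guard the captured word. A repetition that is merely the start of a longer word, such as "for formatting", still matches unless another \\b follows the \\1.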

Answer 1: score 3, accepted, 2010-08-04 04:52:57
Answer 2: score 1, 2010-08-04 04:51:31