
Pattern, matcher in Java, REGEX help

I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read in from the txt file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see if I could create a new string, or remove the item(s) where the matches are, but I wasn't sure how it works after reading the documentation. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

on a text file that contains: This is an example example test test test.

Why is my output "This is"?

Edit:

I have an ArrayList, lineOfWords, that holds each line read from the .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1"); // regex from Kobi's answer
    newString.add(s);
}

But then it gives me the original s, not the new s. Is it because of shallow vs. deep copy?

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \\b(\\w+)\\b , and then attempts to match spaces and repetitions of that word: (\\s+\\1)+ . The final \\b is to avoid partial matching of \\1 , as in "for formatting" .
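As a quick check of that regex, here is a minimal sketch that applies it to the sample sentence from the question. The class name DedupeWords and the helper collapseRepeats are made up for illustration; only the regex and the replacement string come from the answer.

```java
public class DedupeWords {
    // Hypothetical helper: collapses each run of a repeated word
    // (separated by whitespace) into a single occurrence of that word.
    static String collapseRepeats(String line) {
        return line.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        // Sample line from the question.
        System.out.println(collapseRepeats("This is an example example test test test."));
        // prints: This is an example test.

        // The trailing \b stops "for" from matching the start of "formatting".
        System.out.println(collapseRepeats("for formatting"));
        // prints: for formatting
    }
}
```

Note that the "is" inside "This" is left alone here, because the leading \\b requires the captured word to start at a word boundary.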

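The first match is the "is is" inside "This is an example...", so m.end() points to the end of the second "is". Since i is 0, s.substring(i, m.end()) prints everything from the start of the line up to that point, which is "This is". I'm not sure why you use i for the start index; try m.start() instead.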
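To see where that first match sits, a small sketch along these lines prints the match offsets for the sample sentence. The class and method names are hypothetical; the pattern and the toUpperCase() call are the ones from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchDemo {
    // Hypothetical helper: returns "start,end,group" for the first match
    // of the question's pattern, or "none" if nothing matches.
    static String describeFirstMatch(String s) {
        Matcher m = Pattern.compile("(\\w+) \\1").matcher(s.toUpperCase());
        if (!m.find()) {
            return "none";
        }
        return m.start() + "," + m.end() + "," + m.group();
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        System.out.println(describeFirstMatch(s)); // prints: 2,7,IS IS
        System.out.println(s.substring(0, 7));     // prints: This is
        System.out.println(s.substring(2, 7));     // prints: is is
    }
}
```

Using m.start() as the first argument to substring gives just the matched text, while a fixed start of 0 gives the whole prefix of the line.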

To improve your regex, use \\b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b) . Otherwise, as you're seeing, you'll get matches inside of words.
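The effect of the word boundaries can be compared side by side. The sketch below is only an illustration: allMatches is a made-up helper that lists every match of a pattern, run once with the original pattern and once with the boundary version suggested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: collects every substring matched by regex, in order.
    static List<String> allMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "This is an example example test test test.";

        // Without boundaries, the "is" inside "This" pairs up with the next "is".
        System.out.println(allMatches("(\\w+) \\1", text));
        // prints: [is is, example example, test test]

        // With boundaries around the captured word, only whole words are captured.
        System.out.println(allMatches("(\\b\\w+\\b) \\1", text));
        // prints: [example example, test test]
    }
}
```

Both patterns match only one pair at a time, so the third "test" is left over; handling a run of three or more repetitions needs a repeated group.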
