
Removing duplicate words from a string

I have strings like:

Hello how how how are are you you?

I love cookies cookies, apples and pancakes pancakes.

The output I want is:

Hello how are you?

I love cookies, apples and pancakes.

So far I have written:

String[] s = input.split(" ");
String prev = s[0];
String ans = prev + " ";

for (int i = 1; i < s.length; i++) {

    if (!prev.equals(s[i])) {
        prev = s[i];
        ans += prev + " ";
    }
}

System.out.println(ans);

The output I get is:

Hello how are you you?

I love cookies cookies, apples and pancakes pancakes.

I need some help with the logic for handling punctuation such as , . ! ?

You can use a regex to do this for you. Sample code:

String regex = "\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)";
input = input.replaceAll(regex,"");
  1. \\b matches a word boundary: the position between a word character and a non-word character, or the start or end of the string.
  2. \\w matches any word character (alphanumeric and underscore); (\\w+) captures one or more of them as group #1.
  3. \\b matches the word boundary at the end of the captured word.
  4. \\s matches any whitespace character (spaces, tabs, line breaks).
  5. * matches 0 or more of the preceding token.
  6. (?= starts a positive lookahead: it requires a match after the main expression without including it in the result.
  7. . matches any character except line breaks.
  8. \\1 matches the same text as capture group #1 from step 2.

Note: It is important to use word boundaries here to avoid matching partial words.

Here's a link to a regex demo and explanation: RegexDemo
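To make the behaviour of the pattern above concrete, here is a minimal runnable sketch. The class name `DuplicateWordsRegex` and both method names are hypothetical. The second pattern is not from the answer: it is an assumed variant that collapses only adjacent repeats, since the lookahead in the original pattern removes a word whenever it reappears anywhere later on the same line.

```java
public class DuplicateWordsRegex {

    // Pattern quoted in the answer above: removes a word (and the
    // whitespace after it) when the same word occurs later in the line.
    static String removeIfRepeatedLater(String input) {
        return input.replaceAll("\\b(\\w+)\\b\\s*(?=.*\\b\\1\\b)", "");
    }

    // Assumed variant, not from the answer: collapses a run of the same
    // word separated by whitespace into its first copy.
    static String collapseAdjacent(String input) {
        return input.replaceAll("\\b(\\w+)(?:\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        // Both patterns give the desired result on the question's inputs
        System.out.println(removeIfRepeatedLater("Hello how how how are are you you?"));
        System.out.println(collapseAdjacent("I love cookies cookies, apples and pancakes pancakes."));

        // Non-adjacent repeats show where the two patterns differ
        System.out.println(removeIfRepeatedLater("the cat and the dog")); // cat and the dog
        System.out.println(collapseAdjacent("the cat and the dog"));      // the cat and the dog
    }
}
```

If only consecutive duplicates should go, as in the question's examples, the second pattern is probably the safer choice; the first one suits text where any later repeat should be dropped.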

You should use auxiliary variables that hold each word without its punctuation, and compare those instead of the raw tokens.

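String[] s = input.split(" ");
StringBuilder ans = new StringBuilder();

for (int i = 0; i < s.length - 1; i++) {

    // Strip punctuation so that "you" and "you?" compare as equal
    String currentAux = s[i].replaceAll("[,.!?]", "");
    String nextAux = s[i + 1].replaceAll("[,.!?]", "");

    // Skip a word that is repeated by the next one; the last copy,
    // which may carry the punctuation, is the one that is kept
    if (nextAux.equals(currentAux)) {
        continue;
    }

    ans.append(s[i]).append(" ");
}

// The last word has no successor, so it is always kept
ans.append(s[s.length - 1]);

System.out.println(ans);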
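As a usage sketch, the same idea can be wrapped in a method and run against both inputs from the question. The class name `AdjacentDuplicateWords` and the method name `removeAdjacentDuplicates` are hypothetical, and the punctuation set `[,.!?]` is taken from the snippet above.

```java
public class AdjacentDuplicateWords {

    // Keeps a word only if the next word, compared without punctuation,
    // is different. The last copy of a run is kept, so trailing
    // punctuation such as "you?" or "cookies," survives.
    static String removeAdjacentDuplicates(String input) {
        String[] words = input.split(" ");
        StringBuilder result = new StringBuilder();

        for (int i = 0; i < words.length; i++) {
            String current = words[i].replaceAll("[,.!?]", "");
            boolean repeatedByNext = i + 1 < words.length
                    && current.equals(words[i + 1].replaceAll("[,.!?]", ""));
            if (repeatedByNext) {
                continue;
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(words[i]);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAdjacentDuplicates("Hello how how how are are you you?"));
        // Hello how are you?
        System.out.println(removeAdjacentDuplicates("I love cookies cookies, apples and pancakes pancakes."));
        // I love cookies, apples and pancakes.
    }
}
```

One limitation of this approach: if the earlier copy carries the punctuation (for example "cookies, cookies"), the punctuation is dropped along with it.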

You can use java.util.StringTokenizer to tokenize the words. Make sure to set the delimiters that split the words; in your case they are spaces, commas and full stops. This lets you split out the words without the punctuation marks. Then compare the previous token with the current one and, if they are equal, ignore the current one.

You can try this code snippet:

import java.util.StringTokenizer;

String s = "I love cookies cookies, apples and pancakes pancakes.";

// returnDelims = true returns spaces and punctuation as separate tokens
StringTokenizer tokenizer = new StringTokenizer(s, " ,.!?", true);

StringBuilder result = new StringBuilder();
String prevToken = null;    // last word or punctuation mark that was kept
String pendingSpace = "";   // space held back until the next token is kept

while (tokenizer.hasMoreTokens()) {

    String currentToken = tokenizer.nextToken();

    if (currentToken.equals(" ")) {
        pendingSpace = " ";
        continue;
    }

    if (currentToken.equals(prevToken)) {
        // Drop the duplicate together with the space in front of it
        pendingSpace = "";
        continue;
    }

    result.append(pendingSpace).append(currentToken);
    pendingSpace = "";
    prevToken = currentToken;
}

String duplicateRemovedString = result.toString();
