
Java - Remove duplicates from a string

I have a string containing a list of values separated by semicolons. I need an efficient way to remove the duplicates. I have the following regular expression:

\b(\w+);(?=.*\b\1;?)

This works, but it fails when the values contain whitespace. For example, aaa bbb;aaa bbb;aaa bbb becomes aaa aaa aaa bbb instead of aaa bbb.
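The failure described above can be reproduced with a small sketch. The class and method names are hypothetical and only wrap the regular expression from the question. Because \w+ cannot match a space, the pattern treats each word as its own value, so only the repeated bbb; fragments are removed.

```java
class RegexWhitespaceDemo {

    // Applies the regular expression from the question and deletes every match.
    static String dedupe(String text) {
        return text.replaceAll("\\b(\\w+);(?=.*\\b\\1;?)", "");
    }

    public static void main(String[] args) {
        // "\w+" stops at the space, so "aaa bbb" is never seen as one value.
        System.out.println(dedupe("aaa bbb;aaa bbb;aaa bbb")); // prints "aaa aaa aaa bbb"
    }
}
```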

Probably the simplest solution is to use a Set, a collection that does not allow duplicates. Split your string on the delimiter and place the parts in a set.

In Java 8 your code can look like this:

String result = Stream.of(yourText.split(";"))          //stream of elements separated by ";"
                      .distinct()                       //remove duplicates in stream
                      .collect(Collectors.joining(";"));//create String joining rest of elements using ";"

A pre-Java 8 solution can look like this:

public String removeDuplicates(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));

    Iterator<String> it = elements.iterator();

    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }

    return sb.toString();
}
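For convenience, here is a self-contained sketch that puts both approaches into one runnable class with the imports they need. The class name, method names, and sample input are illustrative assumptions, and String.join (Java 8) is used only to keep the set-based variant short.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

class RemoveDuplicatesDemo {

    // Stream variant: distinct() keeps the first occurrence of each element
    // and preserves encounter order for an ordered stream.
    static String withStream(String text) {
        return Arrays.stream(text.split(";"))
                     .distinct()
                     .collect(Collectors.joining(";"));
    }

    // Set variant: LinkedHashSet drops repeats and remembers insertion order.
    static String withSet(String text) {
        Set<String> elements = new LinkedHashSet<>(Arrays.asList(text.split(";")));
        return String.join(";", elements);
    }

    public static void main(String[] args) {
        String input = "aaa bbb;ccc;aaa bbb;ccc;ddd"; // sample data, not from the question
        System.out.println(withStream(input)); // prints "aaa bbb;ccc;ddd"
        System.out.println(withSet(input));    // prints "aaa bbb;ccc;ddd"
    }
}
```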

This can be implemented in multiple ways. As mentioned previously, a HashSet is the right way to go. Since you state that you need an "optimal" solution, I took the time to optimize and benchmark several implementations.

We start with the pre-Java 8 solution by Pshemo:

public static String distinct0(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));
    Iterator<String> it = elements.iterator();
    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }
    return sb.toString();
}

This implementation uses String.split(), which creates an array of Strings. This array is then converted to a List, which is added to a LinkedHashSet. The LinkedHashSet preserves the order in which elements were added by maintaining an additional linked list. Next, an iterator is used to enumerate the elements of the set, which are then concatenated with a StringBuilder.
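The ordering guarantee is the reason a LinkedHashSet is used here. A minimal sketch, with a hypothetical class and method name, shows that the distinct elements come back in the order of their first appearance. A plain HashSet gives no such guarantee about iteration order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class InsertionOrderDemo {

    // Returns the distinct elements in order of first appearance.
    static List<String> distinctInOrder(String text) {
        return new ArrayList<>(new LinkedHashSet<>(Arrays.asList(text.split(";"))));
    }

    public static void main(String[] args) {
        System.out.println(distinctInOrder("b;a;b;c;a")); // prints "[b, a, c]"
    }
}
```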

We can slightly optimize this method by realizing that we can build the result while iterating over the individual elements of the input string. It is thus not necessary to store the order in which distinct strings were found. This eliminates the need for a LinkedHashSet (and the Iterator):

public static String distinct1(String elements){
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    for (String value : elements.split(";")) {
        if (set.add(value)) {
            builder.append(set.size() != 1 ? ";" : "").append(value);
        }
    }
    return builder.toString();
}

Next, we can get rid of String.split() and thus avoid creating an intermediate array containing all elements of the input string:

public static String distinct2(String elements){

    char[] array = elements.toCharArray();
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    int last = 0;
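    // Run one position past the end so the final element, which has no trailing ';', is handled too
    for (int index=0; index<=array.length; index++) {
        if (index == array.length || array[index] == ';') {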
            String value = new String(array, last, (index-last));
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(value);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

Finally, we can get rid of unnecessary memory allocations by not constructing String objects for the individual elements, as the constructor String(array, offset, length) (which is also used by String.split()) calls Arrays.copyOfRange(...) to allocate a new char[]. To avoid this overhead, we can implement a wrapper around the input char[] which implements hashCode() and equals() for a given range. This can be used to detect whether a certain string is already contained in the result. Additionally, this method allows us to use StringBuilder.append(array, offset, length), which simply reads data from the provided array:

public static String distinct3(String elements){

    // Prepare
    final char[] array = elements.toCharArray();
    class Entry {
        final int from;
        final int to;
        final int hash;

        public Entry(int from, int to) {
            this.from = from;
            this.to = to;
            int hash = 0;
            for (int i = from; i < to; i++) {
                hash = 31 * hash + array[i];
            }
            this.hash = hash;
        }

        @Override
        public boolean equals(Object object) {
            Entry other = (Entry)object;
            if (other.to - other.from != this.to - this.from) {
                return false;
            }
            for (int i=0; i < this.to - this.from; i++) {
                if (array[this.from + i] != array[other.from + i]) {
                    return false;
                }
            }
            return true;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Remove duplicates
    StringBuilder builder = new StringBuilder();
    Set<Entry> set = new HashSet<Entry>();
    int last = 0;
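    // Run one position past the end so the final element, which has no trailing ';', is handled too
    for (int index=0; index<=array.length; index++) {
        if (index == array.length || array[index] == ';') {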
            Entry value = new Entry(last, index);
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(array, last, index-last);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

I compared these implementations with the following code:

public static void main(String[] args) {

    int REPETITIONS = 10000000;
    String VALUE = ";aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;aaa bbb;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb";

    long time = System.currentTimeMillis();
    String result = null;
    for (int i = 0; i < REPETITIONS; i++) {
        result = distinct0(VALUE);
    }
    System.out.println(result + " - " + (double) (System.currentTimeMillis() - time) /
                                        (double) REPETITIONS + " [ms] per call");
}

This gave me the following results when run on my machine with JDK 1.7.0_51:

  • distinct0: 0.0021881 [ms] per call
  • distinct1: 0.0018433 [ms] per call
  • distinct2: 0.0016780 [ms] per call
  • distinct3: 0.0012777 [ms] per call

While undoubtedly much more complex and much less readable than the original version, the optimized implementation is roughly 1.7 times as fast (0.0012777 ms versus 0.0021881 ms per call). If a simple and readable solution is needed, I would choose either the first or the second implementation; if a fast one is needed, I would choose the last.

You can use

(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)

See demo

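Replacing every match in aaa bbb;aaa bbb;aaa bbb with an empty string leaves ;;aaa bbb, which becomes aaa bbb after the clean-up steps below. Note that the lookahead requires the duplicate to follow immediately, so this pattern only removes adjacent duplicates.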

Any runs of consecutive semicolons left behind have to be cleaned up with two post-processing steps:

  • .replaceAll("^;+|;+$", "") - removes the leading/trailing semicolons
  • .replaceAll(";+",";") - merges each run of multiple ; into a single ;

Here is the final IDEONE demo:

String s = "ccc;aaa bbb;aaa bbb;bbb";
s = s.replaceAll("(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)", "").replaceAll("^;+|;+$", "").replaceAll(";+",";");
System.out.println(s); 
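As a hedged illustration of what this pattern does and does not cover, the sketch below wraps the three replaceAll calls in a helper (the class and method names are made up for this example). It shows that duplicates standing next to each other are removed, while a duplicate separated by another value is left in place.

```java
class AdjacentDuplicatesRegexDemo {

    // Same three steps as above: drop values that are immediately repeated,
    // trim leading/trailing semicolons, then collapse runs of semicolons.
    static String dedupeAdjacent(String s) {
        return s.replaceAll("(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)", "")
                .replaceAll("^;+|;+$", "")
                .replaceAll(";+", ";");
    }

    public static void main(String[] args) {
        System.out.println(dedupeAdjacent("ccc;aaa bbb;aaa bbb;bbb")); // prints "ccc;aaa bbb;bbb"
        System.out.println(dedupeAdjacent("aaa bbb;aaa bbb;aaa bbb")); // prints "aaa bbb"
        // The two "a" values are not adjacent, so nothing is removed here.
        System.out.println(dedupeAdjacent("a;b;a"));                   // prints "a;b;a"
    }
}
```

If duplicates can appear anywhere in the list, a set-based approach is the safer choice.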

If "optimal" means lower computational complexity, then:

Parse the string from the beginning, value by value, and maintain a HashSet of the values you have found so far. When a value already exists in the set, ignore it and move on to the next. If a value does not exist in the set, emit it and add it to the set.

Lookup and insertion in a HashSet are O(1) operations on average, so this algorithm should be O(n).

It is also O(n) in memory consumption, which may be something to consider depending on the input.
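The steps above can be sketched as follows. This is only one possible reading of the description, with hypothetical class and method names: it scans the string once using indexOf, keeps a HashSet of the values seen so far, and emits each value the first time it appears. Unlike String.split(), it treats an empty value, such as the one after a trailing semicolon, as a value in its own right.

```java
import java.util.HashSet;
import java.util.Set;

class SinglePassDedup {

    static String removeDuplicates(String text) {
        Set<String> seen = new HashSet<>();
        StringBuilder out = new StringBuilder(text.length());
        int start = 0;
        while (start <= text.length()) {
            // Find the end of the current value; the last value ends at the string end.
            int end = text.indexOf(';', start);
            if (end < 0) {
                end = text.length();
            }
            String value = text.substring(start, end);
            // add() returns false when the value is already in the set.
            if (seen.add(value)) {
                if (seen.size() > 1) {
                    out.append(';');
                }
                out.append(value);
            }
            start = end + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("aaa bbb;aaa bbb;aaa bbb")); // prints "aaa bbb"
        System.out.println(removeDuplicates("a;b;a;c"));                 // prints "a;b;c"
    }
}
```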
