简体   繁体   中英

Java - Remove duplicates from a string

I have a string with a list of values separated by semicolon. I need some optimal method to remove duplicates. I have following regular expression:

\b(\w+);(?=.*\b\1;?)

This works, but it fails when there are white spaces. For example aaa bbb;aaa bbb;aaa bbb creates aaa aaa aaa bbb instead of aaa bbb .

Probably simplest solution would be using Sets - collection which doesn't allow duplicates. Split your string on delimiter, and place it in set.

In Java 8 your code can look like:

String result = Stream.of(yourText.split(";"))          //stream of elements separated by ";"
                      .distinct()                       //remove duplicates in stream
                      .collect(Collectors.joining(";"));//create String joining rest of elements using ";"

Pre Java 8 solution can look like:

public String removeDuplicates(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));

    Iterator<String> it = elements.iterator();

    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }

    return sb.toString();
}

This can be implemented in multiple ways. As already mentioned previously, a HashSet is the right way to go. As you state that you need an "optimal" solution I took the time to optimize and benchmark several implementations.

We start with the pre-Java 8 solution by Pshemo:

public static String distinct0(String yourText) {
    Set<String> elements = new LinkedHashSet<>(Arrays.asList(yourText.split(";")));
    Iterator<String> it = elements.iterator();
    StringBuilder sb = new StringBuilder(it.hasNext() ? it.next() : "");
    while (it.hasNext()) {
        sb.append(';').append(it.next());
    }
    return sb.toString();
}

This implementation uses String.split() which creates an array of Strings. This array is then converted to a List, which is added to a LinkedHashSet. The LinkedHashSet preserves the order in which elements have been added by maintaining an additional linked list. Next, an iterator is used to enumerate the elements from the set, which are then concatenated with a StringBuilder.

We can slightly optimize this method by realizing that we can already build the result while iterating over the individual elements in the input string. It is thus not necessary to store information about the order in which distinct strings have been found. This eliminates the need for a LinkedHashSet (and the Iterator):

public static String distinct1(String elements){
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    for (String value : elements.split(";")) {
        if (set.add(value)) {
            builder.append(set.size() != 1 ? ";" : "").append(value);
        }
    }
    return builder.toString();
}

Next, we can get rid of String.split() and thus avoid creating an intermediate array containing all elements from the input string:

public static String distinct2(String elements){

    char[] array = elements.toCharArray();
    StringBuilder builder = new StringBuilder();
    Set<String> set = new HashSet<String>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            String value = new String(array, last, (index-last));
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(value);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

Finally, we can get rid of unneccessary memory allocations by not constructing String objects for the individual elements, as the constructor String(array, offset, length) (which is also used by String.split()) will call Arrays.copyOfRange(...) to allocate a new char[]. To avoid this overhead, we can implement a wrapper around the input char[] which implements hashCode() and equals() for a given range. This can be used to detect wether a certain string is already contained in the result. Additionally, this method allows us to use StringBuilder.append(array, offset, length), which simply reads data from the provided array:

public static String distinct3(String elements){

    // Prepare
    final char[] array = elements.toCharArray();
    class Entry {
        final int from;
        final int to;
        final int hash;

        public Entry(int from, int to) {
            this.from = from;
            this.to = to;
            int hash = 0;
            for (int i = from; i < to; i++) {
                hash = 31 * hash + array[i];
            }
            this.hash = hash;
        }

        @Override
        public boolean equals(Object object) {
            Entry other = (Entry)object;
            if (other.to - other.from != this.to - this.from) {
                return false;
            }
            for (int i=0; i < this.to - this.from; i++) {
                if (array[this.from + i] != array[other.from + i]) {
                    return false;
                }
            }
            return true;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Remove duplicates
    StringBuilder builder = new StringBuilder();
    Set<Entry> set = new HashSet<Entry>();
    int last = 0;
    for (int index=0; index<array.length; index++) {
        if (array[index] == ';') {
            Entry value = new Entry(last, index);
            if (set.add(value)) {
                builder.append(last != 0 ? ";" : "").append(array, last, index-last);
            }
            last = index + 1;
        }
    }
    return builder.toString();
}

I compared these implementations with the following code:

public static void main(String[] args) {

    int REPETITIONS = 10000000;
    String VALUE = ";aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb;"+
                   "aaa bbb;aaa bbb;aaa bbb;aaa;bbb;aaa bbb;aaa bbb;aaa bbb;aaa bbb";

    long time = System.currentTimeMillis();
    String result = null;
    for (int i = 0; i < REPETITIONS; i++) {
        result = distinct0(VALUE);
    }
    System.out.println(result + " - " + (double) (System.currentTimeMillis() - time) /
                                        (double) REPETITIONS + " [ms] per call");
}

Which gave me the following results when running it on my machine with JDK 1.7.0_51:

  • distinct0: 0.0021881 [ms] per call
  • distinct1: 0.0018433 [ms] per call
  • distinct2: 0.0016780 [ms] per call
  • distinct3: 0.0012777 [ms] per call

While being undoubtedly much more complex and much less readable than the original version, the optimized implementation is almost twice as fast. If a simple and readable solution is needed, I would choose either the first or the second implementation, if a fast one is needed, I would choose the last implementation.

You can use

(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)

See demo

Replacing aaa bbb;aaa bbb;aaa bbb with space results in aaa bbb .

All the mutliple consecutive ; s will have to be replaced with 2 post-processing steps:

  • .replaceAll("^;+|;+$", "") - removes the leading/trailing semi-colons
  • .replaceAll(";+",";") - merges all multiple ; into 1 ; .

Here is the final IDEONE demo :

String s = "ccc;aaa bbb;aaa bbb;bbb";
s = s.replaceAll("(?<=^|;)([^;]+)(?=(?:;\\1(?:$|;))+)", "").replaceAll("^;+|;+$", "").replaceAll(";+",";");
System.out.println(s); 

If optimal method == lesser computational complexity then

Parse the string from the begining, value by value, and create a parallel HashSet with the values you found. When a value exists in the set, you ignore it and go to the next. If a value does not exist in the set, emit it and add to the set.

Find and add at a HashSet are O(1) operations so this algorithm should be O(n).

It is also O(n) on memory consumption, it could be something to consider depending on the input.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM