
Fastest and most efficient way to remove duplicates in Java

I want to remove duplicate values from my data. I know this question comes up frequently on Stack Overflow, but my case is a little different because I am handling a very large amount of data, so execution time is my main concern.

As shown in the snippet below, I wrote some simple code for removing duplicate values.

// Suppose that the dataset is very huge so that
// multi-node resources should be necessary.    
String[] data = new String[10_000_000];

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}

However, I don't like it, because I suspect that other data structures or algorithms could remove duplicates more efficiently than my code does.

So I am looking for better ways to quickly remove duplicate values when the data is large. I would appreciate it if you could let me know the fastest way of removing duplicate values.

I am also wondering whether the number of duplicate values affects performance. For example, if 50% of the original data is duplicated, would the choice of the best algorithm and data structure change? If so, I want an approach that performs well in the general case.

Convert your uniqueItems to a HashSet<String> and your for loop to simply:

uniqueItems.add(data[i]);

If add returns true then you have inserted a unique string; it returns false for a duplicate.

Both algorithms should run in O(n) time in the best case, but using a HashMap when you don't care about the value (for a given key) is silly and wastes resources. A HashSet is a better fit for such cases.
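
For reference, a minimal sketch of the converted loop might look like this (my illustration, reusing the data array from the question):

import java.util.HashSet;
import java.util.Set;

Set<String> uniqueItems = new HashSet<>();

for (String value : data) {
    // add() returns false when the value was already present,
    // so you can react to duplicates here if you need to
    uniqueItems.add(value);
}
// uniqueItems now holds each distinct value exactly once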

You could also try a TreeSet<String> to see what works best for your particular dataset. It will probably be worse, given the new JDK 8 HashSet implementation: overcrowded buckets are automatically stored as mini tree sets, providing competitive performance even when the hashing function behaves badly. (This optimization is only possible for Comparable types such as String.)


Brute force searching arrays. In a simple, array-based algorithm, where you search the entire array before inserting each element, you would get a very bad O(n²) performance.
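
As an illustration (my sketch, not part of the original answer), the brute-force approach could look like this; each contains() call scans the list, which is what produces the quadratic behaviour:

import java.util.ArrayList;
import java.util.List;

List<String> unique = new ArrayList<>();

for (String value : data) {
    // contains() is a linear scan, so the whole loop is O(n^2)
    if (!unique.contains(value)) {
        unique.add(value);
    }
}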

As such, you might be tempted to sort your data first, placing duplicate elements next to each other. This buys you a faster O(n log n) performance, but it is still behind the HashMap/HashSet version in the general case.
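
Here is a sketch of that sort-then-scan approach (my illustration, assuming the data array contains no nulls):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

String[] sorted = data.clone();
Arrays.sort(sorted);  // O(n log n); throws if the array contains nulls

List<String> unique = new ArrayList<>();
for (int i = 0; i < sorted.length; i++) {
    // after sorting, duplicates are adjacent, so comparing with the
    // previous element is enough to skip them
    if (i == 0 || !sorted[i].equals(sorted[i - 1])) {
        unique.add(sorted[i]);
    }
}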


Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once. As such, the current O(n) time complexity is really the best you can do here.

Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.

In your example the data[i] values are used as keys in the HashMap uniqueItems.

A HashMap will always have unique keys. An existing key's value is simply overwritten by a put() operation, so you don't need containsKey() at all when adding an element.

Why do you remove and re-insert an existing key?
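
In other words (my sketch, keeping the HashMap from the question), the loop can be reduced to a single put() per element:

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    // put() overwrites any existing entry and returns the previous value
    // (or null), so containsKey() and remove() are unnecessary here
    uniqueItems.put(data[i], "inserted");
}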
