
Fastest and most efficient way to remove duplicates in Java

I want to remove duplicate values from my data. I know this question comes up frequently on Stack Overflow, but my case is a little different because I am handling a very large amount of data, so execution time is my main concern.

As shown in the snippet below, I wrote some simple code for removing duplicate values.

// Suppose that the dataset is very huge so that
// multi-node resources should be necessary.    
String[] data = new String[10_000_000];

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}

However, I don't like it, because I suspect that other data structures or algorithms could remove duplicates more efficiently than my code does.

So I am looking for better ways to quickly remove duplicate values when the data is large. I would appreciate it if you could let me know the fastest way of removing duplicate values.

I am also wondering whether the number of duplicate values affects performance. For example, if 50% of the original data is duplicated, would the choice of the best algorithm and data structure change? If so, I want an approach that performs well in the general case.

Convert your uniqueItems to a HashSet<String> and your for loop to simply:

uniqueItems.add(data[i]);

If add returns true then you have inserted a unique string; it returns false for a duplicate.

Both algorithms should run in O(n) time in the best case, but using a HashMap when you don't care about the value (for a given key) is silly and wastes resources. A HashSet is a better fit for such cases.
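
For reference, a minimal sketch of the converted loop might look like this (my illustration, reusing the data array from the question):

import java.util.HashSet;
import java.util.Set;

Set<String> uniqueItems = new HashSet<>();

for (String value : data) {
    // add() returns false when the value was already present,
    // so you can react to duplicates here if you need to
    uniqueItems.add(value);
}
// uniqueItems now holds each distinct value exactly once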

You could also try a TreeSet<String> to see what works best for your particular dataset. It will probably be worse, given the new JDK 8 HashSet implementation: overcrowded buckets are automatically stored as mini tree sets, providing competitive performance even when the hashing function behaves badly. (This optimization is only possible for Comparable types such as String.)


Brute force searching arrays. In a simple, array-based algorithm, where you search the entire array before inserting each element, you would get a very bad O(n²) performance.
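
As an illustration (my sketch, not part of the original answer), the brute-force approach could look like this; each contains() call scans the list, which is what produces the quadratic behaviour:

import java.util.ArrayList;
import java.util.List;

List<String> unique = new ArrayList<>();

for (String value : data) {
    // contains() is a linear scan, so the whole loop is O(n^2)
    if (!unique.contains(value)) {
        unique.add(value);
    }
}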

As such, you might be tempted to sort your data first, placing duplicate elements next to each other. This buys you a faster O(n log n) performance, but it is still behind the HashMap/HashSet version in the general case.
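
Here is a sketch of that sort-then-scan approach (my illustration, assuming the data array contains no nulls):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

String[] sorted = data.clone();
Arrays.sort(sorted);  // O(n log n); throws if the array contains nulls

List<String> unique = new ArrayList<>();
for (int i = 0; i < sorted.length; i++) {
    // after sorting, duplicates are adjacent, so comparing with the
    // previous element is enough to skip them
    if (i == 0 || !sorted[i].equals(sorted[i - 1])) {
        unique.add(sorted[i]);
    }
}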


Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once. As such, the current O(n) time complexity is really the best you can do here.

Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.

In your example the data[i] values are used as keys in the HashMap uniqueItems.

A HashMap will always have unique keys. An existing key's value is simply overwritten by a put() operation, so you don't need containsKey() at all when adding an element.

Why do you remove and re-insert an existing key?
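
In other words (my sketch, keeping the HashMap from the question), the loop can be reduced to a single put() per element:

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    // put() overwrites any existing entry and returns the previous value
    // (or null), so containsKey() and remove() are unnecessary here
    uniqueItems.put(data[i], "inserted");
}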
