I want to remove duplicate values from my data. I know this is a frequently asked question on Stack Overflow, but my problem is a little different because I am handling a very large amount of data, so execution time is my main concern.
As the snippet below shows, I wrote a simple piece of code for removing duplicate values.
// Suppose that the dataset is very huge so that
// multi-node resources would be necessary.
String[] data = new String[10_000_000];
HashMap<String, String> uniqueItems = new HashMap<>();
for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}
However, I don't like it, because I suspect that other data structures or algorithms could remove duplicates more efficiently than my code does.
So I am looking for faster ways to remove duplicate values when the data is large.
I would appreciate it if you could let me know the fastest way to remove duplicate values.
I am also wondering whether the proportion of duplicates affects performance. For example, if 50% of the values are duplicates, would the best choice of algorithm and data structure change? If so, I want an approach that performs well in the general case.
Convert your uniqueItems to a HashSet<String> and change the body of your for loop to simply:

uniqueItems.add(data[i]);

If add returns true, you have inserted a unique string; if it returns false, the value was a duplicate.
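For illustration, the whole loop could look like this. A minimal sketch, using a small sample array in place of the 10-million-element one:

```java
import java.util.HashSet;
import java.util.Set;

public class DedupExample {
    public static void main(String[] args) {
        String[] data = {"a", "b", "a", "c", "b"};
        Set<String> uniqueItems = new HashSet<>();
        for (String s : data) {
            // add() returns true only the first time a value is seen
            boolean firstTime = uniqueItems.add(s);
            if (firstTime) {
                // react to a new unique value here, if needed
            }
        }
        System.out.println(uniqueItems.size()); // prints 3
    }
}
```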
Both algorithms run in O(n) expected time, but using a HashMap when you don't care about the value (for a given key) is wasteful. A HashSet is a better fit for such cases.

You could also try a TreeSet<String> to see what works best for your particular dataset, though it will probably be worse, given the new JDK 8 HashSet implementation: overcrowded buckets are automatically stored as mini tree sets, providing competitive performance even when the hashing function behaves badly. (This optimization is only possible for Comparable types such as String.)
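If you do want to compare, a TreeSet drops duplicates the same way but keeps its elements sorted, at O(log n) per insert. A quick sketch with sample data:

```java
import java.util.TreeSet;

public class TreeSetDedup {
    public static void main(String[] args) {
        String[] data = {"b", "a", "c", "a", "b"};
        TreeSet<String> uniqueItems = new TreeSet<>();
        for (String s : data) {
            uniqueItems.add(s); // O(log n) per insert; duplicates are ignored
        }
        // Unlike HashSet, iteration order is sorted
        System.out.println(uniqueItems); // prints [a, b, c]
    }
}
```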
Brute-force array searching. In a simple array-based algorithm, where you search the entire array before inserting each element, you would get very bad O(n²) performance.
As such, you might be tempted to sort your data first, placing duplicate elements next to each other. This buys you faster O(n log n) performance, but still falls behind the HashMap/HashSet version in the general case.
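The sort-then-scan idea works because after sorting, every duplicate sits next to its twin, so one pass over adjacent pairs finds them all. A minimal sketch (the method name countUnique is just for illustration):

```java
import java.util.Arrays;

public class SortDedup {
    // Sorts a copy of the input, then counts values that differ
    // from their left neighbor; O(n log n) overall.
    static int countUnique(String[] data) {
        if (data.length == 0) return 0;
        String[] copy = data.clone();
        Arrays.sort(copy);
        int unique = 1;
        for (int i = 1; i < copy.length; i++) {
            if (!copy[i].equals(copy[i - 1])) {
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(countUnique(new String[]{"b", "a", "c", "a", "b"})); // prints 3
    }
}
```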
Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once, so our current O(n) time complexity is really the best you can do here.
Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.
In your example the data[i] values are used as keys in the HashMap uniqueItems.
A HashMap always has unique keys; a put() operation simply overwrites the value of an existing key. You don't need containsKey() to add a new element.
Why do you remove and re-insert an existing key?
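In other words, the whole containsKey()/remove() branch collapses to a single put() call. A minimal sketch with sample data in place of the real array:

```java
import java.util.HashMap;

public class SimplifiedPut {
    public static void main(String[] args) {
        String[] data = {"a", "b", "a"};
        HashMap<String, String> uniqueItems = new HashMap<>();
        // put() replaces the value for an existing key,
        // so no containsKey() or remove() is needed:
        for (int i = 0; i < data.length; i++) {
            uniqueItems.put(data[i], "inserted");
        }
        System.out.println(uniqueItems.size()); // prints 2
    }
}
```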