
What algorithms do java.util.HashSet and java.util.TreeSet use to store unique values in their structures?

I have come across multiple algorithms, such as the Flajolet-Martin algorithm and HyperLogLog, for finding the unique elements in a list of elements, and became curious about how Java does it. What is the time complexity in each of these cases for storing and finding unique values?

The Flajolet-Martin and HyperLogLog algorithms are about getting an approximate count of the distinct elements (the count-distinct problem) in one pass over a stream of N elements, with O(N) time and modest (much better than O(N)) memory usage.

An implementation of the Map API does not need a solution to the "count-distinct" problem.

(Aside: TreeMap and HashMap already keep a precomputed count of the number of entries in the map 1; i.e. Map.size(). Provided that you don't run into thread-safety problems, the result is exact (not approximate). The cost of calling size() is O(1). The cost of maintaining the count is O(U), where U is the number of map addition and removal operations performed.)
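To illustrate the aside, a small sketch (class name is my own) showing that size() reports an exact, precomputed entry count rather than an estimate:

```java
import java.util.HashMap;
import java.util.Map;

public class SizeDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("a", 3);                // overwrites an existing key; size is unchanged
        System.out.println(map.size()); // exact entry count: 2
        map.remove("b");
        System.out.println(map.size()); // 1
    }
}
```

Both calls return in O(1), because the count is updated incrementally on each put/remove rather than computed on demand.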

More generally, the Flajolet-Martin algorithm and HyperLogLog cannot form the basis for a Map data structure: they do not address the dictionary problem.

The algorithms used by HashMap and TreeMap are (respectively) hash table and binary tree algorithms. There are more details in the respective javadocs, and the full source code (with comments) is readily available for you to look at. (Google for "java.util.HashMap" source ... for example.)
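As a quick illustration of the two underlying structures through their Set counterparts (a sketch; the class name is my own): HashSet gives O(1) expected add/contains via a hash table, while TreeSet gives O(log n) operations via a red-black tree and keeps its elements in sorted order.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetDemo {
    public static void main(String[] args) {
        Set<String> hashSet = new HashSet<>(); // hash table: O(1) expected add/contains
        Set<String> treeSet = new TreeSet<>(); // red-black tree: O(log n), sorted order

        for (String s : new String[] {"pear", "apple", "pear", "banana"}) {
            hashSet.add(s); // add() silently ignores duplicates
            treeSet.add(s);
        }

        System.out.println(hashSet.size()); // 3 distinct elements
        System.out.println(treeSet);        // [apple, banana, pear] -- sorted
    }
}
```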


1 - Interestingly, ConcurrentHashMap doesn't work this way ... because updating the size field would be a concurrency bottleneck. Instead, the size() operation is O(N) .

The HashSet type tracks its elements using a hash table (usually with closed addressing, i.e. separate chaining), and the TreeSet type tracks its elements using a balanced binary search tree. These data structures give exact answers to the question "is this element here?" They are useful for cases where you need to know with 100% certainty whether you've seen something before, and their memory usage is typically directly proportional to the number of distinct elements stored so far.
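For instance, a Set answers exact membership queries, and add() even reports whether an element was genuinely new, which is the typical duplicate-detection idiom (a sketch; the class name is my own):

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateCheck {
    public static void main(String[] args) {
        Set<Long> seen = new HashSet<>();
        System.out.println(seen.add(42L));      // true: 42 was not seen before
        System.out.println(seen.add(42L));      // false: exact duplicate detected
        System.out.println(seen.contains(42L)); // true: exact membership, no false positives
    }
}
```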

On the other hand, cardinality estimators like HyperLogLog are good for answering questions of the form "how many distinct elements are there, give or take a few percent?" They're great in cases where you need a rough estimate of how many distinct things you've seen, where approaches like putting everything in a hash table or a binary search tree would take far too much memory (for example, if you're a Google web server and you want to count distinct IP addresses visiting you), since the amount of memory they use is typically something you get to pick up front. However, they can't answer questions of the form "have I seen this exact thing before?" and so wouldn't work as implementations of any of the java.util.Set subtypes.
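None of this exists in java.util, but to make the constant-memory trade-off concrete, here is a toy, single-hash Flajolet-Martin-style sketch of my own: it remembers only the largest run of trailing zero bits seen in any element's hash, so its memory stays fixed no matter how many distinct elements flow past. (Real estimators like HyperLogLog average many such observations to tame the huge variance of a single one; and note there is no contains() — membership queries are simply unanswerable.)

```java
// Toy Flajolet-Martin-style cardinality estimator: constant memory (one int),
// regardless of how many distinct elements are offered. Illustration only.
public class CardinalityEstimate {
    private int maxTrailingZeros = 0;

    public void offer(Object element) {
        int h = element.hashCode();
        h ^= (h >>> 16); // mix the high bits down, as HashMap does
        if (h != 0) {
            maxTrailingZeros = Math.max(maxTrailingZeros,
                                        Integer.numberOfTrailingZeros(h));
        }
    }

    // Rough estimate of the number of distinct elements seen: 2^R,
    // where R is the longest observed run of trailing zeros.
    public long estimate() {
        return 1L << maxTrailingZeros;
    }
}
```

A single sketch like this is far too noisy for real use; the point is only that offer() touches one int, while a HashSet holding the same stream would grow linearly with the number of distinct elements.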

In short, the data structures here are designed to solve different problems. The traditional BST and hash table are there for exact queries where knowing for certain whether you've seen something is the goal and you want to be able to, say, iterate over all the elements seen. Cardinality estimators are good where you just care about how many total distinct elements there are, you don't care what they are, and you don't need exact answers.
