
What algorithm is used by java.util.HashSet and java.util.TreeSet to store unique values in its structure?

Original post: 2017-10-22 01:54:21 | Tags: java, set, time-complexity, big-o, hyperloglog

I have come across several algorithms, such as the Flajolet-Martin algorithm and HyperLogLog, for finding the unique elements in a list of elements, and I became curious about how Java does it. What is the time complexity in each of these cases to store and find unique values?

2 Answers

The Flajolet-Martin and HyperLogLog algorithms are about getting an approximate count of the distinct elements (the count-distinct problem) in one pass over a stream of N elements, using O(N) time and modest memory (much less than O(N)).

An implementation of the Map API does not need a solution to the "count-distinct" problem. (The same applies to HashSet and TreeSet, which are backed by a HashMap and a TreeMap respectively.)

(Aside: TreeMap and HashMap already keep a precomputed count of the number of entries in the map ¹; i.e. Map.size(). Provided that you don't get into thread-safety problems, the result is exact (not approximate). The cost of calling size() is O(1). The cost of maintaining it is O(U), where U is the number of map addition and removal operations performed.)

More generally, the Flajolet-Martin algorithm and HyperLogLog do not, and cannot, form the basis for a Map data structure. They do not address the dictionary problem.

The algorithms used by HashMap and TreeMap are (respectively) hash table and binary tree algorithms; TreeMap is specifically a red-black tree. There are more details in the respective javadocs, and the full source code (with comments) is readily available for you to look at. (Google for "java.util.HashMap" source, for example.)
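To make the exact (non-approximate) behaviour concrete, here is a minimal sketch of collecting unique values with both set types. The class and method names (UniqueValuesDemo, distinctHashed, distinctSorted) are made up for illustration. Per the javadocs, HashSet offers constant-time add/remove/contains assuming the hash function disperses elements well, and TreeSet guarantees log(n) time for the same operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class UniqueValuesDemo {
    // Distinct values via a hash table: expected O(1) per add, no ordering guarantee.
    public static Set<String> distinctHashed(List<String> input) {
        return new HashSet<>(input);
    }

    // Distinct values via a red-black tree: O(log n) per add, iterated in sorted order.
    public static List<String> distinctSorted(List<String> input) {
        return new ArrayList<>(new TreeSet<>(input));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "apple", "pear", "fig", "apple");
        Set<String> hashed = distinctHashed(words);
        System.out.println(hashed.size());          // 3 distinct values
        System.out.println(hashed.add("fig"));      // false: "fig" is already present
        System.out.println(distinctSorted(words));  // [apple, fig, pear]
    }
}
```

Both sets reject duplicates on insertion, so the count they report is exact; the price is memory proportional to the number of distinct elements.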

¹ Interestingly, ConcurrentHashMap doesn't work this way, because updating a single size field would be a concurrency bottleneck. Instead, size() sums several internal counters when it is called (per-segment counts in older versions, striped counter cells since Java 8), so it is not a single field read, and the result is only a snapshot if the map is being updated concurrently.

The HashSet type tracks its elements using a hash table (usually with closed addressing, i.e. separate chaining) and the TreeSet type tracks its elements using a binary search tree. These data structures give exact answers to the question "is this element here?" and are useful for cases where you need to know with 100% certainty whether you've seen something before. Their memory usage is typically directly proportional to the total number of distinct elements seen so far.
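As an illustration of the closed-addressing idea, here is a toy chained hash set. It is a sketch only, not the JDK's implementation: the name ToyChainedSet is hypothetical, the bucket count is fixed (no resizing), and it handles only String values.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ToyChainedSet {
    private final List<LinkedList<String>> buckets = new ArrayList<>();
    private int size; // updated on every successful add, so size() is O(1)

    public ToyChainedSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    private LinkedList<String> bucketFor(String value) {
        // floorMod keeps the index non-negative even for negative hash codes
        return buckets.get(Math.floorMod(value.hashCode(), buckets.size()));
    }

    // Returns false and leaves the set unchanged if the value is already stored.
    public boolean add(String value) {
        LinkedList<String> bucket = bucketFor(value);
        if (bucket.contains(value)) {
            return false;
        }
        bucket.add(value);
        size++;
        return true;
    }

    // Exact membership test: only the one bucket the value hashes to is scanned.
    public boolean contains(String value) {
        return bucketFor(value).contains(value);
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        ToyChainedSet seen = new ToyChainedSet(16);
        System.out.println(seen.add("10.0.0.1"));      // true: first time seen
        System.out.println(seen.add("10.0.0.1"));      // false: duplicate
        System.out.println(seen.contains("10.0.0.2")); // false: never added
        System.out.println(seen.size());               // 1
    }
}
```

Every distinct value is stored in full, which is why the answer to "have I seen this?" is exact and why memory grows with the number of distinct elements.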

On the other hand, cardinality estimators like HyperLogLog are good for answering questions of the form "how many distinct elements are there, give or take a few percent?" They're great in cases where you need a rough estimate of how many distinct things you've seen and where approaches like putting everything in a hash table or a binary search tree would take way too much memory (for example, if you're a Google web server and you want to count the distinct IP addresses visiting you), since the amount of memory they use is typically something you get to pick up front. However, they don't permit you to answer questions of the form "have I seen this exact thing before?" and so wouldn't work as implementations of any of the java.util.Set subtypes.
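For contrast, here is a simplified HyperLogLog-style estimator, offered as a sketch rather than a production implementation. The class name ToyHyperLogLog, the parameter p, and the use of a SplitMix64-style mixer as the hash function are all assumptions made for illustration. It uses the standard raw estimate with the small-range linear-counting correction and omits the further bias corrections that real libraries apply.

```java
public class ToyHyperLogLog {
    private final int p;            // number of index bits (hypothetical parameter)
    private final byte[] registers; // m = 2^p one-byte registers: memory is fixed up front

    public ToyHyperLogLog(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.registers = new byte[1 << p];
    }

    // SplitMix64-style mixer, used here only as a convenient 64-bit hash (an assumption).
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    public void add(long value) {
        long h = mix(value);
        int index = (int) (h >>> (64 - p)); // top p bits pick a register
        long rest = h << p;                 // remaining 64 - p bits
        // rank = position of the first 1-bit in the remaining bits, capped if they are all 0
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[index]) {
            registers[index] = (byte) rank; // each register keeps only the maximum rank
        }
    }

    public double estimate() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // constant for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            return m * Math.log((double) m / zeros); // small-range correction
        }
        return raw;
    }

    public static void main(String[] args) {
        ToyHyperLogLog hll = new ToyHyperLogLog(12); // 4096 registers, about 4 KB
        for (long i = 0; i < 100_000; i++) {
            hll.add(i);
            hll.add(i); // duplicates never change a register
        }
        System.out.println(hll.estimate()); // approximate count of distinct values
    }
}
```

Note that the structure has no contains method and nothing to iterate over: the registers record only the maximum rank seen per bucket, not the values themselves, which is why it cannot back a java.util.Set.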

In short, the data structures here are designed to solve different problems. The traditional BST and hash table are there for exact queries, where the goal is to know for certain whether you've seen something and you want to be able to, say, iterate over all the elements seen. Cardinality estimators are good where you just care about how many distinct elements there are in total, you don't care what they are, and you don't need exact answers.