
Java HashMap detect collision

Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where a lot of collisions can take place? Of course, if you override hashCode() for an object and simply return a constant value, collisions are sure to occur. I'm not talking about that. I want to know in which situations, other than the one just mentioned, a huge number of collisions occurs without modifying the default hashCode() implementation.

I have created a project to benchmark this sort of thing: http://code.google.com/p/hashingbench/ (for hashtables with chaining, open addressing, and bloom filters).

Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:

public int scramble(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

So for a collision to occur in a HashMap, the necessary and sufficient condition is the following: scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary, condition for a collision to occur.

Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or, when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & (N - 1) (faster). ("compress" is the name Goodrich and Tamassia use to describe this step in Data Structures and Algorithms in Java.) Thanks go to ILMTitan for spotting my sloppiness.
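To illustrate the compress step, here is a minimal sketch (the class and method names are made up for this example, they are not JDK internals). It checks that masking with N - 1 selects the same bucket as a non-negative modulo when N is a power of two. It uses Math.floorMod rather than a plain %, because Java's % can return a negative value for a negative hash.

```java
class CompressDemo {
    // Bucket selection by modulo; floorMod keeps the result in {0, ..., n-1}
    static int compressMod(int hash, int n) {
        return Math.floorMod(hash, n);
    }

    // Bucket selection by bit mask; only valid when n is a power of two
    static int compressMask(int hash, int n) {
        return hash & (n - 1);
    }

    public static void main(String[] args) {
        int n = 16; // hypothetical number of buckets
        int[] samples = {0, 1, 17, 1024, -1, -17, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h + " -> mod " + compressMod(h, n)
                    + ", mask " + compressMask(h, n));
        }
    }
}
```

Both columns should agree for every sample, which is why the cheaper mask can replace the modulo for power-of-two table sizes.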

Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc.) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.

(For example, HashMap's smearing function was put into place because people were using HashMap with objects that had the worst type of hashCode() for the old, power-of-two-table implementation of HashMap without smearing - objects that differ only a little, or not at all, in the low-order bits that were used to select a bucket - eg new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, HashMap's smearing function tries its best to have all bits affect the low-order bits.)
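A small sketch of that effect, assuming a table of 16 buckets and the scramble function quoted earlier in this answer (SmearDemo and bucket are names invented for the example). Without smearing, keys that are multiples of 1024 all fall into bucket 0; with smearing, they spread over more than one bucket.

```java
class SmearDemo {
    // Smearing function as quoted in the answer (older HashMap versions)
    static int scramble(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Bucket index in a power-of-two table, with or without smearing
    static int bucket(int hash, int n, boolean smear) {
        return (smear ? scramble(hash) : hash) & (n - 1);
    }

    public static void main(String[] args) {
        int n = 16; // hypothetical table size
        for (int i = 1; i <= 4; i++) {
            int h = Integer.valueOf(i * 1024).hashCode(); // Integer's hash is its value
            System.out.println(h + ": raw bucket " + bucket(h, n, false)
                    + ", smeared bucket " + bucket(h, n, true));
        }
    }
}
```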

All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().

PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles, and the use as keys of values such as 1.0, 2.0, 3.0, 4.0, ..., all of which have the same (zero) low-order bits. This is the related old bug report: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4669519
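You can see those zero low-order bits with a few lines of Java. This is only a sketch, assuming a 16-bucket table and no smearing at all (DoubleKeysDemo and rawBucket are made-up names); Double.hashCode(double) is the static helper available since Java 8.

```java
class DoubleKeysDemo {
    // Bucket index if the raw hash code were masked directly (no smearing)
    static int rawBucket(double d, int n) {
        return Double.hashCode(d) & (n - 1);
    }

    public static void main(String[] args) {
        int n = 16; // hypothetical table size
        for (double d = 1.0; d <= 4.0; d += 1.0) {
            System.out.println(d + ": hash 0x"
                    + Integer.toHexString(Double.hashCode(d))
                    + ", raw bucket " + rawBucket(d, n));
        }
    }
}
```

The hex output shows that only the high-order bits differ between these keys, so an unsmeared power-of-two table would put all of them in the same bucket.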

Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:

(int)(this.longValue()^(this.longValue()>>>32))

ie imagine it's two int values stuck next to each other, and XOR them.

So all of these will have a hash code of 0:

0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)

etc

I don't know whether that counts as a "huge number of collisions", but it's one example where collisions are easy to manufacture.
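A quick way to check this, as a sketch (LongCollisionDemo and collidingWithZero are names I made up): build a long whose upper and lower 32-bit halves are identical, so the XOR of the two halves cancels out.

```java
class LongCollisionDemo {
    // Builds a long whose high and low 32-bit halves are both equal to i
    static long collidingWithZero(int i) {
        long low = i & 0xFFFFFFFFL; // mask to avoid sign extension for negative i
        return low | (low << 32);
    }

    public static void main(String[] args) {
        for (int i = 0; i <= 3; i++) {
            long value = collidingWithZero(i);
            System.out.println(value + " -> hashCode " + Long.valueOf(value).hashCode());
        }
    }
}
```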

Obviously any hash where there are more than 2^32 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
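For String, one well-known ASCII pair is "Aa" and "BB", which share a hash code. Because String.hashCode() is a polynomial in 31, concatenating such equal-length blocks yields more colliding strings. A minimal sketch (the class and method names are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

class StringCollisionDemo {
    // Returns 2^blocks distinct strings built from "Aa"/"BB" blocks;
    // all of them have the same hashCode()
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String s : result) {
                next.add(s + "Aa");
                next.add(s + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : collidingStrings(3)) {
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```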

The other two answers I see are good, IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a large number of objects of your class, put them into the particular HashMap implementation as keys, and measure CPU and memory load. 1 or 2 million entries is a good number to measure with, but you get the best results if you test with your anticipated Map sizes.

I just looked at a class whose hashing function I doubted. So I decided to fill a HashMap with random objects of that type and count the collisions. I tested two hashCode() implementations of the class under investigation. To do that, I wrote in Groovy the class you see at the bottom, which extends the OpenJDK implementation of HashMap to count the number of collisions in the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash but collisions in the array holding the entries. The array index is calculated as hash & (length-1), which means that the smaller this array is, the more collisions you get. And the size of this array depends on the initialCapacity and loadFactor of the HashMap (it can grow as you put() more data).

In the end, though, I concluded that looking at these numbers makes little sense. The fact that HashMap is slower with a bad hashCode() method means that just by benchmarking insertion and retrieval of data from the Map, you effectively find out which hashCode() implementation is better.

public class TestHashMap extends HashMap {

   public TestHashMap(int size) {
      super(size);
   }

   public TestHashMap() {
      super();
   }

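   // Counts entries that sit behind another entry in the same bucket.
   // Relies on HashMap's private "table" field, so it is tied to the JDK
   // implementation; on Java 9+ it may need
   // --add-opens java.base/java.util=ALL-UNNAMED to be allowed.
   public int countCollidingEntries() {
      def fs = this.getClass().getSuperclass().getDeclaredFields();
      def table;
      def count = 0;
      for ( java.lang.reflect.Field field: fs ) {
         if (field.getName() == "table") {
            field.setAccessible(true);
            // read the bucket array of this map instance
            table = field.get(this);
            break;
         }
      }
      // the table is null until the first entry has been put
      if (table == null) {
         return 0;
      }
      for(Object e: table) {
         if (e != null) {
            // every entry after the first one in a chain is a collision
            while (e.next != null) {
               count++
               e = e.next;
            }
         }
      }
      return count;
   }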
}
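If you would rather not depend on HashMap's private fields, the same count can be approximated by simulating the bucket selection. This is only a sketch, assuming distinct non-null keys, a power-of-two table length that you choose yourself, and the older smearing function quoted in the first answer (newer JDKs use a different one). BucketCollisionCounter is a made-up name.

```java
import java.util.Arrays;
import java.util.Collection;

class BucketCollisionCounter {
    // Smearing function as quoted earlier in this thread (older HashMap versions)
    static int scramble(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Counts keys that land in a bucket already occupied by another key.
    // Assumes distinct, non-null keys and a power-of-two tableLength.
    static int countCollidingEntries(Collection<?> keys, int tableLength) {
        int[] perBucket = new int[tableLength];
        for (Object key : keys) {
            perBucket[scramble(key.hashCode()) & (tableLength - 1)]++;
        }
        int count = 0;
        for (int entries : perBucket) {
            if (entries > 1) {
                count += entries - 1; // everything after the first entry collides
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Collection<Integer> keys = Arrays.asList(1024, 2048, 3072, 4096);
        System.out.println(countCollidingEntries(keys, 16));
    }
}
```

Unlike the reflection-based version, this does not follow the map's actual resizing, so you have to pass the table length you want to examine.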
