
Recommendation for randomly accessing a large number of objects (like a hash table)

I'm processing some generated data files (hundreds of Mbytes) which contain several G objects. I need to access these objects randomly. A possible implementation, I guess, might be a big HashTable. My program is written in Java and it seems java.util.HashMap cannot handle this (somehow it's extremely slow). Could anyone recommend a solution for randomly accessing these objects?

If a HashMap is extremely slow, then the two most likely causes are as follows:

  • The hashCode() and/or equals(Object) methods on your key class could be very expensive. For instance, if you use a collection (or a key that wraps a large array or list) as a key, hashCode() will visit every element each time you call it, and equals(Object) will do the same when comparing matching keys.

  • Your key class could have a poor hashCode() method that gives the same value for a significant percentage of the (distinct) keys used by the program. When this occurs you get many key collisions, and that can be really bad for performance when the hash table gets large.

I suggest you look at these possibilities first ... before changing your data structure. The sketch below illustrates both pitfalls.
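As a rough illustration of both points, here is a minimal sketch (the key classes and field names are made up for this example; List.copyOf needs Java 10+). The first version recomputes an element-by-element hash on every call; the second computes it once in the constructor and caches it:

    import java.util.List;

    // Hypothetical key whose hash walks every element on every call.
    final class SlowKey {
        final List<Integer> parts;
        SlowKey(List<Integer> parts) { this.parts = parts; }

        @Override public int hashCode() { return parts.hashCode(); }   // O(n) per call
        @Override public boolean equals(Object o) {
            return o instanceof SlowKey && parts.equals(((SlowKey) o).parts);
        }
    }

    // Same key with the hash computed once and cached; equals() still
    // compares element by element, but only runs on a hash match.
    final class CachedKey {
        final List<Integer> parts;
        private final int hash;

        CachedKey(List<Integer> parts) {
            this.parts = List.copyOf(parts);    // defensive copy: keys must not change
            this.hash = this.parts.hashCode();  // computed exactly once
        }

        @Override public int hashCode() { return hash; }
        @Override public boolean equals(Object o) {
            return o instanceof CachedKey && parts.equals(((CachedKey) o).parts);
        }
    }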


Note: if "several G objects" means several billion objects, then you'll have trouble holding the files' contents in memory ... unless you are running this application on a machine with hundreds of gigabytes of RAM. I advise you to do some "back of the envelope" calculations to see if what you are trying to do is feasible.
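For a rough sense of that calculation: each java.util.HashMap entry costs on the order of 32-48 bytes of overhead (the entry object, the key and value references, and the table slot) before you count the key and value objects themselves, so one billion entries would need something like 40+ GB of heap. The exact figures depend on the JVM and object layout, but the order of magnitude is what decides feasibility.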

Whatever your keys are, make sure you're generating a good hash for each one via hashCode(). A lot of the time, bad HashMap performance can be blamed on colliding hashes. When there's a collision, HashMap builds a linked list of the colliding entries.

In the worst case, if you return the same hash for all objects, HashMap essentially becomes a linked list. Here's a good starting place for writing hash functions: http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml
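For example, a conventional field-by-field hash looks like the sketch below (the Point3D class is just an illustration; java.util.Objects.hash(x, y, z) gives an equivalent result):

    // Combine each significant field with a small odd prime so that
    // distinct keys spread across the table instead of colliding.
    final class Point3D {
        final int x, y, z;
        Point3D(int x, int y, int z) { this.x = x; this.y = y; this.z = z; }

        @Override public int hashCode() {
            int h = 17;
            h = 31 * h + x;
            h = 31 * h + y;
            h = 31 * h + z;
            return h;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Point3D)) return false;
            Point3D p = (Point3D) o;
            return x == p.x && y == p.y && z == p.z;
        }
    }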

A few hundred MB cannot hold several billion objects unless each object is a single bit (and a bit is not really an object, IMHO).

How I would approach this is to use a memory-mapped file to map in the contents of the data, and to build your own hash table in another memory-mapped file (which requires you to scan the data once to build the keys).
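A minimal sketch of the mapping step, assuming a hypothetical data.bin file with length-prefixed records (the index file holding key-to-offset entries would be mapped the same way):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MappedData {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Path.of("data.bin"),
                                                   StandardOpenOption.READ)) {
                // A single mapping is limited to 2 GB, which is fine for a
                // file of a few hundred MB; pages are faulted in on demand.
                MappedByteBuffer data =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

                long offset = 0;                 // in practice this comes from your index
                data.position((int) offset);
                int length = data.getInt();      // length-prefixed record (an assumption)
                byte[] object = new byte[length];
                data.get(object);                // raw bytes of one object
            }
        }
    }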

Depending on the layout of the data, it is worth remembering that random access is not the most cache-friendly access pattern: the CPU loads cache lines of 64 bytes (depending on the architecture), so if your structure doesn't fit in memory, record-based tables may be more efficient.
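As a sketch of that idea (the 64-byte record size is an assumption), fixed-width records let a lookup compute its offset directly and touch only one or two cache lines, instead of chasing per-entry objects around the heap:

    static final int RECORD_SIZE = 64;   // one record per cache line, for illustration

    // 'table' is a MappedByteBuffer as in the sketch above.
    static long readKey(MappedByteBuffer table, int index) {
        int offset = index * RECORD_SIZE;    // no pointer chasing, no per-entry object
        return table.getLong(offset);        // first 8-byte field of the record
    }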
