
Java Project: Make HashMap (including Load-Store) Performance Better

I am writing code for our server, in which I have to find a user's access type by URL.

At the beginning, we saw 100 million distinct URLs accessed per day. Over time, this has grown to nearly 600 million distinct URLs per day.

For 100 million, what we did was the following:

1) Build a HashMap using parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int) - a key can have multiple values. (A sketch of this kind of structure follows these steps.)

2) Then search the HashMap to find how many times each URL was accessed.
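
For illustration, here is a minimal sketch of that kind of parallel-array table (open addressing with linear probing; the class name, the power-of-two capacity, the reserved 0 key and the single-value-per-key simplification are all assumptions, not the actual implementation):

// Open-addressing hash table backed by two parallel primitive arrays (no boxing).
// Assumes capacity is a power of two and that 0 never occurs as a real key.
public class LongIntArrayMap {
    private final long[] keys;
    private final int[] values;
    private final int mask;

    public LongIntArrayMap(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    private int indexOf(long key) {
        int idx = (int) (key ^ (key >>> 32)) & mask;   // spread the 64-bit key into the table
        while (keys[idx] != 0 && keys[idx] != key) {
            idx = (idx + 1) & mask;                    // linear probing on collision
        }
        return idx;
    }

    public void put(long key, int value) {
        int idx = indexOf(key);
        keys[idx] = key;
        values[idx] = value;
    }

    public int get(long key) {
        int idx = indexOf(key);
        return keys[idx] == key ? values[idx] : -1;    // -1 means "not found"
    }
}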

Now, as the HashTable became larger, we did the following:

1) Build two/three separate HashTables, and load and store them (on a general file system) to find how many times a URL was accessed.

Now, the issue is:

1) Though the HashTable performance is quite nice, the code takes more time while loading/storing the HashTable (we are using FileChannel; it takes 16-19 seconds to load/store a HashTable with 200 million entries, as the load factor is 0.5).

What we are trying to ask is:

1) Any comments on how to solve this issue?

2) How can we reduce the load/store time (I asked before, but it seems FileChannel is the best way)?

3) Would storing a HashTable larger than memory and caching parts of it repeatedly be a nice solution? If so, how should we do that (at least some pointers)? We tried it by using

RandomAccessFile raf = new RandomAccessFile("array.dat", "rw");
IntBuffer map = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30).order(ByteOrder.nativeOrder()).asIntBuffer();

However, it gives worse performance than before.

Thanks.

NB:

1) As per previous suggestions on Stack Overflow, we tried a NoSQL DB like TokyoCabinet, but from our experience a custom HashTable gives better performance than it on 100 million key-value pairs.

2) Pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the same happens the next day when the system starts again.

What we forgot to mention is:

1) As our application is part of a project to be deployed on a small campus, we assume no more than 800 million distinct URLs will be accessed. So you can consider the 600-700 million figure as fixed.

2) Our main concern is performance.

3) We have to run our application locally.

Edit: the code of our hashmap can be found here.

It might be best to access the table as a memory-mapped buffer. That way, you could simply implement random access to the file without worrying about loading and storing, and leave caching to the operating system. I see that your current implementation already uses memory-mapped access for reading and writing, but it still loads things into the Java heap in between. Avoid this data duplication and copying! Treat the backing file itself as the data structure, and only access the portions of it that you actually need, only when you need them.
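
As a minimal sketch of what "treat the backing file itself as the data structure" could look like, assume the file is a flat array of fixed-size slots (8-byte key, 4-byte value) laid out by the same hashing scheme; the class and file names are illustrative, and a single MappedByteBuffer is limited to 2 GB, so a table of the size in the question would need several mappings:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedSlotTable {
    private static final int SLOT_BYTES = 12;   // 8-byte key + 4-byte value per slot
    private final MappedByteBuffer buf;

    public MappedSlotTable(String file, int slots) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            // The mapping stays valid after the channel is closed; the OS page cache
            // decides which parts of the file actually live in RAM.
            buf = raf.getChannel()
                     .map(FileChannel.MapMode.READ_WRITE, 0, (long) slots * SLOT_BYTES);
            buf.order(ByteOrder.nativeOrder());
        }
    }

    public long keyAt(int slot)   { return buf.getLong(slot * SLOT_BYTES); }
    public int  valueAt(int slot) { return buf.getInt(slot * SLOT_BYTES + 8); }

    public void write(int slot, long key, int value) {
        buf.putLong(slot * SLOT_BYTES, key);
        buf.putInt(slot * SLOT_BYTES + 8, value);
    }
}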

Within that file, hash maps will work if you are really sure that hash collisions are not an issue. Otherwise I'd go for a B+ tree there, with nodes about the size of your hard disk pages. That way, each disk access will yield a lot more usable data than just a single key, resulting in a shallower tree and fewer individual disk operations.

I guess others will have implemented stuff like this, but if you prefer your own hash map implementation, you might prefer to write your own memory-mapped B+ tree as well.

The whole approach sounds ridiculous to me. I gather what you really want to achieve is a simple access counter per distinct URL. By its very nature, this data is frequently written but rarely ever read.

For this purpose, I would simply have a database table and add a new entry for every access (it can serve as a log as well). When you need to figure out how often any URL was accessed, this can easily be done using a SELECT COUNT from the table (depending on how much additional data you store along with the URL entries, you can even do constrained counts, such as how often it was accessed yesterday, last week, etc.).
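
A rough sketch of that idea with plain JDBC (the table name, columns and surrounding schema are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AccessLog {
    // One row per access; an index on the url column does the heavy lifting later.
    public static void logAccess(Connection con, String url) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO url_access(url, accessed_at) VALUES (?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    // Count only when the answer is actually needed.
    public static long countAccesses(Connection con, String url) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT COUNT(*) FROM url_access WHERE url = ?")) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}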

This puts all the work off to the point where the result is really needed.

BTW, you may be able to retrieve the access counts from the web server's log files as well, so maybe you don't need to write any data yourself. Look into this first.

You can use a caching framework like JCS. 1 billion key-value pairs should not be a problem.

http://commons.apache.org/jcs/

Definitely try redis; I think it beats anything else.
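
For example, with the Jedis client (a local Redis instance and the key naming are assumptions), counting an access is a single atomic INCR:

import redis.clients.jedis.Jedis;

public class RedisCounter {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // INCR creates the counter on first use and increments it atomically.
            jedis.incr("url:example.com/page");
            System.out.println("accesses: " + jedis.get("url:example.com/page"));
        }
    }
}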

You can use Berkeley DB, which is basically a key/value store written in C for ultimate performance. It's an Oracle product (open source, though), so I would take it seriously.
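
A minimal sketch using the Berkeley DB Java Edition API (the pure-Java sibling of the C library; the directory, database name and the long/int encoding are assumptions):

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.nio.ByteBuffer;

public class BdbStore {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/urlstore"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "urlCounts", dbConfig);

        // Store one long key -> int value pair; BDB persists and caches pages itself.
        DatabaseEntry key = new DatabaseEntry(ByteBuffer.allocate(8).putLong(123456789L).array());
        DatabaseEntry value = new DatabaseEntry(ByteBuffer.allocate(4).putInt(42).array());
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(ByteBuffer.wrap(found.getData()).getInt());
        }

        db.close();
        env.close();
    }
}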

If your application has to run locally without the use of any external computing power, there is no solution more performant than direct memory access: the only data structure that can give you better performance than a HashMap is an array, where access to every element is O(1). This requires, however, knowing in advance how many items you have, having a unique addressing index per element, and also being able to reserve significant adjacent memory.

After arrays, which as described are suitable only for limited cases, you have HashTables; however, as the size of the data grows, the cost of collisions and dynamic resizing increases and degrades performance.

You can refer to the java.util.HashMap javadoc, but also to Wikipedia (http://en.wikipedia.org/wiki/Hash_table), to understand the following:

  • How expensive is the hash function to compute?
  • How well are the values distributed?
  • What load factor are you using, i.e. what cost will you have for conflict resolution?
  • How often will you need to resize your HashMap before it fully contains all the data? (A pre-sizing sketch follows this list.)
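
On the last point, if you already know roughly how many entries will go in, you can pre-size the map so it never rehashes; a small sketch (the entry count mirrors the 200 million mentioned in the question, the rest is illustrative):

import java.util.HashMap;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int expectedEntries = 200_000_000;
        float loadFactor = 0.5f;
        // Resizing happens when size exceeds capacity * loadFactor, so this capacity
        // (rounded up internally to a power of two) guarantees no rehash ever occurs.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<Long, Integer> map = new HashMap<>(initialCapacity, loadFactor);
    }
}

At that scale, though, a boxed HashMap<Long, Integer> carries a lot of per-entry object overhead, which is why a custom primitive table (or a primitive-collections library such as Trove, mentioned further down) remains attractive.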

If your performance degrades when building your HashMap (which I actually suspect is a ConcurrentHashMap, since if you build it in parallel it has to be thread safe), you might want to investigate why that happens.

A simple but easy first step would be to replace your HashMap with a TreeMap, whose performance is a deterministic function of its size, and compare the two.


If, on the other hand, I misinterpreted your question and you have the opportunity to scale the computation across multiple machines, there are plenty of interesting solutions on the market, as someone has already pointed out, to which I would add Cassandra.

These solutions achieve performance improvements by distributing the load among multiple nodes, but inside each node they use well-known algorithms for fast and efficient addressing.

It's not clear from the question and follow-up discussion, but what is the nature of your queries? You've got very different situations between
a) working through all ~700 million URLs during each working day, or
b) hitting some small number of those ~700 million URLs.

So: what's the ratio of the number of queries to the number of URLs?

From your description, it sounds like you may be loading/unloading different files representing different portions of your array... which suggests random queries, which suggests (b).

As well, I gather you've already recognized that "all in memory" isn't feasible (i.e. you've broken the array across multiple files), so an optimal disk-access algorithm seems to be the next order of business, no?

Have you tried, per query, a simple seek (n * arrayElementSize) to the offset in the file and just reading a few pages into memory (do you have/know a maximum number of values per key)? You've already got (computed) the base index into your array, so this should be easy to prototype.
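
A rough sketch of that per-query seek (the fixed record size, file layout and the fact that probing/collisions are ignored are all simplifying assumptions):

import java.io.RandomAccessFile;

public class SeekLookup {
    private static final int RECORD_BYTES = 12;   // 8-byte key + 4-byte value per slot

    // Seek straight to the slot computed from the key and read only that record,
    // instead of loading the whole table into the heap.
    public static int lookup(RandomAccessFile raf, long slotIndex, long expectedKey) throws Exception {
        raf.seek(slotIndex * RECORD_BYTES);
        long key = raf.readLong();
        int value = raf.readInt();
        return key == expectedKey ? value : -1;   // -1 means "not found"
    }
}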

I would suggest you use Oracle Coherence Cache. You get all the benefits of a HashTable, and it has all the methods that Map has.

Performance-wise, you can store data as per your requirements. Please have a look.

You can try HugeCollections; I think it was written for this purpose.

HugeCollections
A library to support collections with millions or billions of entries.

Specifically, HugeMap.

Use the open-source sqlite as an in-memory database.
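
A tiny sketch of that, assuming the Xerial sqlite-jdbc driver (the table layout is illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteInMemory {
    public static void main(String[] args) throws Exception {
        // ":memory:" keeps the whole database in RAM; use a file path instead to persist it.
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = con.createStatement()) {
            st.executeUpdate("CREATE TABLE url_counts (url_key INTEGER PRIMARY KEY, hits INTEGER)");
            st.executeUpdate("INSERT INTO url_counts VALUES (123456789, 42)");
            try (ResultSet rs = st.executeQuery("SELECT hits FROM url_counts WHERE url_key = 123456789")) {
                if (rs.next()) System.out.println(rs.getInt(1));
            }
        }
    }
}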

If I understand you correctly, your data structure is not that big:

[(32 + 64) bits * 600 million] = 57.6 billion bits, i.e. roughly a 6.7 GiB structure in memory

The map data structure would consume some space too. I've found out the hard way that Trove is one of the most memory-efficient data structures around. I'd use a TLongIntHashMap to store long keys and int values. It stores raw primitives, so you bypass the Long and Integer wrapper objects.
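
A small sketch with Trove's primitive map (package name as in Trove 3.x; assumed here):

import gnu.trove.map.hash.TLongIntHashMap;

public class TroveCounts {
    public static void main(String[] args) {
        // Primitive long -> int map: no boxing, far less memory than HashMap<Long, Integer>.
        TLongIntHashMap counts = new TLongIntHashMap();
        long urlKey = 123456789L;
        counts.adjustOrPutValue(urlKey, 1, 1);   // increment, inserting 1 on first sight
        System.out.println(counts.get(urlKey));
    }
}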

It seems you have a mostly read-only dataset that does not fit in memory, and you need fast key lookups. I am afraid there is no silver-bullet solution here, only a few possible tradeoffs.

If you access the 600M records all over the place, no matter what you do you are going to be limited by disk random-access speed (not sequential access speed). Use FileChannel.map to directly access the file (no, don't read the contents of the file into memory, just operate on the MappedByteBuffer; your OS will take care of caching for you). Investing in an SSD looks to be a good way to spend money (or maybe just buy some more memory?).

This is a campus environment, right? Maybe you can use computers in a lab to build a memcached/redis/etc. cluster? Maybe you could use it off-hours?

If you access some identifiable pieces of data at the same time (i.e. now we analyze domain a, then b, etc.), then splitting the data into buckets is a good idea. Keep related data physically close, to help caching. Or maybe pre-sort the URLs and access them in binary-search fashion?

If some probability of collisions is acceptable, maybe not storing the full URLs but only 64-bit hashes of the URLs as hash keys is acceptable? With some gymnastics you could probably get away with not storing the keys at all?
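
For example, a 64-bit FNV-1a hash of the URL string could serve as the key; with roughly 800 million keys in a 64-bit space the chance of any collision is small but not zero (this is a generic hashing sketch, not the poster's scheme):

public class UrlHash {
    // 64-bit FNV-1a, applied to the UTF-16 chars of the URL for simplicity.
    public static long fnv1a64(String url) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (int i = 0; i < url.length(); i++) {
            hash ^= url.charAt(i);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(fnv1a64("http://example.com/page"));
    }
}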

Those are my ideas for the moment.
