
HashMap alternatives for memory-efficient data storage

I've currently got a spreadsheet-type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.

This question asks about efficient collections libraries, and the answer was to use Google Collections. My follow-up is "which part?". I've been reading through the documentation, but it doesn't give me a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions.)

So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.

  • My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings
  • Some columns will have a lot of repeated values
  • Primary operations are to update or remove records based on the values of certain fields, and also adding/removing/combining columns

I'm aware of options like H2 and Derby, but in this case I'm not looking to use an embedded database.

EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc., I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.

Some columns will have a lot of repeated values

immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
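A minimal sketch of the flyweight idea for this case (the `CellPool` class name is illustrative, not from any library): keep one canonical instance per distinct value, so a column with thousands of identical Strings stores one object plus cheap references.

```java
import java.util.HashMap;
import java.util.Map;

// A canonicalizing pool: repeated cell values share one instance.
class CellPool {
    private final Map<Object, Object> pool = new HashMap<>();

    // Returns the canonical instance equal to value, registering it on first sight.
    Object intern(Object value) {
        Object canonical = pool.putIfAbsent(value, value);
        return canonical != null ? canonical : value;
    }
}
```

Passing every cell through `intern` before storing it means equal values collapse to a single object, at the cost of one extra map holding the canonical copies.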

Trove collections take particular care about the space occupied (I think they also have tailored data structures if you stick to primitive types). Take a look here.

Otherwise you can try Apache Commons Collections... just do your benchmarks!

In any case, if you have many references to the same elements, try to apply some suitable pattern (like Flyweight).

So I'm assuming that you have a Map<ColumnName,Column>, where the column is actually something like an ArrayList<Object>.

A few possibilities:

  • Are you completely sure that memory is an issue? If you're just generally worried about size, it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.

  • You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load-factor combinations that may help. I've messed around with this in the past; you might get a 30% reduction in memory if you're lucky.
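As a sketch of the presizing idea (the `Maps.presized` helper is hypothetical): for n expected entries, a capacity of n / loadFactor, rounded up, means the table never needs to rehash, and a higher load factor trades a little lookup speed for a denser table.

```java
import java.util.HashMap;
import java.util.Map;

class Maps {
    // Build a HashMap that can hold expectedEntries without ever resizing,
    // instead of relying on the default capacity 16 / load factor 0.75.
    static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int capacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<>(capacity, loadFactor);
    }
}
```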

  • What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?

Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column refers to which cell. This way you have only 4-8 bytes of overhead per cell.
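That layout might look like the following sketch (the `ArrayTable` class name is illustrative): one shared column-index map for the whole table, and each row stored as a plain Object[], so the per-cell cost is just the array slot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One column-index map shared by every row; rows are bare arrays.
class ArrayTable {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    ArrayTable(String... columns) {
        for (String c : columns) columnIndex.put(c, columnIndex.size());
    }

    void addRow(Object... cells) { rows.add(cells); }

    Object get(int row, String column) {
        return rows.get(row)[columnIndex.get(column)];
    }
}
```

Compared to one HashMap per row, the column names and their hash buckets are stored exactly once instead of once per record.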

If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may also be useful in reducing the memory consumed.

EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding/removing a row is just a matter of removing that row. If it's column-based, you can have an array per column. This can make handling primitive types much more efficient: you can have one column which is an int[] and another which is a double[], since it's much more common for an entire column to have the same data type than for a whole row to.
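A sketch of one such primitive column (the `IntColumn` class is illustrative): an int column costs 4 bytes per cell plus a little array slack, instead of a boxed Integer object per cell.

```java
import java.util.Arrays;

// Column-oriented storage for one int column, backed by a growable int[].
class IntColumn {
    private int[] data = new int[16];
    private int size;

    void add(int value) {
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        data[size++] = value;
    }

    int get(int row) { return data[row]; }
    int size() { return size; }
}
```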

However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.

(Something I do is have row-based data and add columns to the end; if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
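The append-with-default trick above can be sketched as a single getter (the `PaddedRows` helper is hypothetical): rows created before a column was added are simply shorter, and the getter fills in the default instead of rebuilding every row.

```java
// Row-based storage where later-added columns imply a default value
// for any row array that is too short to contain them.
class PaddedRows {
    static Object get(Object[] row, int column, Object defaultValue) {
        return column < row.length ? row[column] : defaultValue;
    }
}
```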

Guava does include a Table interface and a hash-based implementation. Seems like a natural fit for your problem. Note that this is still marked as beta.

keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * (size of a pointer) bytes (16 being the default initial capacity), plus some variables for the hash object (14 + psize). If you have a lot of sparsely filled rows, this could be a big problem.

One option would be to use a single large hash with a composite key (combining row and column), although that doesn't make operations on whole rows very efficient.
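A sketch of such a composite key (the `CellKey` class is illustrative): one flat map keyed by (row, column) means a sparse sheet pays only for the cells that exist, though a whole-row scan now has to probe every column.

```java
import java.util.Objects;

// Composite (row, column) key for a single flat cell map.
class CellKey {
    final int row;
    final String column;

    CellKey(int row, String column) {
        this.row = row;
        this.column = column;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CellKey)) return false;
        CellKey k = (CellKey) o;
        return row == k.row && column.equals(k.column);
    }

    @Override public int hashCode() { return Objects.hash(row, column); }
}
```

Usage would be a single `Map<CellKey, Object>` for the whole sheet rather than one HashMap per row.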

Also, since you don't mention the operation of adding cells, you can create the hashes with only the necessary internal storage (the initialCapacity parameter).

I don't know much about Google Collections, so I can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.

I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense, but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.

It seems to use roughly 10% less memory and load about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.

Chronicle Map could have an overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).

Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted for in HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generating flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.

Why don't you try using a cache implementation like EHCache? This turned out to be very effective for me when I hit the same situation.
You can simply store your collection within the EHCache implementation. There are configurations like:

Maximum bytes to be used from the local heap.

Once the bytes used by your application overflow what is configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a least-recently-used algorithm. With this type of cache implementation you can be sure of avoiding out-of-memory errors; it only increases the IO operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are a lot of configurations to optimize for your requirements.

From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayLists (each ArrayList would be a column).

I'd add a double map from field-name to column-number, and some clever getters/setters that never throw IndexOutOfBoundsException.

You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
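A minimal sketch of those never-throwing accessors over a jagged matrix (the `JaggedMatrix` class name is illustrative): reads beyond the ragged edge return null, and writes grow the structure on demand.

```java
import java.util.ArrayList;
import java.util.List;

// Jagged matrix whose get/set never throw IndexOutOfBoundsException.
class JaggedMatrix {
    private final List<List<Object>> rows = new ArrayList<>();

    Object get(int row, int col) {
        if (row >= rows.size()) return null;
        List<Object> r = rows.get(row);
        return col < r.size() ? r.get(col) : null;
    }

    void set(int row, int col, Object value) {
        while (rows.size() <= row) rows.add(new ArrayList<>());
        List<Object> r = rows.get(row);
        while (r.size() <= col) r.add(null);   // pad the ragged edge
        r.set(col, value);
    }
}
```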

Some columns will have a lot of repeated values

I doubt this matters, especially if they are Strings (which can be interned), since your collection would store references to them.
