简体   繁体   English

用于存储矩阵的Java大型数据结构

[英]Java large datastructure for storing a matrix

I need to store a 2d matrix containing zip codes and the distance in km between each one of them. 我需要存储一个包含邮政编码的2d矩阵,以及每个矩阵之间的距离。 My client has an application that calculates the distances which are then stored in an Excel file. 我的客户端有一个计算距离的应用程序,然后存储在Excel文件中。 Currently, there are 952 places. 目前,有952个地方。 So the matrix would have 952x952 = 906304 entries. 因此矩阵将具有952x952 = 906304个条目。

I tried to map this into a HashMap[Integer, Float]. 我试图将其映射到HashMap [Integer,Float]。 The Integer is the hash code of the two Strings for two places, eg "A" and "B". Integer是两个字符串的哈希码,例如“A”和“B”。 The float value is the distance in km between them. 浮点值是它们之间的距离(以km为单位)。

While filling in the data I run into OutOfMemoryExceptions after 205k entries. 填写数据时,我会在205k条目后运行OutOfMemoryExceptions。 Do you have a tip how I can store this in a clever way? 你有一个提示,我怎么能以聪明的方式存储它? I even don't know if it's clever to have the whole bunch in memory. 我甚至不知道在整个记忆中是否都很聪明。 My options are SQL and MS Access... 我的选择是SQL和MS Access ......

The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up. 问题是我需要非常快速地访问数据,这可能是我选择HashMap的原因,因为它在O(1)中运行以进行查找。

Thansk for your replies and suggestions! Thansk的回复和建议!

Marco 马尔科

A 2d array would be more memory efficient. 2d阵列将更有效地记忆。 You can use a small hashmap to map the 952 places into a number between 0 and 951 . 您可以使用小型散列映射将952个位置映射到0到951之间的数字。 Then, just do: 然后,就这样做:

float[][] distances= new float[952][952];

To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array. 要查找,只需使用两个哈希查找将两个位置转换为两个整数,并将它们用作二维数组的索引。

By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap. 通过这种方式,你可以避免浮动的装箱,以及大型hashmap的内存开销。

However, 906304 really isn't that many entries, you may just need to increase the Xmx maximum heap size 但是,906304确实不是那么多条目,你可能只需要增加Xmx的最大堆大小

I would have thought that you could calculate the distances on the fly. 我原本以为你可以动态计算距离。 Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; 据推测有人已经这样做了,所以你只需要找出他们使用的算法和输入数据; eg longitude/latitude of the notional centres of each ZIP code. 例如每个邮政编码的名义中心的经度/纬度。

EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs. 编辑:有两种常用的算法用于查找由经度/纬度对给出的两点之间的(近似)测地距离。

  • The Vicenty formula is based on an ellipsoid approximation. Vicenty公式基于椭球近似。 It is more accurate, but more complicated to implement. 它更准确,但实施起来更复杂。

  • The Haversine formula is based on a spherical approximation. Haversine公式基于球面近似。 It is less accurate (0.3%), but simpler to implement. 它不太准确(0.3%),但实施起来更简单。

Can you simply boost the memory available to the JVM ? 你能简单地增加JVM可用的内存吗?

java -Xmx512m ...

By default the maximum memory configuration is 64Mb. 默认情况下 ,最大内存配置为64Mb。 Some more tuning tips here . 这里还有一些调整技巧。 If you can do this then you can keep the data in-process and maximise the performance (ie you don't need to calculate on the fly). 如果您可以这样做,那么您可以保持数据在进行中并最大化性能(即您不需要动态计算)。

I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. 我赞同Chi和Benjamin的答案,因为他们告诉你你需要做什么,但是当我在这里时,我想强调直接使用两个字符串的哈希码会让你陷入困境。 You're likely to run into the problem of hash collisions. 您可能会遇到哈希冲突的问题。

This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble. 如果你连接两个字符串(小心使用不能出现在地方指示符中的分隔符),并让HashMap发挥其魔力,但是你建议的方法,使用两个字符串的哈希码,这不会是一个问题。一把钥匙,这会让你陷入困境。

You will simply need more memory. 你只需要更多的内存。 When starting your Java process, kick it off like so: 在启动Java进程时,请将其启动:

java -Xmx256M MyClass java -Xmx256M MyClass

The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. -Xmx定义了最大堆大小,因此这表示进程最多可以为堆使用256 MB内存。 If you still run out, keep bumping that number up until you hit the physical limit. 如果你仍然用光,请继续撞到该数字,直到达到物理极限。

Lately I've managed similar requisites for my master thesis. 最近我为我的硕士论文设定了类似的要求。

I ended with a Matrix class that uses a double[] , not a double[][] , in order to alleviate double deref costs ( data[i] that is an array, then array[i][j] that is a double ) while allowing the VM to allocate a big, contiguous chunk of memory: 我以一个使用double[]而不是double[][]的Matrix类结束,以减轻双重deref成本( data[i]是一个数组,然后array[i][j]是一个double )允许VM分配一大块连续的内存:

public class Matrix {

    private final double data[];
    private final int rows;
    private final int columns;

    public Matrix(int rows, int columns, double[][] initializer) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];

        int k = 0;

        for (int i = 0; i < initializer.length; i++) {
            System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
            k += initializer[i].length;
        }
    }

    public Matrix set(int i, int j, double value) {
        data[j + i * columns] = value;
        return this;
    }

    public double get(int i, int j) {
        return data[j + i * columns];
    }
}

this class should use less memory than an HashMap since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 Mb (for double s) or 906304 * 4 ~ 4 Mb (for float s). 这个类应该使用比HashMap更少的内存,因为它使用原始数组 (不需要装箱):它只需要906304 * 8 ~ 8 Mb (用于 s)或906304 * 4 ~ 4 Mb (用于浮点数 )。 My 2 cents. 我的2美分。

NB I've omitted some sanity checks for simplicity's sake NB为了简单起见,我省略了一些健全性检查

The above suggestions regarding heap size will be helpful. 关于堆大小的上述建议将有所帮助。 However, I am not sure if you gave an accurate description of the size of your matrix. 但是,我不确定您是否准确描述了矩阵的大小。

Suppose you have 4 locations. 假设您有4个位置。 Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. 然后,您需要评估A-> B,A-> C,A-> D,B-> C,B-> D,C-> D之间的距离。 This suggests six entries in your HashMap (4 choose 2). 这表示您的HashMap中有六个条目(4个选择2)。

That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; 这会让我相信你的HashMap的实际最佳大小是(952选择2)= 452,676; NOT 952x952=906,304. NOT 952x952 = 906,304。

This is all assuming, of course, that you only store one-way relationships (ie from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space. 当然,这只是假设你只存储单向关系(即从A-> B,但不是从B-> A,因为那是多余的),我建议你因为你已经遇到内存问题空间。

Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate. 编辑:应该说你的矩阵的大小不是最优的,而不是说描述不准确。

Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. Stephen C.有一个很好的观点:如果距离像苍蝇一样,那么你可以通过动态计算来节省记忆。 All you'd need is space for the longitude and latitude for 952 zip codes and then you could use the vicenty formula to do your calculation when you need to. 所有你需要的是952邮政编码的经度和纬度空间,然后你可以使用副数公式在你需要的时候进行计算。 This would make your memory usage O(n) in zipcodes. 这将使您的内存使用O(n)在zipcodes中。

Of course, that solution makes some assumptions that may turn out to be false in your particular case, ie that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions. 当然,该解决方案会做出一些假设,在您的特定情况下可能会变得虚假,即您拥有拉链码的经度和纬度数据,并且您关注的是蝇蝇距离而不是更多像行车路线一样复杂。

If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset. 如果这些假设是正确的,那么如果您需要处理更大的数据集,那么为一大堆内存交换一些计算可能会帮助您在将来扩展。

Create a new class with 2 slots for location names. 创建一个新的类,其中包含2个位置名称插槽。 Have it always put the alphabetically first name in the first slot. 它始终将字母顺序的名字放在第一个插槽中。 Give it a proper equals and hashcode method. 给它一个适当的equals和hashcode方法。 Give it a compareTo (eg order alphabetically by names). 给它一个compareTo(例如按名称的字母顺序排序)。 Throw them all in an array. 把它们全部扔进阵列中。 Sort it. 解决。

Also, hash1 = hash2 does not imply object1 = object2. 此外,hash1 = hash2并不意味着object1 = object2。 Don't ever do this. 不要这样做。 It's a hack. 这是一个黑客。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM