
Handle large data structure in Java

Original: 2009-03-16 12:42:34 | Tags: java, memory, matrix

I'm working on a Java application that needs to work with very large matrices, for example multiplying two 10 million * 10 million matrices. Of course the Java heap does not have enough space to store even one of these matrices. What should I do? Should I use a database to store my matrices, bring each needed part into memory, and multiply it one part after another?

9 answers

First off, a 10 million x 10 million matrix is simply enormous. Assuming doubles for each cell and no storage overhead, each one of these things is going to be 800 terabytes. Just reading each cell once from main memory (should it somehow magically fit there, which clearly isn't happening) would take days. Doing it from any sort of plausible SAN (we'll put it on 10GbE) is more likely to take months. And no matrix multiply has O(n) complexity - the normal approaches are O(n^3). So... you aren't doing this with memory mapped files, common databases, or anything of that sort.
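The 800 terabyte figure is easy to check. A minimal sketch of the arithmetic, assuming 8-byte doubles, decimal terabytes, and no per-object overhead (the class and method names are made up for illustration):

```java
class MatrixFootprint {
    /** Bytes needed for a dense n x n matrix of doubles, ignoring any overhead. */
    static long denseBytes(long n) {
        // multiplyExact throws instead of silently overflowing for absurd n
        return Math.multiplyExact(Math.multiplyExact(n, n), (long) Double.BYTES);
    }

    public static void main(String[] args) {
        long n = 10_000_000L;            // 10 million rows and columns
        long bytes = denseBytes(n);      // 10^7 * 10^7 cells * 8 bytes each
        System.out.println(bytes / 1_000_000_000_000L + " TB per matrix");
    }
}
```

Two inputs plus one result come to three such matrices, which is where a total of about 2.4 PB comes from.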

Code doing something like this is going to live or die on cache efficiency, where "cache" includes making good use of main memory and local disk drives. Since any storage interface holding more than one 800 terabyte matrix is bound to be a SAN of some sort, you will almost certainly involve multiple servers reading and working on different parts of it, too.

There are lots of well-known ways to parallelise matrix multiplication (essentially multiplying various-sized sub-matrices and then combining the results), and to shift the layout so that the access patterns have reasonable cache locality, by organizing the data around space-filling curves instead of row/column arrangements. You're certainly going to want to look at the classic LAPACK interfaces and design, and at Intel's MKL and GotoBLAS as implementations of the BLAS functions tuned to specific modern hardware, and after that you're probably venturing into unexplored territory :-)
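To make the sub-matrix idea concrete, here is a minimal in-memory sketch of blocked (tiled) multiplication. It is not how LAPACK or MKL do it, only the basic shape of the technique; the class name and the blockSize parameter are assumptions for illustration. At the scale in the question each tile would be loaded from disk or another machine rather than taken from a Java array.

```java
class BlockedMultiply {
    /** C = A * B for square n x n matrices, processed in blockSize x blockSize tiles. */
    static double[][] multiply(double[][] a, double[][] b, int blockSize) {
        int n = a.length;
        double[][] c = new double[n][n];
        // The three outer loops pick one tile of A, one of B and one of C.
        for (int ii = 0; ii < n; ii += blockSize) {
            for (int kk = 0; kk < n; kk += blockSize) {
                for (int jj = 0; jj < n; jj += blockSize) {
                    int iMax = Math.min(ii + blockSize, n);
                    int kMax = Math.min(kk + blockSize, n);
                    int jMax = Math.min(jj + blockSize, n);
                    // The inner loops multiply the two tiles and accumulate into C's tile.
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jMax; j++) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

The block size would be chosen so that three tiles fit comfortably in whatever level of "cache" is being targeted, whether that is CPU cache, RAM, or local disk.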

The complexity of matrix multiplication, if carried out naively, is O(n^3), but more efficient algorithms do exist. Either way, for a 10 million * 10 million matrix this is going to take a very long time, and you may face the same heap problem, only now with recursion.

If you're into complex maths you may find a tool to help you in this article.

Consider using an in-memory database such as http://hsqldb.org/

Since this is such a huge calculation, I think you're going to run into performance problems alongside your storage problems. So I would look at parallelising this problem, and getting multiple machines/cores to each process a subset of the data.

Luckily a matrix multiplication solution will decompose naturally. But I would be looking at some form of grid or distributed computing solution.
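As a rough illustration of how naturally it decomposes: every row of the result depends only on one row of A and all of B, so rows can be handed to different workers with no coordination. The sketch below uses cores on a single machine via a parallel stream; a grid solution would hand out row bands to machines instead. The class name is made up for this example.

```java
import java.util.stream.IntStream;

class ParallelMultiply {
    /** C = A * B, with each output row computed as an independent parallel task. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, inner = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            double[] ci = c[i];              // only this task writes to row i
            for (int k = 0; k < inner; k++) {
                double aik = a[i][k];
                double[] bk = b[k];
                for (int j = 0; j < m; j++) {
                    ci[j] += aik * bk[j];
                }
            }
        });
        return c;
    }
}
```

Because no two tasks write to the same row, no locking is needed; the only shared data is read-only.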

Use whatever sparse matrix algorithm applies to your data. (On the assumption that you don't have 2.4 PB of disk space to hold three 10^7-square non-sparse matrices of doubles, let alone that much RAM for an in-memory database - Blue Gene/Q 'only' has 1.6 PB.)
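If the matrices really are sparse, only the non-zero cells need to be stored and multiplied. A minimal sketch, assuming a simple map-of-maps layout (the SparseMatrix class is hypothetical; real sparse libraries typically use compressed row or column formats instead):

```java
import java.util.HashMap;
import java.util.Map;

class SparseMatrix {
    // row index -> (column index -> non-zero value); absent cells are zero
    final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();

    void set(int r, int c, double v) {
        if (v != 0.0) rows.computeIfAbsent(r, k -> new HashMap<>()).put(c, v);
    }

    double get(int r, int c) {
        Map<Integer, Double> row = rows.get(r);
        if (row == null) return 0.0;
        Double v = row.get(c);
        return v == null ? 0.0 : v;
    }

    /** Returns this * other, touching only pairs of non-zero cells. */
    SparseMatrix times(SparseMatrix other) {
        SparseMatrix result = new SparseMatrix();
        for (Map.Entry<Integer, Map<Integer, Double>> rowEntry : rows.entrySet()) {
            int i = rowEntry.getKey();
            for (Map.Entry<Integer, Double> aEntry : rowEntry.getValue().entrySet()) {
                int k = aEntry.getKey();
                double aik = aEntry.getValue();
                Map<Integer, Double> bRow = other.rows.get(k);
                if (bRow == null) continue;  // row k of B is all zeros
                Map<Integer, Double> outRow =
                        result.rows.computeIfAbsent(i, x -> new HashMap<>());
                for (Map.Entry<Integer, Double> bEntry : bRow.entrySet()) {
                    outRow.merge(bEntry.getKey(), aik * bEntry.getValue(), Double::sum);
                }
            }
        }
        return result;
    }
}
```

The work done is proportional to the number of matching non-zero pairs rather than to n^3, which is the whole point when most cells are zero.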

Well, if you are forced to use Java and can't write the code that deals with this as native methods (that is, by telling Java to call some C code instead), then the most efficient thing to do would probably be to use a simple binary file. I would stay away from databases in this case because they are slower than direct file access and you don't need the features they offer.
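A simple binary file here could mean nothing more than the cells written out row by row as 8-byte doubles, so that a cell's position is plain arithmetic. A minimal sketch using RandomAccessFile (the FileMatrix class and its row-major layout are assumptions for illustration):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class FileMatrix implements AutoCloseable {
    private final RandomAccessFile file;
    private final long cols;

    FileMatrix(String path, long rows, long cols) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.cols = cols;
        // Reserve space for every cell; cells should be written before they are read.
        file.setLength(rows * cols * Double.BYTES);
    }

    /** Byte offset of cell (r, c) in a row-major file of doubles. */
    private long offset(long r, long c) {
        return (r * cols + c) * Double.BYTES;
    }

    double get(long r, long c) throws IOException {
        file.seek(offset(r, c));
        return file.readDouble();
    }

    void set(long r, long c, double v) throws IOException {
        file.seek(offset(r, c));
        file.writeDouble(v);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Seeking for every single cell would be far too slow in practice; real code would read and write whole tiles at a time and keep them in memory while multiplying.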

Have a look at Hadoop.

Try using a memory-mapped file: store all your data in an external file and access it via a FileChannel object.

Check out this article for a brief introduction to MMF.
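For reference, a minimal sketch of the FileChannel approach. It assumes the file already exists, is large enough, and holds doubles in row-major order; the MappedTile class and its method names are made up for this example. Note that a single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB), so a matrix of the size in the question would have to be accessed through many small windows like this one.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedTile {
    /** Maps rows [firstRow, firstRow + rowCount) of a row-major file of doubles. */
    static MappedByteBuffer mapRows(Path path, long cols, long firstRow, int rowCount)
            throws IOException {
        try (FileChannel channel = FileChannel.open(
                path, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long position = firstRow * cols * Double.BYTES;
            long size = (long) rowCount * cols * Double.BYTES; // must be <= Integer.MAX_VALUE
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_WRITE, position, size);
        }
    }

    /** Byte index of (localRow, col) inside a buffer returned by mapRows. */
    static int index(long cols, int localRow, long col) {
        return Math.toIntExact((localRow * cols + col) * Double.BYTES);
    }
}
```

Cells are then read and written with buffer.getDouble(MappedTile.index(cols, row, col)) and the matching putDouble call, and the operating system decides which pages are actually held in RAM.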

Have a look at CGL-MapReduce: http://www.cs.indiana.edu/~jekanaya/cglmr.html#Matrix_Multiplication