
Handling large data structures in Java

I'm working on a Java application that needs to operate on very large matrices, for example multiplying two 10 million × 10 million matrices. The Java heap obviously doesn't have enough space to store even one of these matrices. What should I do? Should I store the matrices in a database, bring each needed part into memory, and multiply part by part?

First off, a 10 million × 10 million matrix is simply enormous. Assuming a double for each cell and no storage overhead, each one of these things is going to be 800 terabytes. Just reading each cell once from main memory (should it somehow magically fit there, which clearly isn't happening) would take days. Doing it from any sort of plausible SAN (we'll put it on 10GbE) is more likely to take months. And no matrix multiply has O(n) complexity - the normal approaches are O(n^3). So... you aren't doing this with memory-mapped files, common databases, or anything of that sort.
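As a back-of-the-envelope check (my own illustrative sketch, not part of the original answer), the 800-terabyte figure works out like this:

```java
public class MatrixSizeEstimate {
    public static void main(String[] args) {
        long n = 10_000_000L;      // 10 million rows and 10 million columns
        long bytesPerCell = 8L;    // one double, ignoring any storage overhead
        long totalBytes = n * n * bytesPerCell;     // 8 * 10^14 bytes
        double terabytes = totalBytes / 1e12;       // decimal terabytes
        System.out.printf("One dense matrix: %.0f TB%n", terabytes);  // prints 800 TB
    }
}
```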

Code doing something like this is going to live or die on cache efficiency, where "cache" includes making good use of main memory and local disk drives. Since any storage interface holding more than one 800-terabyte matrix is bound to be a SAN of some sort, you'll almost certainly involve multiple servers reading and working on different parts of it, too.

There are lots of well-known ways to parallelise matrix multiplication (essentially multiplying various-sized sub-matrices and then combining the results), and to shift the layout so that the access patterns have reasonable cache locality, by organizing the data around space-filling curves instead of row/column arrangements. You'll certainly want to look at the classic LAPACK interfaces and design, and at Intel's MKL and GotoBLAS as implementations of the BLAS functions tuned to specific modern hardware; after that, you're probably venturing into unexplored territory :-)
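A minimal sketch of the tiled sub-matrix idea this answer describes, assuming for illustration that the matrices fit in plain arrays; the class name and block size parameter are hypothetical:

```java
/** Multiplies C += A * B one cache-sized tile at a time (illustrative sketch only). */
public class BlockedMultiply {
    static void multiply(double[][] a, double[][] b, double[][] c, int blockSize) {
        int n = a.length;
        for (int ii = 0; ii < n; ii += blockSize) {
            for (int kk = 0; kk < n; kk += blockSize) {
                for (int jj = 0; jj < n; jj += blockSize) {
                    // Multiply one pair of sub-matrices; the working set of
                    // three tiles stays small enough to remain cache-resident.
                    int iMax = Math.min(ii + blockSize, n);
                    int kMax = Math.min(kk + blockSize, n);
                    int jMax = Math.min(jj + blockSize, n);
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jMax; j++) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Tuned BLAS libraries like MKL and GotoBLAS do essentially this, plus register blocking and vectorization, which is why they are hard to beat by hand.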

The complexity of matrix multiplication, if carried out naively, is O(n^3), but more efficient algorithms do exist. Anyway, for a 10 million × 10 million matrix this is going to take a very long time, and you may well face the same heap problem, but with recursion.

If you're into complex maths, you may find tools to help you in this article.

Consider using an in-memory DB like HSQLDB: http://hsqldb.org/
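For what it's worth, a minimal in-memory HSQLDB connection looks roughly like this; the table layout (one row per non-zero cell) is my own illustration, not something the answer specifies:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbExample {
    public static void main(String[] args) throws Exception {
        // "mem:" URLs create a purely in-memory database, lost when the JVM exits.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:mem:matrices", "SA", "");
             Statement st = conn.createStatement()) {
            // One row per non-zero cell: a plausible layout for sparse data.
            st.execute("CREATE TABLE cells (i BIGINT, j BIGINT, v DOUBLE, "
                     + "PRIMARY KEY (i, j))");
            st.execute("INSERT INTO cells VALUES (0, 0, 3.14)");
        }
    }
}
```

Note that an in-memory database has the same fundamental limit as the heap itself, so this only helps if the matrices are far smaller than the question suggests.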

Since this is such a huge calculation, I think you're going to run into performance problems alongside your storage problems, so I would look at parallelising this problem and getting multiple machines/cores to process a subset of the data.

Luckily, a matrix multiplication solution will decompose naturally (as sketched below), but I would be looking at some form of grid or distributed computing solution.
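Here is one natural decomposition, sketched with an ExecutorService; the class name, stripe scheme, and fixed worker count are my own assumptions. Each worker computes a horizontal stripe of the result, which generalizes to handing stripes to separate machines:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StripedMultiply {
    static void multiply(double[][] a, double[][] b, double[][] c, int workers)
            throws Exception {
        int n = a.length;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Callable<Void>> tasks = new ArrayList<>();
        int stripe = (n + workers - 1) / workers;   // rows per worker, rounded up
        for (int w = 0; w < workers; w++) {
            final int lo = w * stripe;
            final int hi = Math.min(lo + stripe, n);
            tasks.add(() -> {
                // Each task owns rows [lo, hi) of C, so no locking is needed.
                for (int i = lo; i < hi; i++)
                    for (int k = 0; k < n; k++)
                        for (int j = 0; j < n; j++)
                            c[i][j] += a[i][k] * b[k][j];
                return null;
            });
        }
        pool.invokeAll(tasks);   // blocks until every stripe is done
        pool.shutdown();
    }
}
```

The same stripe-ownership idea is what distributed frameworks exploit: stripes of A and all of B go to each node, and results never need to be merged cell by cell.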

Use whatever sparse matrix algorithm applies to your data. (This is on the assumption that you don't have 2.4 PB of disk space to hold three 10^7 × 10^7 dense matrices of doubles, let alone that much RAM for an in-memory database - Blue Gene/Q 'only' has 1.6 PB.)
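If the data really is sparse, something like the following hash-based representation (an illustrative sketch, not a tuned CSR implementation) stores only the non-zero cells, so memory scales with the non-zero count rather than n^2:

```java
import java.util.HashMap;
import java.util.Map;

/** Stores only non-zero entries of an n x n matrix (hypothetical sketch). */
public class SparseMatrix {
    private final Map<Long, Double> cells = new HashMap<>();
    private final long n;

    public SparseMatrix(long n) { this.n = n; }

    public void set(long i, long j, double v) {
        long key = i * n + j;             // flatten (i, j); 10^7 * 10^7 still fits in a long
        if (v == 0.0) cells.remove(key);  // never store explicit zeros
        else cells.put(key, v);
    }

    public double get(long i, long j) {
        return cells.getOrDefault(i * n + j, 0.0);
    }
}
```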

Well, if you are forced to use Java and can't write the code that deals with this as native methods (that is, by telling Java to call some C code instead), then the most efficient thing to do would probably be to use a simple binary file. I would stay away from databases in this case because they are slower than direct file access, and you don't need the features they offer.
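A sketch of the plain-binary-file idea, assuming a row-major layout of doubles; RandomAccessFile and the offset arithmetic are standard JDK, but the class itself is my own illustration (and at the sizes in the question the file would still be 800 TB):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** One cell at a time in an n x n row-major matrix stored as a flat binary file. */
public class FileMatrix implements AutoCloseable {
    private static final int DOUBLE_BYTES = 8;
    private final RandomAccessFile file;
    private final long n;

    public FileMatrix(String path, long n) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.n = n;
    }

    public double get(long i, long j) throws IOException {
        file.seek((i * n + j) * DOUBLE_BYTES);   // byte offset of cell (i, j)
        return file.readDouble();
    }

    public void set(long i, long j, double v) throws IOException {
        file.seek((i * n + j) * DOUBLE_BYTES);
        file.writeDouble(v);
    }

    @Override public void close() throws IOException { file.close(); }
}
```

In practice you would read whole blocks rather than single cells to amortize the seek cost.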

Have a look at Hadoop.

Try using a memory-mapped file: store all your data in an external file and access it via a FileChannel object.

Check out this article for a brief introduction to memory-mapped files (MMF).
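A minimal sketch of the FileChannel approach; the file name and demo size are hypothetical, and note that a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes, so a matrix this large would need many mapped windows:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedExample {
    public static void main(String[] args) throws Exception {
        long cells = 1_000_000;   // tiny demo size, nothing like the 10^14 cells in question
        try (RandomAccessFile raf = new RandomAccessFile("matrix.bin", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map one window of the file into memory; mapping grows the file as needed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, cells * 8);
            buf.putDouble(0, 42.0);                // write cell 0 straight through the mapping
            System.out.println(buf.getDouble(0));  // read it back
        }
    }
}
```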
