简体   繁体   中英

3D Array vs Sparse Matrix for Sparse Data

I am building a simple model of the Milky Way and one of the things I need to store is a 3D grid of mass densities.

The problem is that if I put a rectangular box around the galaxy, most of the grid cells are empty. Which leaves me saving a lot of useless zeros. So the naive array seems wasteful:

galaxy = [[[0 for k in xrange(1601)] for j in xrange(1601)] for i in xrange(253)]
# then fill in i,j,k values that are non-zero

I tried to build a sparse array using a dictionary:

for x in range(1601):
    for y in range(1601):
        for z in range (253):
            galaxy[str(x) + "," + str(y) + "," + str(z)] = # whatever

But, (aside from being ugly) the strings that I was using for keys were taking up more memory than I was saving. I got OutOfMemoryError s because (I calculated) the keys alone were taking a couple of gigs of memory.

At some point, I will want to increase the resolution of my model, and that will mean a much larger grid. Is there a more efficient way to store my values than using a 3D-Array of floats?

I am also concerned about the time it takes to iterate through all the cells (or just the non-zero cells in my grid. This will be very important.

Try using the dictionary approach, but only store key:value pairs for keys whose value is non-zero. A better key might be a tuple of (x,y,z).

Quick math: 1601 * 1601 * 253 => 648489853 items . A test indicates that the dictionary takes about 24 bytes per entry on a 32-bit machine, 49 bytes on a 64-bit machine, so that's 15,563,756,472 bytes (or 30GB on 64-bit). 10% of that is 1.5GB (or 3.0GB on 64-bit). If you have a 64-bit system with a bunch of memory, I think you'll be okay with a sparse representation.

I recommend:

  1. Use a tuple as the key, not a string, and
  2. Use a sparse storage system where you don't store zero values.

Here is one possibility:

class SparseDict(dict):
  def __init__(self, default_value):
    dict.__init__(self)
    self._value = default_value
  def __getitem__(self, key):
    try:
      return dict.__getitem__(self, key)
    except KeyError:
      return self._value
  def __setitem__(self, key, val):
    # I'm sure this can go faster if I were smarter
    if val == self._value:
      if  key in self:
        del self[key]
    else:
      dict.__setitem__(self, key, val)

def test(galaxy):
  import sys
  print len(galaxy), sys.getsizeof(galaxy)

  # test is 1/10th size in each dimension,
  # so 1/1000th of the volume
  for x in range(160):
    for y in range(160):
      for z in range (25):
        import random
        # 90% of space is essentially a vacuum
        if random.random() < .1:
          galaxy[x,y,z] = 1502100
        else:
          galaxy[x,y,z] = 0

  print len(galaxy), sys.getsizeof(galaxy)

test(SparseDict(0))

Maybe try to save your data in sql tables and load only a subset of the cube according to your needs. this will cost you in time to load parts , but will save you memory. as for the in memory represntation, use the methods suggested in the other answers such as dictionary etc...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM