
Creating very large NUMPY arrays in small chunks (PyTables vs. numpy.memmap)

There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.

QUESTION:

I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each slice along one dimension holds the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in RAM is not suitable, since it exhausts the RAM and brings the machine to a halt. So I need to be able to initialize the larger array on disk and fill it in small batches, so that each batch gets written out to the larger array on disk.

I initially thought that numpy.memmap is the way to go, but when I issue a command like

mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))

the RAM fills up and the machine slows to a halt.

After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the docs or elsewhere that illustrates this common use case.

If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any references to examples are appreciated!

That's weird. np.memmap should work. I've been using it with 250 GB of data on a machine with 12 GB of RAM without problems.

Does the system really run out of memory at the very moment the memmap file is created, or does it happen later in the code? If it happens at file creation, I really don't know what the problem would be.

When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:

import numpy as np

mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')

for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])  # list_of_files is the list of file names
    smallarray = do_something_with_array(smallarray)  # your per-file processing
    mmapData[:, k] = smallarray

It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
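One thing worth knowing if memory still creeps up with the loop above: writes to a memmap sit in memory as dirty pages until the OS writes them back, and the array has a flush() method to force that. A hedged addition inside the loop (the every-100-columns interval is just an illustrative choice, not something from the original answer):

    if k % 100 == 0:
        mmapData.flush()  # write the columns filled so far out to disk

Deleting the memmap object when you're done also flushes any remaining changes to disk.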

PS: Be aware that the default dtypes for np.memmap (uint8) and np.fromfile (float64) are different!
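For example, both defaults can be checked directly ('tmp.dat' here is just a throwaway file for the comparison):

import numpy as np

m = np.memmap('tmp.dat', mode='w+', shape=(8,))
print(m.dtype)                       # uint8: np.memmap's default dtype
print(np.fromfile('tmp.dat').dtype)  # float64: np.fromfile's default dtype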

HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
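For the plain-array case, here is a minimal h5py sketch of the same fill-one-column-at-a-time pattern, reusing the hypothetical smallarray_size, number_of_arrays, list_of_files and do_something_with_array names from the memmap answer above (the chunk shape is just one reasonable choice):

import numpy as np
import h5py

with h5py.File('large_array.h5', 'w') as f:
    dset = f.create_dataset('data',
                            shape=(smallarray_size, number_of_arrays),
                            dtype='float64',
                            chunks=(smallarray_size, 1))  # one column per chunk keeps column writes cheap
    for k, fname in enumerate(list_of_files):
        dset[:, k] = do_something_with_array(np.fromfile(fname))

Only one column's worth of data is in RAM at any time; HDF5 writes each chunk to disk as it is filled.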

There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see the docs on stacking).
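A hedged sketch of what that could look like for this use case, again reusing the hypothetical names from the answers above; da.stack builds the large logical array lazily and da.to_hdf5 streams it to disk chunk by chunk:

import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_and_process(fname):
    return do_something_with_array(np.fromfile(fname))  # one small array at a time

lazy_columns = [da.from_delayed(load_and_process(f),
                                shape=(smallarray_size,),
                                dtype='float64')
                for f in list_of_files]
big = da.stack(lazy_columns, axis=1)        # logical shape (smallarray_size, number_of_arrays); nothing computed yet
da.to_hdf5('large_array.h5', '/data', big)  # evaluated and written out chunk by chunk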


 