
Is it possible to np.concatenate memory-mapped files?

I saved a couple of numpy arrays with np.save(), and put together they're quite huge.

Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?

Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk.

In either case you must choose the right order for the array (row-major or column-major).
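For the first option, a minimal sketch could look like this (assuming a.array and b.array already exist on disk with the shapes used in the axis=0 example below; c.array is just an illustrative name):

import numpy as np

# two existing arrays on disk, opened read-only
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))

# a third memmap in a new file, big enough for both
c = np.memmap('c.array', dtype='float64', mode='w+', shape=(20000,1000))
c[:5000] = a   # copy the first array into the top block
c[5000:] = b   # copy the second array below it
c.flush()      # push everything to disk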

The following examples illustrate the more efficient in-place approach, concatenating along axis 0 and axis 1.


1) concatenate along axis=0

a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222

You can define a third array reading the same file as the first array to be concatenated (here a) in mode r+ (read and write an existing file), but with the shape of the final array you want to achieve after concatenation, like:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b

Concatenating along axis=0 does not require passing order='C' because this is already the default order.
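A quick sanity check (just a sketch, run after the assignment above) confirms that both blocks are in place:

c.flush()
print((c[:5000] == 111).all())   # True: the original contents of a
print((c[5000:] == 222).all())   # True: the block copied from b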


2) concatenate along axis=1

a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222

The arrays saved on disk are actually flattened, so if you create c with mode='r+' and shape=(5000,4000) without changing the array order, the first 1000 elements from the second row of a will end up in the first row of c. But you can easily avoid this by passing order='F' (column-major) to memmap:

c = np.memmap('a.array', dtype='float64', mode='r+', shape=(5000,4000), order='F')
c[:, 3000:] = b

Here you have an updated file 'a.array' with the concatenation result. You may repeat this process to concatenate more arrays, two at a time.
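Again, a quick sanity check (a sketch using the arrays from example 2):

c.flush()
print((c[:, :3000] == 111).all())   # True: the original contents of a
print((c[:, 3000:] == 222).all())   # True: the block copied from b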


Maybe an alternative solution, but I also had a single multidimensional array spread over multiple files which I only wanted to read. I solved this issue with dask concatenation.

import numpy as np
import dask.array as da
 
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))

c = da.concatenate([a, b], axis=0)

This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls compute().
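For example (a small sketch building on c from above):

total = c.sum().compute()       # reductions stay lazy until compute()
first_rows = c[:10].compute()   # slices materialize as plain numpy arrays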

Note that there are two caveats:

  1. it is not possible to do in-place re-assignment, e.g. c[::2] = 0 is not possible, so creative solutions are necessary in those cases.
  2. this also means the original files can no longer be updated. To save results out, the dask store method should be used; it can again accept a memory-mapped array as the target (see the sketch after this list).
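A minimal sketch of that second point, streaming the dask array into a fresh memory-mapped file (the name out.array is illustrative):

out = np.memmap('out.array', dtype='float64', mode='w+', shape=c.shape)
da.store(c, out)   # write the dask array block-wise into the memmap target
out.flush()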

If you use order='F', it leads to another problem: when you load the file the next time, it will be quite a mess, even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.

import numpy as np

fp = ...             # placeholder: your old memmap
shape = fp.shape
data = ...           # placeholder: the ndarray to append along the last axis
data_shape = data.shape
new_file_name = ...  # placeholder: path for the concatenated file

# leading dimensions of fp and data must match; only the last axis grows
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
# mode='w+' creates the new file; 'r+' would fail if it does not exist yet
new_fp = np.memmap(new_file_name, dtype='float32', mode='w+', shape=concat_shape)
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
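As a small usage sketch (file names and values are made up), replace the placeholder lines at the top of the snippet with concrete values, e.g.:

fp = np.memmap('old.array', dtype='float32', mode='w+', shape=(4, 8))
fp[:] = 1.0
data = np.full((4, 2), 2.0, dtype='float32')
new_file_name = 'combined.array'
# ...then run the rest of the snippet; the result has shape (4, 10),
# with ones in the first 8 columns and twos in the last 2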
