
How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I'm expecting that it has to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.

I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with little memory (< 4 GB available).

However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.

Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array, won't I run out of memory on a small-memory system?

Here's an example of a test which I expected to fail:

On a large-memory system, create the array:

import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB

Now, on a machine with just 2 GB of memory, this fails as expected:

a = np.load('a.npy')

But these two will succeed, as expected:

a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
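To see where each mode's changes actually end up, here is a small sketch using a tiny stand-in array (the file name and size are made up for illustration; the modes behave the same as with the 4 GB array):

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for the big array: same modes, negligible footprint.
path = os.path.join(tempfile.mkdtemp(), 'small.npy')
np.save(path, np.zeros(5, dtype='float32'))

# 'c' (copy-on-write): writes go to private pages, never to the file.
a = np.load(path, mmap_mode='c')
a[:] = 1.0
del a                        # discard the mapping and its private pages
after_c = np.load(path)      # file is unchanged: still all zeros

# 'r+': writes go back to the file once flushed.
b = np.load(path, mmap_mode='r+')
b[:] = 1.0
b.flush()
del b
after_rplus = np.load(path)  # file now holds the modified data
```
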

Issue 1: Running this code, which modifies the memmapped array row by row, runs out of memory (it fails regardless of r+/c mode):

for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])

Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?

Issue 2: When I force numpy to flush the changes every once in a while, both the r+ and c modes successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must amount to over 3 GB, don't cause out-of-memory errors?

for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
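One way to keep both memory usage and the original file under control is to leave the input mapped read-only and stream per-instance results into a separate output memmap, flushing each chunk so dirty pages can be reclaimed. This is a sketch, not the poster's code; the file names, chunk size, and the doubling transform are placeholders:

```python
import os
import tempfile

import numpy as np

# Placeholder input standing in for the large on-disk array.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'src.npy')
np.save(src_path, np.arange(1000, dtype='float32').reshape(100, 10))

src = np.load(src_path, mmap_mode='r')   # read-only: original never changes
out_path = os.path.join(tmpdir, 'out.npy')
out = np.lib.format.open_memmap(out_path, mode='w+',
                                dtype='float32', shape=src.shape)

chunk = 25                               # rows per chunk
for start in range(0, src.shape[0], chunk):
    stop = start + chunk
    out[start:stop] = src[start:stop] * 2.0  # per-instance transform
    out.flush()                          # push dirty pages to the output file

result = np.load(out_path)
```
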

Numpy isn't doing anything clever here; it just defers to the built-in mmap module, which has an access argument that:

accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot
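This copy-on-write behavior is easy to observe directly at the mmap level. A minimal sketch with ACCESS_COPY, the mode numpy uses for mmap_mode='c' (the temp file and its contents are just for illustration):

```python
import mmap
import os
import tempfile

# Create a small file to map.
fd, path = tempfile.mkstemp()
os.write(fd, b'hello world')
os.close(fd)

with open(path, 'rb') as f:
    # ACCESS_COPY: the mapping is writable in memory,
    # but writes are never carried through to the file.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    m[0:5] = b'HELLO'        # allowed even though the file is opened read-only
    modified = bytes(m[:])
    m.close()

with open(path, 'rb') as f:
    on_disk = f.read()       # the file itself is untouched
```
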

On Linux, this works by calling the mmap system call with

MAP_PRIVATE

Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.

Regarding your question:

The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must amount to over 3 GB, don't cause out-of-memory errors?

The changes likely are written to disk, just not to the file you opened. The modified pages are private anonymous memory, so under memory pressure the kernel can page them out to swap rather than keeping them all resident.
