Memory-efficient way to generate a large numpy array containing random boolean values

Question

I need to create a large numpy array containing random boolean values without hitting the swap.

My laptop has 8 GB of RAM. Creating a (1200, 2e6) array takes less than 2 s and uses 2.29 GB of RAM:

>>> dd = np.ones((1200, int(2e6)), dtype=bool)
>>> dd.nbytes/1024./1024
2288.818359375

>>> dd.shape
(1200, 2000000)

For a relatively small (1200, 400e3) array, np.random.randint is still quite fast, taking roughly 5 s to generate a 458 MB array:

db = np.array(np.random.randint(2, size=(int(400e3), 1200)), dtype=bool)
print(db.nbytes/1024./1024., 'MB')

But if I double the size of the array to (1200, 800e3) I hit the swap, and it takes ~2.7 min to create db ;(

import timeit

cmd = """
import numpy as np
db = np.array(np.random.randint(2, size=(int(800e3), 1200)), dtype=bool)
print(db.nbytes/1024./1024., 'MB')"""

print(timeit.Timer(cmd).timeit(1))

Using random.getrandbits takes even longer (~8 min), and also uses the swap:

from random import getrandbits
db = np.array([not getrandbits(1) for x in range(int(1200*800e3))], dtype=bool)

Using np.random.randint for a (1200, 2e6) array just gives a MemoryError.

Is there a more efficient way to create a (1200, 2e6) random boolean array?

Answer 1

One problem with using np.random.randint is that it generates 64-bit integers, whereas numpy's np.bool dtype uses only 8 bits to represent each boolean value. You are therefore allocating an intermediate array 8x larger than necessary.
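
The 8x factor can be checked with a little arithmetic before allocating anything. The sketch below is illustrative only: array_megabytes is a hypothetical helper, not part of numpy, and the last lines assume NumPy 1.11 or later, where np.random.randint accepts a dtype argument so that a narrow integer type can be requested directly.

```python
import numpy as np

def array_megabytes(shape, dtype):
    """Size in MiB of an array with this shape and dtype (nothing is allocated)."""
    # Force a 64-bit accumulator so the element count cannot overflow.
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize / 1024.0 / 1024.0

shape = (1200, int(2e6))
print(array_megabytes(shape, np.int64))  # intermediate array of 64-bit integers
print(array_megabytes(shape, np.bool_))  # final boolean array, 8x smaller

# Assumption: NumPy >= 1.11. Ask randint for 8-bit integers directly,
# then reinterpret the 0/1 bytes as booleans without copying.
small = np.random.randint(2, size=(4, 5), dtype=np.uint8).view(np.bool_)
print(small.dtype, small.nbytes)
```

The boolean figure matches the 2288.8 MiB reported in the question; the 64-bit intermediate would need roughly eight times that, which is why the call fails on an 8 GB machine.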

A workaround that avoids intermediate 64-bit dtypes is to generate a string of random bytes using np.random.bytes, which can be converted to an array of 8-bit integers using np.fromstring. These integers can then be converted to boolean values, for example by testing whether they are less than 255 * p, where p is the desired probability of each element being True:

import numpy as np

def random_bool(shape, p=0.5):
    n = np.prod(shape)
    x = np.fromstring(np.random.bytes(n), np.uint8, n)
    return (x < 255 * p).reshape(shape)

Benchmark:

In [1]: shape = 1200, int(2E6)

In [2]: %timeit random_bool(shape)
1 loops, best of 3: 12.7 s per loop
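
On recent NumPy versions, np.fromstring is deprecated for binary data in favour of np.frombuffer. A minimal sketch of the same idea with that call follows; the name random_bool_frombuffer is made up for illustration, and timings are not given because they have not been measured here.

```python
import numpy as np

def random_bool_frombuffer(shape, p=0.5):
    """Hypothetical variant of random_bool that uses np.frombuffer."""
    n = int(np.prod(shape, dtype=np.int64))
    # One random byte per output element, viewed as uint8 without copying.
    x = np.frombuffer(np.random.bytes(n), dtype=np.uint8)
    # The comparison allocates the boolean result (one byte per element).
    return (x < 255 * p).reshape(shape)

demo = random_bool_frombuffer((100, 1000))
print(demo.shape, demo.dtype, demo.mean())
```

Peak memory is about twice the size of the output (the random bytes plus the boolean result), rather than the nine times needed when a 64-bit intermediate is created.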

One important caveat is that the probability is quantized to a multiple of 1/256, because only 256 distinct byte values are being compared against the threshold (for an exact multiple of 1/256 such as p=1/2 this should not affect accuracy).
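
To see which probability a given p actually yields, one can count how many of the 256 possible byte values pass the x < 255 * p test. The helper below, effective_probability, is a hypothetical name introduced only for this check.

```python
import numpy as np

def effective_probability(p):
    """Fraction of the 256 byte values that satisfy x < 255 * p."""
    all_bytes = np.arange(256, dtype=np.uint8)
    return np.count_nonzero(all_bytes < 255 * p) / 256.0

print(effective_probability(0.5))  # 128 of 256 byte values pass, so exactly 0.5
print(effective_probability(0.3))  # 77 of 256 byte values pass, about 0.3008
```

If finer resolution than 1/256 is needed, the same comparison can be applied to wider random integers, at the cost of a proportionally larger intermediate array.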

Update:

An even faster method is to exploit the fact that you only need to generate a single random bit per 0 or 1 in your output array. You can therefore create a random array of 8-bit integers 1/8th the size of the final output, then convert it to np.bool using np.unpackbits:

def fast_random_bool(shape):
    n = np.prod(shape)
    nb = -(-n // 8)     # ceiling division
    b = np.fromstring(np.random.bytes(nb), np.uint8, nb)
    return np.unpackbits(b)[:n].reshape(shape).view(np.bool)

For example:

In [3]: %timeit fast_random_bool(shape)
1 loops, best of 3: 5.54 s per loop
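
The bit-unpacking approach can also be written with np.frombuffer instead of the deprecated binary mode of np.fromstring. This is a sketch under that assumption, with the illustrative name fast_random_bool_frombuffer; note that it only covers p = 0.5, since every bit is equally likely to be 0 or 1.

```python
import numpy as np

def fast_random_bool_frombuffer(shape):
    """Hypothetical variant of fast_random_bool that uses np.frombuffer."""
    n = int(np.prod(shape, dtype=np.int64))
    nb = -(-n // 8)  # ceiling division: bytes needed to hold n bits
    packed = np.frombuffer(np.random.bytes(nb), dtype=np.uint8)
    # unpackbits expands every byte into eight uint8 values that are 0 or 1;
    # drop the surplus bits from the last byte.
    bits = np.unpackbits(packed)[:n]
    # Reshape and reinterpret the 0/1 bytes as booleans, both without copying.
    return bits.reshape(shape).view(np.bool_)

demo = fast_random_bool_frombuffer((3, 7))
print(demo.shape, demo.dtype)
```

Only n/8 random bytes are drawn, so the peak memory use is roughly 1.125 times the size of the final array.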

Answer 1: 13 votes, accepted, 2015-12-27 23:24:28