
Efficiently indexing numpy array with a numpy array

I have a very (very, very) large two dimensional array - on the order of a thousand columns, but a couple of million rows (enough that it doesn't fit into memory on my 32GB machine). I want to compute the variance of each of the thousand columns. One key fact which helps: my data is 8-bit unsigned ints.

Here's how I'm planning on approaching this. I will first construct a new two dimensional array called counts with shape (1000, 256), with the idea that counts[i,:] == np.bincount(bigarray[:,i]). Once I have this array, it will be trivial to compute the variance.
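To make that last step concrete, the per-column variance falls straight out of the histogram, since E[x] and E[x^2] are just weighted sums over the 256 possible values. A minimal sketch (the function name is mine, not from the question):

```python
import numpy as np

def variance_from_counts(counts):
    """Per-column population variance from a (ncols, 256) histogram of uint8 data."""
    vals = np.arange(256)
    n = counts.sum(axis=1)                        # number of samples per column
    mean = (counts * vals).sum(axis=1) / n        # E[x]
    mean_sq = (counts * vals**2).sum(axis=1) / n  # E[x^2]
    return mean_sq - mean**2                      # Var[x] = E[x^2] - E[x]^2

# sanity check against np.var on a small random array
data = np.random.randint(0, 256, size=(10000, 5)).astype(np.uint8)
counts = np.stack([np.bincount(data[:, i], minlength=256) for i in range(5)])
assert np.allclose(variance_from_counts(counts), data.var(axis=0))
```

Note this is the population variance (matching `np.var`'s default `ddof=0`); pass-through of a `ddof` correction would only change the divisor.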

Trouble is, I'm not sure how to compute it efficiently (this computation must be run in real-time, and I'd like bandwidth to be limited by how fast my SSD can return the data). Here's something which works, but is god-awful slow:

counts = np.zeros((1000, 256), dtype=np.int64)
for row in iterator_over_bigaray_rows():
    for i,val in enumerate(row):
        counts[i,val] += 1

Is there any way to write this to run faster? Something like this:

counts = np.zeros((1000, 256), dtype=np.int64)
for row in iterator_over_bigaray_rows():
    counts[i,:] = ...  # magic np one-liner to do what I want

I think this is what you want:

counts[np.arange(1000), row] += 1
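On a toy example, this fancy-indexed increment does the whole per-row update in one vectorized step - each column i gets its bin counts[i, row[i]] bumped by one (a small illustration of the line above, not code from the answer):

```python
import numpy as np

counts = np.zeros((4, 256), dtype=np.int64)
row = np.array([7, 7, 0, 255], dtype=np.uint8)

# increment counts[i, row[i]] for every column i at once
counts[np.arange(4), row] += 1

assert counts[0, 7] == 1 and counts[1, 7] == 1
assert counts[2, 0] == 1 and counts[3, 255] == 1
assert counts.sum() == 4
```

This is safe here because each (i, row[i]) pair hits a distinct element; repeated indices within one fancy-indexed `+=` would otherwise need `np.add.at`.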

But if your array has millions of rows, you are still going to have to iterate over millions of them. The following trick gives close to a 5x speed-up on my system:

chunk = np.random.randint(256, size=(1000, 1000))

def count_chunk(chunk):
    rows, cols = chunk.shape
    col_idx = np.arange(cols) * 256
    counts = np.bincount((col_idx[None, :] + chunk).ravel(),
                         minlength=256*cols)
    return counts.reshape(-1, 256)

def count_chunk_by_rows(chunk):
    counts = np.zeros(chunk.shape[1:] + (256,), dtype=np.int64)
    indices = np.arange(chunk.shape[-1])
    for row in chunk:
        counts[indices, row] += 1
    return counts

And now:

In [2]: c = count_chunk_by_rows(chunk)

In [3]: d = count_chunk(chunk)

In [4]: np.all(c == d)
Out[4]: True

In [5]: %timeit count_chunk_by_rows(chunk)
10 loops, best of 3: 80.5 ms per loop

In [6]: %timeit count_chunk(chunk)
100 loops, best of 3: 13.8 ms per loop
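Putting the pieces together, a chunked driver would accumulate a single int64 counts array over all chunks and only then derive the variances. A sketch under the assumption that the data arrives as row-blocks of uint8 (here simulated with `np.array_split`; the driver function names are mine):

```python
import numpy as np

def count_chunk(chunk):
    # offset each column into its own 256-wide bin range, then one big bincount
    rows, cols = chunk.shape
    col_idx = np.arange(cols) * 256
    counts = np.bincount((col_idx[None, :] + chunk).ravel(),
                         minlength=256 * cols)
    return counts.reshape(-1, 256)

def column_variances(chunks, ncols):
    """Accumulate per-column histograms over row-chunks, then derive variances."""
    counts = np.zeros((ncols, 256), dtype=np.int64)
    for chunk in chunks:
        counts += count_chunk(chunk)
    vals = np.arange(256)
    n = counts.sum(axis=1)
    mean = (counts * vals).sum(axis=1) / n
    return (counts * vals**2).sum(axis=1) / n - mean**2

# simulate streaming from disk by splitting one array into row chunks
data = np.random.randint(0, 256, size=(5000, 8)).astype(np.uint8)
chunks = np.array_split(data, 5)
assert np.allclose(column_variances(chunks, 8), data.var(axis=0))
```

Because only the (1000, 256) counts array is held between chunks, memory use is independent of the number of rows, which is the point of the whole approach.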
