Is there a faster way than looping over numpy arrays?

Question

If I have two numpy arrays of values, how can I quickly make a third array that gives me the number of times each pair of values occurs in the first two arrays?

For example:

x = np.round(np.random.random(2500),2)
xIndex = np.linspace(0, 1, 100)

y = np.round(np.random.random(2500)*10,2)
yIndex = np.linspace(0, 10, 1000)

z = np.zeros((100,1000))

Right now, I'm doing the following loop (which is prohibitively slow):

for m in x:
    for n in y:
        q = np.where(xIndex == m)[0][0]
        l = np.where(yIndex == n)[0][0]
        z[q][l] += 1

Then I can do a contour plot (or heat map, or whatever) of xIndex, yIndex, and z. But I know this isn't a "Pythonic" way of solving the problem, and there's no way I can run it over the hundreds of millions of data points I have in anything approaching a reasonable timeframe.

How do I do this the right way? Thanks for reading!

Answer 1

You can shorten the code dramatically.

First, since you're binning on a linear scale, you can eliminate the explicit arrays xIndex and yIndex entirely. You can express the exact indices into z as

xi = np.round(np.random.random(2500) * 100).astype(int)
yi = np.round(np.random.random(2500) * 1000).astype(int)

Second, you don't need the loop. The issue with using the normal + operator (aka np.add) on a fancy-indexed array is that it's buffered. A consequence is that you won't get the right count for multiple occurrences of the same index. Fortunately, ufuncs have an at method to handle that, and add is a ufunc.
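To make the buffering point concrete, here is a minimal sketch with a tiny, made-up index array (idx is hypothetical, not from the question). It compares a fancy-indexed += with np.add.at when one index repeats:

```python
import numpy as np

idx = np.array([0, 0, 2])  # hypothetical indices; 0 appears twice

# Buffered: the repeated index 0 is only incremented once.
buffered = np.zeros(3)
buffered[idx] += 1

# Unbuffered: every occurrence of an index is accumulated.
unbuffered = np.zeros(3)
np.add.at(unbuffered, idx, 1)

print(buffered)    # [1. 0. 1.]
print(unbuffered)  # [2. 0. 1.]
```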

Third, and finally, broadcasting allows you to specify how to mesh the arrays for a fancy index. Note that rounding can produce the indices 100 and 1000, so z needs shape (101, 1001) rather than (100, 1000) to avoid an IndexError:

np.add.at(z, (xi[:, None], yi), 1)
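As a sanity check, the broadcast call can be compared against the original double loop on small inputs. This is only a sketch: the array sizes, the seed, and the names z_loop, z_fast and z_outer are made up for illustration. The last line shows an alternative that is not part of the answer: since every x index is paired with every y index, the same counts are the outer product of two np.bincount results.

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.integers(0, 4, size=6)  # hypothetical row indices in 0..3
yi = rng.integers(0, 5, size=7)  # hypothetical column indices in 0..4

# Reference: the explicit double loop over every (xi, yi) combination.
z_loop = np.zeros((4, 5))
for q in xi:
    for l in yi:
        z_loop[q, l] += 1

# Broadcasting xi[:, None] against yi visits the same 6 * 7 combinations.
z_fast = np.zeros((4, 5))
np.add.at(z_fast, (xi[:, None], yi), 1)

# Alternative (an assumption, not from the answer): outer product of per-axis counts.
z_outer = np.outer(np.bincount(xi, minlength=4), np.bincount(yi, minlength=5))

print(np.array_equal(z_loop, z_fast), np.array_equal(z_loop, z_outer))
```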

If you're building a 2D histogram, you don't need to round the raw data. You can round just the indices instead. Note that this version pairs x[i] with y[i], rather than combining every x with every y:

x = np.random.random(2500)
y = np.random.random(2500) * 10

# Rounding can produce index 100 (or 1000), so allocate one extra bin per axis.
z = np.zeros((101, 1001))
np.add.at(z, (np.round(100 * x).astype(int), np.round(100 * y).astype(int)), 1)
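The two indexing styles count different things, which a quick check of the totals makes visible. The sketch below assumes a (101, 1001) grid so that the top rounded indices fit; the seed and the names z_pair and z_all are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(2500)
y = rng.random(2500) * 10

xi = np.round(100 * x).astype(int)  # values in 0..100 inclusive
yi = np.round(100 * y).astype(int)  # values in 0..1000 inclusive

# Pairwise: one count per (x[i], y[i]) sample, i.e. a 2D histogram.
z_pair = np.zeros((101, 1001))
np.add.at(z_pair, (xi, yi), 1)

# Broadcast: one count per (x[i], y[j]) combination, as in the original loop.
z_all = np.zeros((101, 1001))
np.add.at(z_all, (xi[:, None], yi), 1)

print(z_pair.sum())  # 2500.0
print(z_all.sum())   # 6250000.0
```

Which of the two is wanted depends on whether x and y are paired measurements or independent lists.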

Answer 1: 4, accepted, 2020-09-22 20:21:02
