Rearrange NumPy Array Efficiently

Let's say I have a simple 1D NumPy array:

import numpy as np
x = np.random.rand(1000)

And I retrieve the sorted indices:

idx = np.argsort(x)

However, I need to move a list of indices to the front of idx. So, let's say indices = [10, 20, 30, 40, 50] always need to be the first 5 entries, and the rest then follow in the order given by idx (minus the entries already found in indices).

A naive way to accomplish this would be:

indices = np.array([10, 20, 30, 40, 50])
out = np.empty(idx.shape[0], dtype=np.int64)
out[:indices.shape[0]] = indices  # the given indices go first

# Append the remaining sorted indices, skipping the ones already placed
n = indices.shape[0]
for i in range(idx.shape[0]):
    if idx[i] not in indices:
        out[n] = idx[i]
        n += 1

Is there a way to do this efficiently and, possibly, in-place?

You can use np.in1d to build a mask marking which entries of idx are contained in indices, and then concatenate both index arrays:

m = np.in1d(idx, indices)
out = np.r_[indices, idx[~m]]

Approach #1

One way is masking with np.isin:

mask = np.isin(idx, indices, invert=True)
out = np.r_[indices, idx[mask]]
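
To make the masking idea concrete, here is a minimal sketch on a six-element array whose values are made up for illustration. It uses np.isin (np.in1d is deprecated in recent NumPy releases) and np.concatenate instead of np.r_; both choices are assumptions of this sketch, not part of the original answers.

```python
import numpy as np

# Illustrative values, chosen so the sorted order is easy to follow by hand.
x = np.array([0.5, 0.1, 0.9, 0.3, 0.7, 0.2])
indices = np.array([4, 1])      # entries that must come first, in this order

idx = np.argsort(x)             # [1, 5, 3, 0, 4, 2]

# True where an entry of idx is NOT one of the pinned indices.
keep = ~np.isin(idx, indices)

# Pinned indices first, then the remaining indices in sorted order.
out = np.concatenate([indices, idx[keep]])
print(out)                      # [4 1 5 3 0 2]
```

The pinned indices keep the order in which they were given, and the rest keep their argsort order.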

Approach #2: Skipping the first argsort

Another way is to make the values at the given indices the smallest in the array, which forces them to the front when argsorting. We don't need idx for this method, since the solution performs the argsort itself:

def argsort_constrained(x, indices):
    xc = x.copy()
    xc[indices] = x.min()-np.arange(len(indices),0,-1)
    return xc.argsort()
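
As a sanity check, the sketch below compares argsort_constrained against the masking result on random data. The seed and the use of np.random.default_rng are assumptions made for reproducibility. The comparison relies on x having no ties and no NaN values: with a NaN present, x.min() would itself be NaN and the trick would not work.

```python
import numpy as np

def argsort_constrained(x, indices):
    xc = x.copy()
    # Assign values below the current minimum, increasing in the given order,
    # so indices[0] receives the smallest value and sorts first.
    xc[indices] = x.min() - np.arange(len(indices), 0, -1)
    return xc.argsort()

rng = np.random.default_rng(0)          # fixed seed, an assumption of this sketch
x = rng.random(1000)
indices = np.array([10, 20, 30, 40, 50])

# Reference result from the masking approach.
idx = np.argsort(x)
expected = np.concatenate([indices, idx[~np.isin(idx, indices)]])

result = argsort_constrained(x, indices)
same = np.array_equal(result, expected)
print(result[:5], same)
```

Because the replacement values step up by one in the order the indices were given, the pinned entries come out in exactly that order.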

Benchmarking - Closer look

Let's study how skipping the initial computation of idx = np.argsort(x) helps the second approach.

We will start off with the given sample:

In [206]: x = np.random.rand(1000)

In [207]: indices = np.array([10, 20, 30, 40, 50])

In [208]: %timeit argsort_constrained(x, indices)
38.6 µs ± 1.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [209]: idx = np.argsort(x)

In [211]: %timeit np.argsort(x)
27.7 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [212]: %timeit in1d_masking(x, idx, indices)
     ...: %timeit isin_masking(x, idx, indices)
44.4 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.7 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that with these small datasets you could do better by using np.concatenate in place of np.r_.

So, argsort_constrained has a total runtime cost of around 38.6 µs, whereas the two masking approaches each incur around 27.7 µs for the initial argsort on top of their individual timings.

Let's scale everything up by 10x and repeat the experiments:
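
The benchmark calls in1d_masking and isin_masking, but the text never defines them. A plausible reconstruction of isin_masking, assuming it simply wraps the Approach #1 code with the same three-argument signature, is sketched below next to a hypothetical np.concatenate variant (isin_masking_concat is a name invented here). in1d_masking would presumably be the same with np.in1d in place of np.isin; it is left out because np.in1d is deprecated in recent NumPy releases.

```python
import numpy as np

def isin_masking(x, idx, indices):
    # Assumed wrapper around Approach #1; x is unused and kept only to match
    # the call signature seen in the benchmark.
    mask = np.isin(idx, indices, invert=True)
    return np.r_[indices, idx[mask]]

def isin_masking_concat(x, idx, indices):
    # Hypothetical variant: same logic, np.concatenate instead of np.r_.
    mask = np.isin(idx, indices, invert=True)
    return np.concatenate([indices, idx[mask]])

x = np.random.rand(1000)
indices = np.array([10, 20, 30, 40, 50])
idx = np.argsort(x)

a = isin_masking(x, idx, indices)
b = isin_masking_concat(x, idx, indices)
print(np.array_equal(a, b))
```

Both functions return the same array, so any timing difference between them comes from np.r_ versus np.concatenate.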

In [213]: x = np.random.rand(10000)

In [214]: indices = np.sort(np.random.choice(len(x), 50, replace=False))

In [215]: %timeit argsort_constrained(x, indices)
740 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [216]: idx = np.argsort(x)

In [217]: %timeit np.argsort(x)
731 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [218]: %timeit in1d_masking(x, idx, indices)
     ...: %timeit isin_masking(x, idx, indices)
1.07 ms ± 47.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.02 ms ± 4.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Again, the individual runtime costs of the masking approaches are higher than that of argsort_constrained. This trend should continue as we scale up further.
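
To compare end-to-end costs, the argsort time has to be added to each masking time, since the masking functions take a precomputed idx. The small sketch below does that arithmetic with the mean timings quoted above; the dictionary layout is just one convenient way to organise the numbers.

```python
# Mean timings in microseconds, copied from the %timeit output above.
timings = {
    1_000: {"argsort_constrained": 38.6, "argsort": 27.7,
            "in1d_masking": 44.4, "isin_masking": 50.7},
    10_000: {"argsort_constrained": 740.0, "argsort": 731.0,
             "in1d_masking": 1070.0, "isin_masking": 1020.0},
}

totals = {}
for n, t in timings.items():
    totals[n] = {
        # Approach #2 does its own argsort, so its timing is already the total.
        "argsort_constrained": t["argsort_constrained"],
        # The masking approaches need the initial argsort on top.
        "in1d_masking": t["argsort"] + t["in1d_masking"],
        "isin_masking": t["argsort"] + t["isin_masking"],
    }

for n, t in totals.items():
    print(n, {name: round(us, 1) for name, us in t.items()})
```

With these figures the masking totals come to roughly 72 µs and 78 µs against 38.6 µs for 1,000 elements, and roughly 1.80 ms and 1.75 ms against 0.74 ms for 10,000 elements.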
