
Numpy: get the lowest N elements of an array X, considering only elements whose index is not an element in another array Y

To get the indices of the lowest 10 values of an array X, I do something like:

lowest10 = np.argsort(X)[:10]

What is the most efficient way, avoiding loops, to filter the results so that I get the lowest 10 values whose index is not an element of another array Y?

So for example, if the array Y is:

[2,20,51]

then X[2], X[20] and X[51] shouldn't be taken into consideration when computing the lowest 10.

After some benchmarking, here is my humble recommendation:

Swapping out appears to be more or less always faster than masking (even if 99% of X is forbidden). So use something along the lines of

swap = X[Y]
X[Y] = np.inf

Sorting is expensive, so use argpartition and only sort what's necessary. In the snippet below, Xfiltered is X with the forbidden entries set to np.inf as above; write the saved values back with X[Y] = swap when done.

lowest10 = np.argpartition(Xfiltered, 10)[:10]
lowest10 = lowest10[np.argsort(Xfiltered[lowest10])]
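
Putting the two steps together, a minimal sketch might look like the following. The function name lowest_excluding and the parameter n are made up for illustration. It assumes X is a 1-D float array (so that np.inf can be stored in it) and that n is smaller than X.size.

```python
import numpy as np

def lowest_excluding(X, Y, n=10):
    """Indices of the n smallest entries of X, ignoring positions listed in Y.

    Hypothetical helper: assumes X is a 1-D float array and n < X.size.
    """
    saved = X[Y]              # fancy indexing copies the forbidden values
    X[Y] = np.inf             # push them to the far end of any ordering
    try:
        idx = np.argpartition(X, n)[:n]   # n smallest, in arbitrary order
        idx = idx[np.argsort(X[idx])]     # sort only those n by value
    finally:
        X[Y] = saved          # restore X even if something went wrong
    return idx

X = np.array([5.0, 1.0, 0.5, 3.0, 2.0, 4.0, 0.1, 6.0])
Y = np.array([2, 6])          # positions 2 and 6 must be ignored
print(lowest_excluding(X, Y, n=3))   # indices of the values 1.0, 2.0, 3.0
```

Note that X is modified temporarily, so this is not safe if another thread reads X at the same time. If fewer than n entries are allowed, the result will also contain forbidden indices.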

Here are some benchmarks:

import numpy as np
from timeit import timeit

def swap_out():
    global sol

    swap = X[Y]
    X[Y] = np.inf

    sol = np.argpartition(X, K)[:K]
    sol = sol[np.argsort(X[sol])]

    X[Y] = swap

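# app1..app3 store their result in the global sol (like swap_out),
# so the print calls below show each approach's own output
def app1():
    global sol
    sidx = X.argsort()
    sol = sidx[~np.in1d(sidx, Y)][:K]

def app2():
    global sol
    sidx = np.argpartition(X,range(K+Y.size))
    sol = sidx[~np.in1d(sidx, Y)][:K]

def app3():
    global sol
    sidx = np.argpartition(X,K+Y.size)
    sol = sidx[~np.in1d(sidx, Y)][:K]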


K = 10    # number of small elements wanted
N = 10000 # size of X
M = 10    # size of Y
S = 10    # number of repeats in benchmark

X = np.random.random((N,))
Y = np.random.choice(N, (M,))

so = timeit(swap_out, number=S)
print(sol)
print(X[sol])
d1 = timeit(app1, number=S)
print(sol)
print(X[sol])
d2 = timeit(app2, number=S)
print(sol)
print(X[sol])
d3 = timeit(app3, number=S)
print(sol)
print(X[sol])

print('pp', f'{so:8.5f}', '  d1(um)', f'{d1:8.5f}', '  d2', f'{d2:8.5f}', '  d3', f'{d3:8.5f}')
# pp  0.00053   d1(um)  0.00731   d2  0.00313   d3  0.00149

You can work on a subset of the original array using numpy.delete():

lowest10 = np.argsort(np.delete(X, Y))[:10]

Since delete builds a new array containing only the entries to keep, it costs one extra pass over X, which is linear in the size of X.

Warning: This solution works on a subset of the original X array (X without the elements indexed by Y), so the result is the lowest 10 of that subset, and the returned indices refer to positions in that subset rather than in X.
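
If indices into the original X are needed, one option is to delete from an index array as well and use it to translate back. A rough sketch, with the variable names kept and order chosen for illustration:

```python
import numpy as np

X = np.array([5.0, 1.0, 0.5, 3.0, 2.0, 4.0, 0.1, 6.0])
Y = np.array([2, 6])

# Positions of X that survive the deletion, in their original order.
kept = np.delete(np.arange(X.size), Y)

# argsort on the reduced array yields positions within `kept` ...
order = np.argsort(X[kept])[:3]

# ... which `kept` translates back to positions in the original X.
lowest3 = kept[order]
print(lowest3)      # indices into X
print(X[lowest3])   # the corresponding values
```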

Here's one approach -

sidx = X.argsort()
idx_out = sidx[~np.in1d(sidx, Y)][:10]

Sample run -

# Setup inputs
In [141]: X = np.random.choice(range(60), 60)

In [142]: Y = np.array([2,20,51])

# For testing, let's set the Y positions as 0s and 
# we want to see them skipped in o/p
In [143]: X[Y] = 0

# Use proposed approach
In [144]: sidx = X.argsort()

In [145]: X[sidx[~np.in1d(sidx, Y)][:10]]
Out[145]: array([ 0,  2,  4,  5,  5,  9,  9, 10, 12, 14])

# Print the first 13 numbers and skip three 0s and 
# that should match up with the output from proposed approach
In [146]: np.sort(X)[:13]
Out[146]: array([ 0,  0,  0,  0,  2,  4,  5,  5,  9,  9, 10, 12, 14])

Alternatively, for performance, we might want to use np.argpartition, like so -

sidx = np.argpartition(X,range(10+Y.size))
idx_out = sidx[~np.in1d(sidx, Y)][:10]

This would be beneficial if the length of X is much larger than 10.

If you don't care about the order of elements in that list of 10 indices, then for a further boost we can simply pass the scalar length instead of a range to np.argpartition: np.argpartition(X, 10+Y.size). Note that the entries before position 10+Y.size then come back unordered, so the first 10 survivors of the filter are guaranteed to be the 10 lowest only when every index in Y falls among those leading entries.
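
One way to use a scalar kth and still get the n lowest in order, regardless of where the indices in Y land, is to sort only the few survivors of the filter. This is a sketch under the assumption that X is 1-D and has at least n allowed entries. The name lowest_partitioned is hypothetical, and np.isin is used as the newer spelling of np.in1d.

```python
import numpy as np

def lowest_partitioned(X, Y, n=10):
    """Indices of the n smallest allowed entries of X, ordered by value.

    Hypothetical helper; Y holds the forbidden indices.
    """
    n_ext = n + Y.size                     # room for every forbidden index
    if n_ext >= X.size:
        cand = np.arange(X.size)           # too few elements to partition
    else:
        # the n_ext smallest entries, in arbitrary order
        cand = np.argpartition(X, n_ext)[:n_ext]
    cand = cand[~np.isin(cand, Y)]         # drop forbidden indices
    return cand[np.argsort(X[cand])][:n]   # sort the survivors only

rng = np.random.default_rng(0)
X = rng.random(1000)
Y = rng.choice(1000, 20, replace=False)
got = lowest_partitioned(X, Y, n=10)

# Cross-check against a full sort of the allowed positions.
allowed = np.setdiff1d(np.arange(X.size), Y)
expected = allowed[np.argsort(X[allowed])][:10]
print(np.array_equal(got, expected))
```

The final argsort touches at most n + Y.size elements, so its cost stays small as long as Y is small compared to X.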

We can optimize np.in1d with searchsorted to get one more approach (listing next).


Listing all the approaches discussed in this post -

def app1(X, Y, n=10):
    sidx = X.argsort()
    return sidx[~np.in1d(sidx, Y)][:n]

def app2(X, Y, n=10):
    sidx = np.argpartition(X,range(n+Y.size))
    return sidx[~np.in1d(sidx, Y)][:n]

def app3(X, Y, n=10):
    sidx = np.argpartition(X,n+Y.size)
    return sidx[~np.in1d(sidx, Y)][:n]


def app4(X, Y, n=10):
    n_ext = n+Y.size
    sidx = np.argpartition(X,np.arange(n_ext))[:n_ext]
    ssidx = sidx.argsort()
    mask = np.ones(ssidx.size,dtype=bool)
    search_idx = np.searchsorted(sidx, Y, sorter=ssidx)
    search_idx[search_idx==sidx.size] = 0
    idx = ssidx[search_idx]    
    mask[idx[sidx[idx] == Y]] = 0
    return sidx[mask][:n]
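
The searchsorted-based membership test can also be seen in isolation. The sketch below is a simpler variant than app4: it sorts Y once and looks the candidate indices up in it, rather than searching for Y among the candidates. The function name in_sorted is made up for illustration.

```python
import numpy as np

def in_sorted(candidates, Y):
    """Boolean mask: True where an entry of `candidates` occurs in Y.

    Hypothetical stand-in for np.in1d(candidates, Y), built on searchsorted.
    """
    if Y.size == 0:
        return np.zeros(candidates.shape, dtype=bool)
    Ys = np.sort(Y)
    pos = np.searchsorted(Ys, candidates)   # insertion point of each candidate
    pos[pos == Ys.size] = 0                 # clamp out-of-range positions
    return Ys[pos] == candidates            # a hit only if the value matches

sidx = np.array([7, 2, 20, 5, 51, 3])
Y = np.array([2, 20, 51])
mask = in_sorted(sidx, Y)
print(sidx[~mask])   # candidates that are not in Y
```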
