Fastest way to filter a numpy array by a set of values

I am pretty new to numpy, and I am using PyPy 2.2, which has limited numpy support (see http://buildbot.pypy.org/numpy-status/latest.html ). What I'm trying to do is filter an array by a set of values, i.e. keep a subarray (row) if it contains a value that is in the set. I can do this with a list comprehension, but I'd rather avoid the intermediate list: on longer arrays it isn't fast, and I can't help but think numpy filtering will be faster.

>>> import numpy as np
>>> a = np.array([[   368,    322, 175238,      2],
...               [   430,    382, 121486,      2],
...               [   451,    412, 153521,      2],
...               [   480,    442, 121468,      2],
...               [   517,    475, 109543,      2],
...               [   543,    503, 121471,      2],
...               [   576,    537, 100566,      2],
...               [   607,    567, 121473,      2],
...               [   640,    597, 153561,      2]])
>>> b = {121486, 153521, 121473}
>>> np.array([x for x in a if x[2] in b])
array([[   430,    382, 121486,      2],
       [   451,    412, 153521,      2],
       [   607,    567, 121473,      2]])

You can do it in one line, but you have to use list(b), so it might not actually be any faster:

>>> a[np.in1d(a[:,2], list(b))]
array([[   430,    382, 121486,      2],
       [   451,    412, 153521,      2],
       [   607,    567, 121473,      2]])

It works because np.in1d tells you which items of its first argument are in the second:

>>> np.in1d(a[:,2], list(b))
array([False,  True,  True, False, False, False, False,  True, False], dtype=bool)
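On recent NumPy releases the same mask can be built with np.isin, which the NumPy documentation recommends over np.in1d for new code. A minimal sketch, assuming a NumPy version that provides np.isin (1.13 or later) and reusing the a and b from the question:

```python
import numpy as np

a = np.array([[368, 322, 175238, 2],
              [430, 382, 121486, 2],
              [451, 412, 153521, 2],
              [480, 442, 121468, 2],
              [517, 475, 109543, 2],
              [543, 503, 121471, 2],
              [576, 537, 100566, 2],
              [607, 567, 121473, 2],
              [640, 597, 153561, 2]])
b = {121486, 153521, 121473}

# np.isin expects an array-like for the test values, so the set is
# converted to a list first, just as with np.in1d.
mask = np.isin(a[:, 2], list(b))

# Boolean indexing keeps only the rows where the mask is True.
filtered = a[mask]
print(filtered)
```

As with np.in1d, the list(b) conversion costs time proportional to the size of b, so this is not guaranteed to beat the list comprehension on small inputs.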

For large a and b, the set-based mask below is probably faster than your solution: it still uses b as a set, but builds only a boolean array instead of rebuilding the entire array one row at a time. (For large a and small b, I think np.in1d might be faster.)

ainb = np.array([x in b for x in a[:,2]])
a[ainb]
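If the intermediate Python list is the concern, the mask can also be filled straight from a generator. The following is only a sketch using np.fromiter, with a and b as in the question; whether it is actually faster depends on the interpreter and the input size, and np.fromiter being available under PyPy's numpy is an assumption that has not been checked here.

```python
import numpy as np

a = np.array([[368, 322, 175238, 2],
              [430, 382, 121486, 2],
              [451, 412, 153521, 2],
              [480, 442, 121468, 2],
              [517, 475, 109543, 2],
              [543, 503, 121471, 2],
              [576, 537, 100566, 2],
              [607, 567, 121473, 2],
              [640, 597, 153561, 2]])
b = {121486, 153521, 121473}

# The generator does one set lookup per row; count=len(a) tells NumPy
# the output length up front so it can allocate the mask once.
ainb = np.fromiter((x in b for x in a[:, 2]), dtype=bool, count=len(a))

filtered = a[ainb]
print(filtered)
```

The set lookups still run in the Python interpreter, so this variant mainly saves the temporary list rather than the per-element cost.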

For small a and large b, your own solution is probably fastest.

For relatively small inputs like those in your question, the fastest method is by far the naive one:

np.array([x for x in a if x[2] in b])

This is especially true for PyPy.

For larger inputs, @askewchan's NumPy solution may be faster:

a[np.in1d(a[:,2], list(b))]

However, when using CPython, a Numba-based implementation would be even faster (at all scales):

import numpy as np
import numba as nb


# Single pass: allocate room for every row, then trim to the rows kept.
@nb.jit
def custom_filter(arr, values):
    values = set(values)
    n, m = arr.shape
    result = np.empty((n, m), dtype=arr.dtype)
    k = 0  # number of rows kept so far
    for i in range(n):
        if arr[i, 2] in values:
            result[k, :] = arr[i, :]
            k += 1
    return result[:k, :].copy()


# Two passes: count the matching rows first, then allocate exactly that many.
@nb.jit
def custom_filter2(arr, values):
    values = set(values)
    n, m = arr.shape
    k = 0
    for i in range(n):
        if arr[i, 2] in values:
            k += 1
    result = np.empty((k, m), dtype=arr.dtype)
    k = 0  # reused as the write index into result
    for i in range(n):
        if arr[i, 2] in values:
            result[k, :] = arr[i, :]
            k += 1
    return result

A quick look at the benchmarks:

aa = np.tile(a, (1000, 1))
bb = set(list(range(121000, 122000)))


%timeit np.array([x for x in a if x[2] in b])
# 8.54 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit custom_filter(a, tuple(b))
# 1.59 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit custom_filter2(a, tuple(b))
# 1.45 µs ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a[np.in1d(a[:,2], tuple(b))]
# 25.2 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.array([x for x in aa if x[2] in b])
# 6.76 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit custom_filter(aa, tuple(b))
# 90.6 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit custom_filter2(aa, tuple(b))
# 135 µs ± 5.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit aa[np.in1d(aa[:, 2], tuple(b))]
# 147 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.array([x for x in aa if x[2] in bb])
# 7.26 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit custom_filter(aa, tuple(bb))
# 226 µs ± 5.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit custom_filter2(aa, tuple(bb))
# 278 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit aa[np.in1d(aa[:, 2], tuple(bb))]
# 756 µs ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
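The %timeit lines above require IPython. To rerun a comparison from a plain script, something like the following sketch could be used. The helper names (filter_listcomp, filter_isin, time_call) and the repeat counts are illustrative and not part of the original answer, np.isin stands in for np.in1d, and the Numba variants are left out so the script depends only on NumPy. No timings are shown because they vary by machine.

```python
import timeit

import numpy as np

a = np.array([[368, 322, 175238, 2],
              [430, 382, 121486, 2],
              [451, 412, 153521, 2],
              [480, 442, 121468, 2],
              [517, 475, 109543, 2],
              [543, 503, 121471, 2],
              [576, 537, 100566, 2],
              [607, 567, 121473, 2],
              [640, 597, 153561, 2]])
b = {121486, 153521, 121473}
aa = np.tile(a, (1000, 1))  # 9000 rows, as in the benchmark above


def filter_listcomp(arr, values):
    # The naive approach from the question.
    return np.array([x for x in arr if x[2] in values])


def filter_isin(arr, values):
    # Boolean-mask approach; np.isin used in place of np.in1d (assumption).
    return arr[np.isin(arr[:, 2], list(values))]


def time_call(func, *args, number=10, repeat=3):
    # Hypothetical helper: best average time per call, in seconds.
    timings = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(timings) / number


# Sanity check: both approaches must select the same rows.
assert np.array_equal(filter_listcomp(aa, b), filter_isin(aa, b))

for name, func in [("list comprehension", filter_listcomp),
                   ("np.isin mask", filter_isin)]:
    print(f"{name}: {time_call(func, aa, b) * 1e6:.1f} µs per call")
```

Taking the minimum over the repeats reduces noise from other processes, which differs from the mean ± std. dev. that %timeit reports.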
