
Numpy/Numba performance issues when processing conditions and masking logical_and

I want to process many rows of a DataFrame and check whether a condition is satisfied in every row for every column. I implemented a straightforward solution using a for loop and compiled it with Numba (test1). It runs very fast, but I thought vectorizing it would give even faster results.

I then tried a logical AND of all my condition arrays (test2), which is slightly faster, but I need a much faster solution, preferably < 1 ms for a million columns and 20 rows. In this example the condition to be passed (all True in all rows) has a probability of 1/32, and it will be even lower for real data. So it doesn't make sense to evaluate every condition (greater or less than 0) for every row, since a single False in any row already makes the overall result False.
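The 1/32 figure follows from five independent sign checks on values drawn uniformly from (-1, 1): each passes with probability 1/2, so (1/2)^5 = 1/32 ≈ 0.031. A quick sanity check, as a sketch in plain NumPy (the seed and the names `data` and `passes` are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
data = rng.uniform(-1, 1, size=(1_000_000, 10))

# A row passes only if columns 5..9 are all positive.
passes = (data[:, 5:10] > 0).all(axis=1)

expected = 0.5 ** 5        # 1/32 = 0.03125
observed = passes.mean()   # fraction of passing rows, close to 1/32
print(expected, observed)
```

With real data where the individual checks pass less often, the fraction of surviving rows would drop accordingly.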

So I wanted to use a masked array (test3): I compute the True/False values for the first row, then for each subsequent row evaluate only the entries that are still True. This avoids computing over the full length of the array, and the cost per additional row should even decrease, since fewer and fewer entries remain unmasked by the last rows.

Ironically, the solution I thought would be the fastest is 10x slower than the non-masked version. What is the issue here? Is it that masking and re-assigning True/False values is slower than just computing all the conditions?

Is there any way to speed this code up further? Thanks

import numpy as np
from numba import njit

@njit
def test1(df_np):
    truth_arr = np.full(df_np.shape[0], False)
    for i in range(df_np.shape[0]):
        truth_arr[i] = df_np[i, 5] > 0 and df_np[i, 6] > 0 and df_np[i, 7] > 0 and df_np[i, 8] > 0 and df_np[i, 9] > 0    

@njit
def test2(df_np):
    return True & \
        (df_np[:, 5] > 0) & \
        (df_np[:, 6] > 0) & \
        (df_np[:, 7] > 0) & \
        (df_np[:, 8] > 0) & \
        (df_np[:, 9] > 0)

@njit
def test3(df_np):
    truth_arr = np.full(df_np.shape[0], True)
    truth_arr[truth_arr] &= (df_np[:, 5][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 6][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 7][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 8][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 9][truth_arr] > 0)

The benchmarks are:

df_np = np.random.uniform(-1,1, size=(1_000_000, 10))

%%timeit
test1(df_np)
# 8.71 ms ± 93.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
test2(df_np)
# 4.6 ms ± 97 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
test3(df_np)
# 47.7 ms ± 825 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
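One plausible contributor to the slowdown of test3 (an assumption about the cause, not something measured here) is that boolean-mask indexing is not free. In plain NumPy, `col[mask]` gathers the selected elements into a freshly allocated array, and `mask[mask] &= ...` reads, combines, and scatters values back. Each masked step therefore does several passes over memory, where the unmasked version does one cheap comparison. The sketch below shows this in plain NumPy only; Numba's compiled code may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
data = rng.uniform(-1, 1, size=(1_000, 10))

mask = np.full(data.shape[0], True)
col = data[:, 5]

# Boolean indexing gathers the selected elements into a new array (a copy),
# unlike basic slicing, which returns a view.
gathered = col[mask]
print(np.shares_memory(gathered, data))  # False: the data was copied
print(np.shares_memory(col, data))       # True: a plain column slice is a view

# One masked step: gather, compare, AND, then scatter the result back.
mask[mask] &= col[mask] > 0

# The result matches the plain, unmasked comparison.
print(np.array_equal(mask, col > 0))     # True
```

Since a comparison against 0 costs about as much as the mask test itself, skipping it per element saves little, while the gather and scatter add extra memory traffic.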

You can craft a number of additional tests:

import numpy as np
import numba as nb


@nb.njit
def test1(df_np):
    truth_arr = np.full(df_np.shape[0], False)
    for i in range(df_np.shape[0]):
        truth_arr[i] = df_np[i, 5] > 0 and df_np[i, 6] > 0 and df_np[i, 7] > 0 and df_np[i, 8] > 0 and df_np[i, 9] > 0    
    return truth_arr


@nb.njit
def test2(df_np):
    return \
        (df_np[:, 5] > 0) & \
        (df_np[:, 6] > 0) & \
        (df_np[:, 7] > 0) & \
        (df_np[:, 8] > 0) & \
        (df_np[:, 9] > 0)


@nb.njit
def test3(df_np):
    truth_arr = np.full(df_np.shape[0], True)
    truth_arr[truth_arr] &= (df_np[:, 5][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 6][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 7][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 8][truth_arr] > 0)
    truth_arr[truth_arr] &= (df_np[:, 9][truth_arr] > 0)
    return truth_arr


@nb.njit(parallel=True)
def test4(df_np):
    n, m = df_np.shape
    truth_arr = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        truth_arr[i] = df_np[i, 5] > 0 and df_np[i, 6] > 0 and df_np[i, 7] > 0 and df_np[i, 8] > 0 and df_np[i, 9] > 0
    return truth_arr


@nb.njit
def test5(df_np):
    n, m = df_np.shape
    truth_arr = np.empty(n, dtype=np.bool_)
    for i in range(n):
        truth_arr[i] = df_np[i, 5] > 0
        for j in range(6, 10):
            if truth_arr[i]:
                truth_arr[i] &= df_np[i, j] > 0
            else:
                break
    return truth_arr


@nb.njit(parallel=True)
def test6(df_np):
    n, m = df_np.shape
    truth_arr = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        truth_arr[i] = df_np[i, 5] > 0
        for j in range(6, 10):
            if truth_arr[i]:
                truth_arr[i] &= df_np[i, j] > 0
            else:
                break
    return truth_arr


@nb.njit
def test7(df_np):
    n, m = df_np.shape
    truth_arr = np.empty(n, dtype=np.bool_)
    for i in range(n):
        truth_arr[i] = df_np[i, 5] > 0 and df_np[i, 6] > 0 and df_np[i, 7] > 0 and df_np[i, 8] > 0 and df_np[i, 9] > 0
    return truth_arr


@nb.njit
def test8(df_np):
    truth_arr = df_np[:, 5] > 0
    for j in range(6, 10):
        truth_arr &= (df_np[:, j] > 0)
    return truth_arr


def test9(df_np):
    truth_arr = df_np[:, 5] > 0
    for j in range(6, 10):
        truth_arr &= (df_np[:, j] > 0)
    return truth_arr

Which one comes out fastest will depend on your environment. I recommend trying them out yourself.

Timings on a Google Colab notebook look like:

funcs = test1, test2, test3, test4, test5, test6, test7, test8, test9
for func in funcs:
    func(df_np)  # trigger compilation
    print(f"{func.__name__}  ", end="")
    %timeit -n 4 -r 4 func(df_np)
# test1  13.4 ms ± 711 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test2  8.39 ms ± 328 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test3  63.9 ms ± 585 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test4  9.38 ms ± 241 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test5  14.1 ms ± 245 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test6  8.81 ms ± 453 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test7  13.2 ms ± 274 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test8  33.8 ms ± 546 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)
# test9  34.9 ms ± 340 µs per loop (mean ± std. dev. of 4 runs, 4 loops each)

test2() comes out as slightly faster than test6() and test4().
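The `%timeit` magic only works inside IPython/Jupyter. To reproduce the comparison from a plain script, something like the following sketch built on the standard-library `timeit` module could be used. The helper `bench` and the stand-in `reference` are hypothetical names, and the NumPy stand-in is there only so the example runs without Numba installed. Note that this reports the best run, where `%timeit` reports mean ± std. dev.

```python
import timeit
import numpy as np


def bench(func, arg, number=4, repeat=4):
    """Return the best per-call time in seconds (hypothetical helper)."""
    func(arg)  # warm-up call; for a Numba function this triggers compilation
    timer = timeit.Timer(lambda: func(arg))
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number


def reference(df_np):
    # Plain NumPy stand-in so the sketch runs without Numba installed.
    return (df_np[:, 5:10] > 0).all(axis=1)


df_np = np.random.uniform(-1, 1, size=(100_000, 10))
seconds = bench(reference, df_np)
print(f"{reference.__name__}  {seconds * 1e3:.3f} ms per call")
```

Any of the `test*` functions can be passed in place of `reference`.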


Note that there was an issue with the earlier implementation not returning the output, which caused Numba to "optimize" everything away (dead-code elimination).
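Because a function that returns nothing can look deceptively fast, it may be worth checking that every variant actually returns an array and that all variants agree before comparing timings. Below is a minimal sketch of such a check. `check_same_output`, `loop_version` and `vector_version` are illustrative names, and the two pure-Python/NumPy functions stand in for the Numba-compiled tests so the example runs on its own.

```python
import numpy as np


def loop_version(df_np):
    # Pure-Python stand-in for the loop-based tests (no Numba needed here).
    out = np.empty(df_np.shape[0], dtype=np.bool_)
    for i in range(df_np.shape[0]):
        out[i] = all(df_np[i, j] > 0 for j in range(5, 10))
    return out


def vector_version(df_np):
    # Stand-in for the vectorized tests: AND the column checks together.
    result = df_np[:, 5] > 0
    for j in range(6, 10):
        result &= df_np[:, j] > 0
    return result


def check_same_output(funcs, df_np):
    """Raise AssertionError if any function disagrees with the first one."""
    expected = funcs[0](df_np)
    for func in funcs[1:]:
        got = func(df_np)
        assert got is not None, f"{func.__name__} returned nothing"
        assert np.array_equal(got, expected), f"{func.__name__} disagrees"
    return expected


small = np.random.default_rng(0).uniform(-1, 1, size=(2_000, 10))
result = check_same_output([loop_version, vector_version], small)
print(result.shape, result.dtype)
```

Running the check on a small array keeps it cheap; the timed runs can then use the full-size input.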
