
Binning an array of values to the closest value in a discrete set using Numpy & Numba

I've got a function below which takes in an array of floats as well as an array of discrete integers. For all of the floats, I want them to be rounded to the closest integer in the list.

The below function works perfectly, where sHatV is an array of 10,000 floats and possible_locations is an array of 5 integers:

binnedV = [min(possible_locations, key=lambda x:abs(x-bv)) for bv in sHatV]

As this function is going to be called thousands of times, I'm trying to use the @numba.njit decorator to minimize computation time.

I thought about using np.digitize in my 'numbafied' function, but it maps out-of-bounds values to zero. I want everything to be binned to one of the values in possible_locations.

Overall, I need to write a numba-compatible function which takes every value in the first array of length N, finds the closest value to it in the second array, and returns that closest value, culminating in an array of length N with the binned values.

Any help is appreciated!

Here's a version that runs much faster, and is probably more "numbifiable" since it uses numpy functions instead of the implicit for loop of a list comprehension:

import numpy as np

sHatV = [0.33, 4.18, 2.69]
possible_locations = np.array([0, 1, 2, 3, 4, 5])

diff_matrix = np.subtract.outer(sHatV, possible_locations)
idx = np.abs(diff_matrix).argmin(axis=1)
result = possible_locations[idx]

print(result)
# output: [0 4 3]

The idea here is to calculate a difference matrix between sHatV and possible_locations. In this particular example, that matrix is:

array([[ 0.33, -0.67, -1.67, -2.67, -3.67, -4.67],
       [ 4.18,  3.18,  2.18,  1.18,  0.18, -0.82],
       [ 2.69,  1.69,  0.69, -0.31, -1.31, -2.31]])

Then, with np.abs(...).argmin(axis=1), we find, for each row, the index where the absolute difference is minimal. If we index the original possible_locations array by these indices, we get the answer.
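As an aside, when the discrete set is sorted (as it is in this example), np.searchsorted can replace the N×M difference matrix with a binary search per value. A sketch under that sorted-bins assumption (`bin_sorted` is a made-up helper name, not from the answer):

```python
import numpy as np

def bin_sorted(values, bins):
    # bins must be sorted ascending; find where each value would be inserted
    idx = np.searchsorted(bins, values)
    # clip so every value has both a left and a right neighbor
    idx = np.clip(idx, 1, len(bins) - 1)
    left = bins[idx - 1]
    right = bins[idx]
    # step back to the left neighbor when it is strictly closer (ties go right)
    idx = idx - (values - left < right - values)
    return bins[idx]

print(bin_sorted(np.array([0.33, 4.18, 2.69]), np.array([0, 1, 2, 3, 4, 5])))
# -> [0 4 3]
```

This is O(N log M) instead of O(N·M) and needs no N×M intermediate array, at the cost of requiring the bins to be sorted.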

Comparing the runtimes:

using a list comprehension

def f(possible_locations, sHatV):
    return [min(possible_locations, key=lambda x:abs(x-bv)) for bv in sHatV]


def test_f():
    possible_locations = np.array([0, 1, 2, 3, 4, 5])
    sHatV = np.random.uniform(0.1, 4.9, size=10_000)
    f(possible_locations, sHatV)


%timeit test_f()
# 187 ms ± 7.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

using a difference matrix

def g(possible_locations, sHatV):
    return possible_locations[np.abs(np.subtract.outer(sHatV, possible_locations)).argmin(axis=1)]


def test_g():
    possible_locations = np.array([0, 1, 2, 3, 4, 5])
    sHatV = np.random.uniform(0.1, 4.9, size=10_000)
    g(possible_locations, sHatV)

%timeit test_g()
# 556 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I would suggest sticking to numpy for this. The digitize function is close to what you need but requires a bit of modification:

  • implement rounding logic instead of floor/ceil
  • account for endpoint issues. The documentation says: If values in `x` are beyond the bounds of `bins`, 0 or ``len(bins)`` is returned as appropriate.
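The quoted endpoint behaviour is easy to see directly:

```python
import numpy as np

bins = np.array([0, 1, 2, 3])
x = np.array([-5.0, 1.5, 10.0])

# values below bins[0] get index 0; values above bins[-1] get len(bins)
print(np.digitize(x, bins))
# -> [0 2 4]
```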

Here's an example:

import numpy as np
sHatV = np.array([-99, 1.4999, 1.5, 3.1, 3.9, 99.5, 1000])
bins = np.arange(0,101)

def custom_round(arr, bins):
    # digitize against the midpoints between bins, so each value
    # rounds to the nearest bin rather than flooring to the one below
    bin_centers = (bins[:-1] + bins[1:]) / 2
    idx = np.digitize(arr, bin_centers)
    result = bins[idx]
    return result

assert np.all(custom_round(sHatV, bins) == np.array([0, 1, 2, 3, 4, 100, 100]))

And now my favorite part: how fast is numpy at this? I won't do a full scaling study; we'll just pick large arrays:

sHatV = 10009*np.random.random(int(1e6))
bins = np.arange(10000)

%timeit custom_round(sHatV, bins)
# on a laptop: 100 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
