How to solve this using numpy vectorization

I have a really big input numpy array and a dictionary. The dictionary dictates what the values in the numpy array should be updated to. I can do it using a for loop, but it is very time-consuming. Can I use numpy vectorization to solve this?

Input:

arr_to_check = numpy.array([['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]]) # actual length is ~10^8
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90} # actual length is ~10^3

Expected result:

[['A', '20'], # do not change, because 20 < 25 --- max possible for 'A' is 25.
['B', '0'], # change to 0, because 100 > 40 --- max possible for 'B' is 40.
['C', '80'], # do not change, because 80 < 90
['D', '0'], # change to 0, because 90 > 50 --- max possible for 'D' is 50.
['E', '100']] # do not change, because 'E' is not in max_possible

Here is the loop solution:

for i in range(arr_to_check.shape[0]):
    row = arr_to_check[i]
    if row[0] in max_possible and int(row[1]) > max_possible[row[0]]:
        row[1] = 0

Here is a way to do what you've asked (UPDATED to simplify the code).

A few notes first:

  • numpy arrays must be of homogeneous type, so the numbers you show in your question will be converted by numpy to strings to match the data type of the labels (if pandas is an option, it would let you keep numeric columns alongside separate string columns).
  • Though I have taken the result all the way through to match the original homogeneous data type (string), you can stop early and use the intermediate 1D numerical results if that's all you need.
  • I have used int as the numeric type, and you can change this to float if required.
import numpy
arr_to_check = numpy.array([['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]])
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
print('arr_to_check:'); print(arr_to_check)

aT = arr_to_check.T
labels = aT[0,:]
values = aT[1,:].astype(int)
print('labels:'); print(labels)
print('values:'); print(values)

for label, value in max_possible.items():
    curMask = (labels == label)
    values[curMask] *= (values[curMask] <= value)
print('values:'); print(values)

aT[1,:] = values
arr_to_check = aT.T
print('arr_to_check:'); print(arr_to_check)

Input:

arr_to_check:
[['A' '20']
 ['B' '100']
 ['C' '80']
 ['D' '90']
 ['E' '100']]

Output:

labels:
['A' 'B' 'C' 'D' 'E']
values:
[ 20 100  80  90 100]
values:
[ 20   0  80   0 100]
arr_to_check:
[['A' '20']
 ['B' '0']
 ['C' '80']
 ['D' '0']
 ['E' '100']]

Explanation:

  • Transpose the input so that we can use vectorized operations directly on the numeric vector (values).
  • Iterate over each key/value pair in max_possible. For the rows whose label (in labels) matches the key, use a vectorized formula to multiply values by 0 wherever the limit in max_possible has been exceeded.
  • Update the original numpy array using values.
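The approach above still loops once per dictionary key. With about 10^3 keys and 10^8 rows, a single lookup pass may be preferable. The following is only a sketch of one alternative, not part of the original answer: it sorts the keys and uses np.searchsorted to find each row's limit. The names keys, limits, pos, found and row_limits are illustrative.

```python
import numpy as np

labels = np.array(['A', 'B', 'C', 'D', 'E'])
values = np.array([20, 100, 80, 90, 100])
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}

# Sorted keys and their limits, aligned by position.
sorted_keys = sorted(max_possible)
keys = np.array(sorted_keys)
limits = np.array([max_possible[k] for k in sorted_keys], dtype=float)

# Position where each label would be inserted into the sorted keys.
pos = np.searchsorted(keys, labels)
pos = np.clip(pos, 0, len(keys) - 1)   # keep indices in range
found = keys[pos] == labels            # True only for labels present in the dict

# Labels missing from the dict get an infinite limit, so they never change.
row_limits = np.where(found, limits[pos], np.inf)
result = np.where(values > row_limits, 0, values)
print(result)
```

Here every step is a whole-array operation, so the cost no longer grows with the number of dictionary keys beyond the initial sort.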

As others have pointed out, numpy arrays are homogeneous, so your output elements will all be str. If that is OK, you can use apply_along_axis:

t = lambda x: [x[0],0] if  x[0] in max_possible and int(x[1]) > max_possible[x[0]] else x
numpy.apply_along_axis(t, 1, arr_to_check)
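If string-typed output is not acceptable, pandas (mentioned as an option in the earlier answer) keeps labels and numbers in separately typed columns. A minimal sketch, assuming pandas is available; the column names label and value are illustrative and not from the question:

```python
import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C', 'D', 'E'],
                   'value': [20, 100, 80, 90, 100]})
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}

# Look up each row's limit; labels missing from the dict become NaN.
limits = df['label'].map(max_possible)

# Comparisons against NaN are False, so rows without a limit stay unchanged.
df.loc[df['value'] > limits, 'value'] = 0
print(df)
```

The value column stays an integer column throughout, so no conversion back from strings is needed.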

As others said, you should use only numbers in your numpy array. So you could have your data like this:

arr_to_check = np.array([[0, 20],[1, 100],[2, 80],[3, 90], [4, 100]])
max_possible = np.array([25, 40, 90, 50, np.inf, 100, 90])

Here I have assumed 'A': 0, 'B': 1, ... Note that this way, not only have the strings been replaced by numbers, but the dict has also been replaced by a numpy array in which max_possible[i] is the max for the i-th string, which simplifies the subsequent operations.

Now, you obtain what you want with:

m = max_possible.take(arr_to_check.T[0]) 
m1 = np.array([arr_to_check.T[0], np.minimum(arr_to_check.T[1], m)]) 
m1.T
  • The 1st line stores in m the max value for each key.

  • The 2nd line stores in m1 your keys as the first row, and the minimum of each value and its key's max as the second row.

  • The 3rd line transposes to give your result:

    array([[ 0., 20.], [ 1., 40.], [ 2., 80.], [ 3., 50.], [ 4., 100.]])
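Note that np.minimum caps each value at its limit (40 and 50 here), whereas the question's expected result sets such values to 0. If zeroing is what is wanted, a small variation using np.where might look like the sketch below. It assumes the same integer encoding as above; the names codes, values, limits and zeroed are illustrative.

```python
import numpy as np

arr_to_check = np.array([[0, 20], [1, 100], [2, 80], [3, 90], [4, 100]])
max_possible = np.array([25, 40, 90, 50, np.inf, 100, 90])

codes = arr_to_check[:, 0]
values = arr_to_check[:, 1]

# Per-row limit, looked up by integer code.
limits = max_possible[codes]

# Zero the values that exceed their limit; leave the rest untouched.
zeroed = np.column_stack([codes, np.where(values > limits, 0, values)])
print(zeroed)
```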

Running your code:

In [362]: %%timeit arr = arr_to_check.copy()
     ...: for i in range(arr.shape[0]):
     ...:     row = arr[i]
     ...:     if row[0] in max_possible and int(row[1]) > max_possible[row[0]]:
     ...:         row[1] = 0
     ...:         
14.1 µs ± 203 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Iterating like this on an array is slower than working with lists, so let's try a pure list solution:

In [372]: alist_to_check = [['A', 20],['B', 100],['C', 80],['D', 90], ['E', 100]]
     ...: max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}

Using a list comprehension with an if/else expression:

In [373]: [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in alist_to_check]
Out[373]: [['A', 20], ['B', 0], ['C', 80], ['D', 0], ['E', 100]]

In [374]: timeit [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in alist_to_check]
1.45 µs ± 3.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

One of the answers suggested apply_along_axis, with the keys redefined as integers. My timing came out at:

In [366]: timeit np.apply_along_axis(t, 1, arr_to_check)
108 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

For a small example, the pure list approach is fastest. For a really large case we can probably cast it as a numpy problem that scales better, but I haven't looked at those options.

with structured array

We could turn the list into a structured array. This preserves the string and int dtypes:

In [398]: arr = np.array([tuple(kv) for kv in alist_to_check],'U10,int')

In [399]: arr
Out[399]: 
array([('A',  20), ('B', 100), ('C',  80), ('D',  90), ('E', 100)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])

In [400]: arr['f0']
Out[400]: array(['A', 'B', 'C', 'D', 'E'], dtype='<U10')

In [401]: arr['f1']
Out[401]: array([ 20, 100,  80,  90, 100])

If max_possible is small relative to the list, it could be most efficient to iterate over its items and set the corresponding elements of the structured array. For example:

def foo(alist):
    arr = np.array([tuple(kv) for kv in alist],'U10,int')
    for k,v in max_possible.items():
        idx = np.nonzero((arr['f0']==k) & (arr['f1']>v))[0]
        arr['f1'][idx] = 0
    return arr

In [395]: foo(alist_to_check)
Out[395]: 
array([('A',  20), ('B',   0), ('C',  80), ('D',   0), ('E', 100)],
      dtype=[('f0', '<U10'), ('f1', '<i4')])

For this sample, the times aren't that great:

In [397]: timeit foo(alist_to_check)
102 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

For a big list:

In [403]: biglist = alist_to_check*10000

In [409]: timeit foo(biglist)
44.1 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [410]: timeit [[k,0] if k in max_possible and v>max_possible[k] else [k,v] for k,v in biglist]
14.8 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The time still isn't that great. However, a big chunk of that is spent creating the structured array:

In [411]: timeit arr = np.array([tuple(kv) for kv in biglist],'U10,int')
38.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If we already had the structured array, I expect the times would be much better.
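One way to check that expectation is to build the structured array once and time only the masking step. The following is a sketch, not a measurement from the original answer. The helper name zero_over_limit is hypothetical, and the dtype is spelled out as a field list with the same f0/f1 names as above.

```python
import timeit
import numpy as np

alist_to_check = [['A', 20], ['B', 100], ['C', 80], ['D', 90], ['E', 100]]
max_possible = {'A': 25, 'B': 40, 'C': 90, 'D': 50, 'F': 100, 'G': 90}
biglist = alist_to_check * 10000

# Build the structured array once, outside the timed code.
big_arr = np.array([tuple(kv) for kv in biglist],
                   dtype=[('f0', 'U10'), ('f1', 'i8')])

def zero_over_limit(arr):
    """Zero the 'f1' field in place where it exceeds the limit for its 'f0' label."""
    for k, v in max_possible.items():
        arr['f1'][(arr['f0'] == k) & (arr['f1'] > v)] = 0
    return arr

# Copy inside the timed call so every run starts from the same data.
seconds = timeit.timeit(lambda: zero_over_limit(big_arr.copy()), number=10) / 10
print(f"{seconds * 1e3:.2f} ms per call")
```

The reported time includes the copy, so it slightly overstates the cost of the masking itself.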

Curiously, making a pure string dtype array from that biglist takes even longer:

In [412]: timeit np.array(biglist)
74.2 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Still, this does make it clear that when working with a dict and string matching, lists remain competitive with numpy solutions. numpy is best for purely numeric work.
