
Get column indices of row-wise maximum values of a 2D array (with random tie-breaking)

Given a 2D numpy array, I want to construct an array out of the column indices of the maximum value of each row. So far, arr.argmax(1) works well. However, in my specific case, two or more columns in some rows may contain the maximum value. In that case, I want to select one of those column indices at random (not the first index, which is what .argmax(1) returns).

For example, for the following arr:

import numpy as np

arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2]
])

there are two possible outcomes: array([1, 0, 2, 0]) and array([1, 1, 2, 0]), each chosen with probability 1/2.

I have code that returns the expected output using a list comprehension:

idx = np.arange(arr.shape[1])
ans = [np.random.choice(idx[ix]) for ix in arr == arr.max(1, keepdims=True)]

but I'm looking for an optimized numpy solution. In other words, how do I replace the list comprehension with numpy methods so that the code scales to bigger arrays?

Use scipy.stats.rankdata and np.apply_along_axis as follows.

import numpy as np
from scipy.stats import rankdata
ranks = rankdata(-arr, axis = 1, method = "min")
func = lambda x: np.random.choice(np.where(x==1)[0])
idx = np.apply_along_axis(func, 1, ranks)

print(idx)

It prints [1 0 2 0] or [1 1 2 0].

The main idea is that rankdata computes the rank of every value within each row. Because the array is negated and method = "min" is used, every occurrence of the row maximum gets rank 1. func randomly chooses one of the indices whose rank is 1. Finally, apply_along_axis applies func to every row of ranks.
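To make the ranking step concrete, here is a minimal sketch that computes the ranks for the example array from the question. It assumes a SciPy version in which rankdata accepts the axis argument; the name is_max is only illustrative and does not appear in the answer.

```python
import numpy as np
from scipy.stats import rankdata

arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2],
])

# Negating arr makes the largest value in each row receive rank 1;
# method="min" gives all tied values the same (lowest) rank.
ranks = rankdata(-arr, axis=1, method="min")
print(ranks)

# Columns holding the row maximum are exactly those with rank 1.
is_max = ranks == 1
print(is_max)
```

For the row [1, 1, 0], both ones are tied for the maximum, so both get rank 1 and the zero gets rank 3. That is why the random choice in func has two candidates for this row.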

After some advice I got offline, it turns out that the choice among tied maximum values can be randomized by multiplying the boolean array that flags the row-wise maximum values by a random array of the same shape. What remains is a simple argmax(1) call.

# boolean array that flags maximum values of each row
mxs = arr == arr.max(1, keepdims=True)
# random array where non-maximum values are zero and maximum values are random values
random_arr = np.random.rand(*arr.shape) * mxs
# column index of the row-wise maximum of the auxiliary array
ans = random_arr.argmax(1)
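
As a sanity check on the tie-breaking, the following sketch wraps the same idea in a function and samples it repeatedly on the example array. It is a variant, not the exact code above: the function name random_argmax and the rng parameter are hypothetical, it uses np.random.default_rng instead of np.random.rand, and it fills non-maximum entries with -1.0 rather than 0 so that a random draw of exactly 0.0 (possible in principle, since draws come from [0, 1)) cannot lose to a non-maximum column.

```python
import numpy as np

def random_argmax(arr, rng=None):
    """Column index of each row's maximum, ties broken uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    # Boolean mask flagging every occurrence of the row-wise maximum.
    is_max = arr == arr.max(axis=1, keepdims=True)
    # Maximum entries get a random score in [0, 1); all others get -1.0,
    # which is below any possible score, so they can never be selected.
    scores = np.where(is_max, rng.random(arr.shape), -1.0)
    return scores.argmax(axis=1)

arr = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 3],
    [3, 2, 2],
])

rng = np.random.default_rng(0)
samples = np.stack([random_argmax(arr, rng) for _ in range(2000)])

# Rows 0, 2 and 3 have a unique maximum; row 1 is tied between columns 0 and 1,
# so the fraction of samples picking column 1 should be close to 0.5.
print(samples[:, 1].mean())
```

Because the scores of tied entries are independent draws from the same continuous distribution, each tied column is equally likely to hold the largest score, which is what makes the tie-breaking uniform.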

A timeit test shows that for data of shape (507_563, 12), this code runs in ~172 ms on my machine while the loop in the question takes about 11 s, so this is about 63x faster.
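
A comparison along these lines can be reproduced with a small harness like the one below. This is only a sketch: the test data is synthetic and much smaller than the shape quoted above, the function names loop_version and vectorized_version are made up for illustration, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test data: small integers so that ties are common.
arr = rng.integers(0, 4, size=(20_000, 12))

def loop_version(arr):
    # List comprehension from the question.
    idx = np.arange(arr.shape[1])
    mask = arr == arr.max(1, keepdims=True)
    return np.array([np.random.choice(idx[row]) for row in mask])

def vectorized_version(arr):
    # Random-mask approach from the answer.
    mxs = arr == arr.max(1, keepdims=True)
    return (np.random.rand(*arr.shape) * mxs).argmax(1)

# Average wall-clock time per call over a few repetitions.
t_loop = timeit.timeit(lambda: loop_version(arr), number=3) / 3
t_vec = timeit.timeit(lambda: vectorized_version(arr), number=3) / 3
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```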
