Pandas dataframe: return row AND column of maximum value(s)
I have a dataframe in which all values are of the same variety (e.g. a correlation matrix, but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however, I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>> npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>> np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> df.where(df == df.max().max())
     a    b    c
d  NaN  NaN  NaN
e  NaN    9  NaN
f  NaN  NaN  NaN
At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>> topn(df,3)
b    e
c    f
b    f
dtype: object
>>> type(topn(df,3))
pandas.core.series.Series
or even just
>>> topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
What you want to use is stack:
df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
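df = df.stack()  # Series indexed by (row label, column label)
df = df.sort_values(ascending=False)  # Series.sort() is gone from current pandas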
df.head(4)
e  b    9
f  c    8
   b    7
   a    6
dtype: int64
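Building on the stacked Series, here is a minimal sketch that covers both parts of the question. It assumes a pandas version that provides `Series.idxmax` and `Series.nlargest`; the helper name `topn_stack` is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# Stacking yields a Series whose index is (row label, column label).
stacked = df.stack()

# idxmax on that Series returns the label pair of the overall maximum.
max_row, max_col = stacked.idxmax()

def topn_stack(frame, n):
    """Hypothetical helper: the n largest values, labelled by (row, column)."""
    return frame.stack().nlargest(n)

top3 = topn_stack(df, 3)
print(max_row, max_col)
print(top3)
```

The labels can then be read from `top3.index`, which holds one (row, column) tuple per value.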
I figured out the first part:
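npa = df.to_numpy()  # as_matrix() is no longer available in current pandas
indx,cols = np.where(npa == np.amax(npa))  # np.where returns (row positions, column positions)
([df.columns[c] for c in cols],[df.index[i] for i in indx])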
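Since a unique maximum is expected, an alternative sketch is to take `np.argmax` of the array (which works on the flattened data) and convert that flat position back to a row/column pair with `np.unravel_index`. This assumes `DataFrame.to_numpy()` is available (pandas 0.24 or later); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

arr = df.to_numpy()

# argmax without an axis returns a position in the flattened (row-major) array;
# unravel_index turns it back into (row position, column position).
row_pos, col_pos = np.unravel_index(np.argmax(arr), arr.shape)

max_location = (df.index[row_pos], df.columns[col_pos])
print(max_location)
```

If several cells tie for the maximum, `np.argmax` reports only the first one, so the `np.where` comparison is the safer choice in that case.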
Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately there is, through argpartition, but we have to use flattened indexing.
def topn(df,n):
    npa = df.to_numpy()  # as_matrix() is no longer available in current pandas
    topn_ind = np.argpartition(npa,-n,axis=None)[-n:]  # flattened indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # sort in descending order
    indx,cols = np.unravel_index(topn_ind,npa.shape)  # unflatten in row-major order, matching the flattening above
    return ([df.columns[c] for c in cols],[df.index[i] for i in indx])
Trying this on the example:
>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.
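To get the Series form asked for in the question, and to check the flattened indexing on a frame that is not square, the same argpartition idea can be sketched as follows. This assumes `DataFrame.to_numpy()` is available; `topn_series` and the sample frame `wide` are hypothetical names and data chosen for illustration.

```python
import numpy as np
import pandas as pd

def topn_series(frame, n):
    """Hypothetical helper: Series of row labels, indexed by column labels."""
    arr = frame.to_numpy()
    # Positions of the n largest values in the flattened (row-major) array.
    flat_ind = np.argpartition(arr, -n, axis=None)[-n:]
    # argpartition does not order them, so sort those n descending.
    flat_ind = flat_ind[np.argsort(arr.flat[flat_ind])][::-1]
    # Convert flat positions back to (row, column) positions.
    rows, cols = np.unravel_index(flat_ind, arr.shape)
    return pd.Series([frame.index[r] for r in rows],
                     index=[frame.columns[c] for c in cols])

# Illustrative 2x4 frame: the largest values are 12 (y, c), 10 (x, b), 8 (y, d).
wide = pd.DataFrame([[1, 10, 3, 4], [5, 6, 12, 8]],
                    columns=list('abcd'), index=list('xy'))
result = topn_series(wide, 3)
print(result)
```

Unraveling with the default row-major order matters here: it has to match the order in which `argpartition` flattened the array, otherwise row and column positions are swapped on non-square data.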
I guess a DataFrame might not be the best choice for what you are trying to do, since the idea of the columns in a DataFrame is to hold independent data.
>>> def topn(df,n):
...     # pull the data out of the DataFrame
...     # and flatten it to an array (column-major)
...     vals = df.values.flatten(order='F')
...     # next we sort the array and store the sort mask
...     p = np.argsort(vals)
...     # create two arrays with the column names and indexes
...     # in the same order as vals
...     cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
...     idxs = np.array([list(df.index) for _ in df.columns]).flatten()  # one copy of the index per column
...     # sort and return cols, and idxs
...     return cols[p][:-(n+1):-1],idxs[p][:-(n+1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
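The two label arrays can also be built with `np.repeat` and `np.tile` instead of nested list comprehensions. This is only a sketch of that variation, assuming `Index.to_numpy()` and `DataFrame.to_numpy()` are available (pandas 0.24 or later); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

n_rows, n_cols = df.shape

# Column-major flattening visits column a (rows d, e, f), then b, then c.
vals = df.to_numpy().flatten(order='F')
col_labels = np.repeat(df.columns.to_numpy(), n_rows)  # a a a b b b c c c
row_labels = np.tile(df.index.to_numpy(), n_cols)      # d e f d e f d e f

# Positions of the three largest values, largest first.
order = np.argsort(vals)[::-1][:3]
top_cols, top_rows = col_labels[order], row_labels[order]
print(top_cols, top_rows)
```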
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonics' solution takes slightly less time
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack
def topStack(df,n):
    df = df.stack()
    df = df.sort_values(ascending=False)  # Series.sort() is gone from current pandas
    return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
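The `%timeit` magic only works inside IPython. Outside it, a comparison along the same lines can be sketched with the standard-library `timeit` module. The function names `top_stack` and `top_partition` are hypothetical, the partition variant returns only the values to keep the sketch short, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

def top_stack(frame, n):
    # Stack-based approach: sort every value, keep the first n.
    return frame.stack().sort_values(ascending=False).head(n)

def top_partition(frame, n):
    # argpartition-based approach: only the top n values get sorted.
    arr = frame.to_numpy()
    flat = np.argpartition(arr, -n, axis=None)[-n:]
    flat = flat[np.argsort(arr.flat[flat])][::-1]
    return arr.flat[flat]

# Total seconds for 200 calls of each approach.
t_stack = timeit.timeit(lambda: top_stack(df, 3), number=200)
t_part = timeit.timeit(lambda: top_partition(df, 3), number=200)
print(f"stack: {t_stack:.4f}s  argpartition: {t_part:.4f}s  (200 calls each)")
```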