Pandas DataFrame: return the row and column of the maximum value

Question

I have a DataFrame in which all values are of the same kind (e.g. a correlation matrix, but one where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or columns by changing the first argument of

df.idxmax()

however, I haven't found a suitable way to return the row/column index of the max of the whole DataFrame.

For example, I can do this in numpy:

>>> import numpy as np
>>> npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>> np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas:

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN

At a second level, what I actually want to do is return the rows and columns of the top n values, e.g. as a Series.

E.g. for the above I'd like a function which does:

>>> topn(df,3)
b e
c f
b f
dtype: object
>>> type(topn(df,3))
pandas.core.series.Series

or even just

>>> topn(df,3)
(['b','c','b'],['e','f','f'])

à la numpy.where()

Answer 1

What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
# Series.sort() was removed from pandas; sort_values() returns a new sorted Series
df = df.sort_values(ascending=False)
df.head(4)

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
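
On recent pandas versions the same stacking idea can be written more compactly with `nlargest`, and `idxmax` on the stacked Series gives the labels of the single maximum directly. The following is only a minimal sketch, assuming a numeric DataFrame without missing values; the helper name `topn_stack` is hypothetical and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

def topn_stack(frame, n):
    """Return the n largest values as a Series indexed by (row, column)."""
    # stack() moves the column labels into an inner index level,
    # so every value is addressed by a (row label, column label) pair
    return frame.stack().nlargest(n)

top3 = topn_stack(df, 3)

# (row label, column label) of the overall maximum
max_row, max_col = df.stack().idxmax()
```

Here `top3.index` holds the (row, column) pairs in descending order of value, which can be unpacked into separate lists if the numpy.where()-style output is preferred.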

Answer 2

I figured out the first part:

npa = df.to_numpy()   # as_matrix() was removed from pandas
# np.where returns (row positions, column positions)
rows,cols = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols],[df.index[r] for r in rows])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately there is, through argpartition, but we have to use flattened indexing.

def topn(df,n):
    npa = df.to_numpy()   # as_matrix() was removed from pandas
    topn_ind = np.argpartition(npa,-n,axis=None)[-n:] # flattened (row-major) indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1] # argsort, then reverse for descending order
    rows,cols = np.unravel_index(topn_ind,npa.shape) # unflatten to (row, column) positions, row-major to match the flattening
    return ([df.columns[c] for c in cols],[df.index[r] for r in rows])

Trying this on the example:

>>> df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>> topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As desired. Note that the sorting was not originally asked for, but it adds little overhead if n is not large.
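
One detail worth checking is the axis order when unflattening. With `axis=None`, the flat positions refer to the row-major (C-order) flattened array, so `np.unravel_index` with its default order returns (row, column) positions. A small sketch on a non-square frame, where swapping the two axes would be visible; the frame `wide` and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2 rows x 4 columns, so a row/column mix-up cannot go unnoticed
wide = pd.DataFrame([[1, 2, 10, 4], [5, 6, 7, 8]],
                    columns=list('abcd'), index=list('xy'))
arr = wide.to_numpy()

# position of the maximum in the row-major flattened array
flat_pos = np.argmax(arr)

# default order='C' matches the flattening used by argmax/argpartition
row_pos, col_pos = np.unravel_index(flat_pos, arr.shape)

# translate positions back to labels
max_label = (wide.index[row_pos], wide.columns[col_pos])
```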

Answer 3

I guess a DataFrame might not be the best choice for what you are trying to do, since the idea behind the columns of a DataFrame is that they hold independent data.

>>> def topn(df,n):
...     # pull the data out of the DataFrame
...     # and flatten it to an array (column-major order)
...     vals = df.values.flatten(order='F')
...     # next we sort the array and store the sort mask
...     p = np.argsort(vals)
...     # create two arrays with the column names and indexes
...     # in the same order as vals
...     cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
...     # repeat the index once per column (not once per row)
...     idxs = np.array([list(df.index) for _ in df.columns]).flatten()
...     # sort and return cols and idxs, largest values first
...     return cols[p][:-(n+1):-1],idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))
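
The two label arrays can also be built with `np.repeat` and `np.tile`, which makes the column-major layout explicit. This is a sketch of the same idea rather than the answer's exact code; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# column-major flattening: all of column a, then b, then c
vals = df.to_numpy().flatten(order='F')

# each column label repeated once per row: a a a b b b c c c
cols = np.repeat(df.columns.to_numpy(), len(df.index))
# the whole index repeated once per column: d e f d e f d e f
idxs = np.tile(df.index.to_numpy(), len(df.columns))

# positions of the three largest values, largest first
order = np.argsort(vals)[::-1][:3]
top_cols, top_idxs = cols[order], idxs[order]
```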


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonics' solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

Both are much faster than the stack approach:

def topStack(df,n):
    df = df.stack()
    # Series.sort() was removed from pandas; use sort_values() instead
    df = df.sort_values(ascending=False)
    return df.head(n)

%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
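
Outside IPython, where the `%timeit` magic is unavailable, a comparison like this can be reproduced with the standard `timeit` module. The sketch below assumes the 3x3 example frame; the function names `topn_numpy` and `topn_stacked` are hypothetical, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

def topn_numpy(frame, n):
    """argpartition-based variant: returns (column labels, row labels)."""
    arr = frame.to_numpy()
    flat = np.argpartition(arr, -n, axis=None)[-n:]   # top n, unsorted
    flat = flat[np.argsort(arr.flat[flat])][::-1]     # sort descending
    rows, cols = np.unravel_index(flat, arr.shape)    # (row, column) positions
    return [frame.columns[c] for c in cols], [frame.index[r] for r in rows]

def topn_stacked(frame, n):
    """stack-based variant: returns a Series indexed by (row, column)."""
    return frame.stack().sort_values(ascending=False).head(n)

runs = 200
# average seconds per call over `runs` repetitions
t_numpy = timeit.timeit(lambda: topn_numpy(df, 3), number=runs) / runs
t_stack = timeit.timeit(lambda: topn_stacked(df, 3), number=runs) / runs
print(f"numpy: {t_numpy * 1e6:.1f} us per call, stack: {t_stack * 1e6:.1f} us per call")
```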

Answer 1
4 2014-11-12 14:05:01

Answer 2
3 2014-11-12 09:50:40

Answer 3
1 2014-11-12 07:17:16
