Pandas DataFrame: return the row and column of the maximum value

Question

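I have a DataFrame in which all values are the same variable (e.g. a correlation matrix, but one where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or across columns by changing the first argument of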

df.idxmax()

but I haven't found a suitable way to return the row/column index of the max of the whole DataFrame.

For example, I can do this in numpy:

>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas, I only get a masked frame rather than the labels:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN

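As a second step, what I'd really like to do is return the rows and columns of the top n values, e.g. as a Series.

E.g. for the above I'd like a function: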

>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series

or even just

>>>topn(df,3)
(['b','c','b'],['e','f','f'])

a la numpy.where()

Answer 1

What you want to use is stack:

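df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()  # Series indexed by (row label, column label)
df = df.sort_values(ascending=False)  # Series.sort() was removed from pandas; sort_values returns a new Series
df.head(4)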

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
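
Since the stacked Series is indexed by (row label, column label) pairs, the location of the single maximum and of the top n values can be read straight off its index. The following is a minimal sketch of that idea, not part of the original answer; the variable names are illustrative, and it assumes the values are unique so that the ordering is unambiguous.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list("abc"), index=list("def"))

# stack() turns the frame into a Series with a (row, column) MultiIndex
stacked = df.stack()

# idxmax() on that Series returns the (row, column) label pair of the maximum
max_row, max_col = stacked.idxmax()

# nlargest() returns the n largest values in descending order,
# still labelled by (row, column)
top3 = stacked.nlargest(3)
top3_labels = list(top3.index)

print(max_row, max_col)
print(top3_labels)
```

The result stays a pandas Series, which matches the "return it as a Series" form asked for in the question.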

Answer 2

I figured out the first part:

npa = df.to_numpy()  # as_matrix() was removed from pandas
rows,cols = np.where(npa == np.amax(npa))  # np.where returns (row positions, column positions)
([df.columns[c] for c in cols],[df.index[r] for r in rows])

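Now I need a way to get the top n. A naive idea is to copy the array and iteratively replace the top value with NaN, grabbing the index each time. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately there is, through argpartition, but we have to use flattened indexing.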

def topn(df,n):
    npa = df.to_numpy()  # as_matrix() was removed from pandas
    topn_ind = np.argpartition(npa,-n,axis=None)[-n:] #flattened (row-major) indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1] #arg sort in descending order
    rows,cols = np.unravel_index(topn_ind,npa.shape) #unflatten with the same row-major ordering, so non-square frames work too
    return ([df.columns[c] for c in cols],[df.index[r] for r in rows])

Trying it on the example:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As expected. Note that the sorting was not originally asked for, but it adds little overhead if n is not large.
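
The 3x3 example cannot reveal a row/column mix-up, because a square array accepts either ordering. As a sanity check, here is a sketch of the same argpartition approach run on a non-square frame. The helper name `topn_positions` and the sample data are assumptions made for illustration; the flat indices are unravelled in NumPy's default row-major order.

```python
import numpy as np
import pandas as pd

def topn_positions(df, n):
    """Return (column labels, row labels) of the n largest values, descending."""
    values = df.to_numpy()
    # indices into the flattened (row-major) array; the last n are the n largest, unsorted
    flat = np.argpartition(values, -n, axis=None)[-n:]
    # order those n indices by value, largest first
    flat = flat[np.argsort(values.ravel()[flat])][::-1]
    # convert flat indices back to (row position, column position)
    rows, cols = np.unravel_index(flat, values.shape)
    return list(df.columns[cols]), list(df.index[rows])

# 2 rows x 4 columns, all values distinct
wide = pd.DataFrame([[1, 9, 3, 4], [8, 2, 7, 5]],
                    columns=list("abcd"), index=list("xy"))
result = topn_positions(wide, 3)
print(result)
```

Here the three largest values are 9 at (x, b), 8 at (y, a) and 7 at (y, c), so the columns come back as b, a, c and the rows as x, y, y.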

Answer 3

I think a DataFrame may not be the best choice for what you want to do, since the idea of the columns in a DataFrame is to hold independent data.

>>> def topn(df,n):
       # pull the data out of the DataFrame
       # and flatten it to an array in column-major order
       vals = df.values.flatten(order='F')
       # next we sort the array and store the sort mask
       p = np.argsort(vals)
       # create two arrays with the column names and indexes
       # in the same order as vals: each column name repeated once per row,
       # and the full index repeated once per column
       cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
       idxs = np.array([list(df.index) for col in df.columns]).flatten()
       # sort and return cols, and idxs
       return cols[p][:-(n+1):-1],idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonics' solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but both are faster than stack:

def topStack(df,n):
    df = df.stack()
    df = df.sort_values(ascending=False)  # Series.sort() was removed from pandas
    return df.head(n)

 %timeit(topStack(df,3))
 1000 loops, best of 3: 1.91 ms per loop
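
The timings above come from an IPython session and will vary with the machine, data size and library versions. To repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module could be used. The function names `top_argpartition` and `top_stack` are illustrative assumptions, and no particular timings are implied.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list("abc"), index=list("def"))

def top_argpartition(df, n):
    # NumPy route: partition the flattened array, then sort only the top n
    values = df.to_numpy()
    flat = np.argpartition(values, -n, axis=None)[-n:]
    flat = flat[np.argsort(values.ravel()[flat])][::-1]
    rows, cols = np.unravel_index(flat, values.shape)
    return list(df.columns[cols]), list(df.index[rows])

def top_stack(df, n):
    # pandas route: stack to a (row, column)-indexed Series and sort all of it
    return df.stack().sort_values(ascending=False).head(n)

# time 200 calls of each; absolute numbers depend on the environment
t_numpy = timeit.timeit(lambda: top_argpartition(df, 3), number=200)
t_stack = timeit.timeit(lambda: top_stack(df, 3), number=200)
print(f"argpartition: {t_numpy:.4f}s  stack: {t_stack:.4f}s")
```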


Answer 1
4 2014-11-12 14:05:01

Answer 2
3 2014-11-12 09:50:40

Answer 3
1 2014-11-12 07:17:16
