Pandas dataframe: return row AND column of maximum value(s)

I have a dataframe in which all values are of the same variety (e.g. a correlation matrix, but one where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.

I can get the max across rows or columns by changing the first argument of

df.idxmax()

however, I haven't found a suitable way to return the row/column index of the max of the whole dataframe.

For example, I can do this in numpy:

>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))

But when I try something similar in pandas, I get a masked frame rather than the labels:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
    a   b   c
d NaN NaN NaN
e NaN   9 NaN
f NaN NaN NaN

At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.

E.g. for the above I'd like a function which does:

>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series

or even just

>>>topn(df,3)
(['b','c','b'],['e','f','f'])

a la numpy.where()

What you want to use is stack:

df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)  # Series.sort() was removed from pandas
df.head(4)

e  b    9
f  c    8
   b    7
   a    6
dtype: int64
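
If only the labels are needed rather than the sorted values, the stacked Series can be queried directly. A minimal sketch, assuming a reasonably recent pandas; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

# Stacking gives a Series whose index holds (row label, column label) pairs
stacked = df.stack()

# Label pair of the single largest value
row, col = stacked.idxmax()

# Label pairs of the three largest values, largest first
top3 = stacked.nlargest(3).index.tolist()

print(row, col)   # e b
print(top3)       # [('e', 'b'), ('f', 'c'), ('f', 'b')]
```

`nlargest` avoids sorting the whole Series when only a few values are wanted.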

I figured out the first part:

npa = df.to_numpy()  # as_matrix() was removed from pandas
rows,cols = np.where(npa == np.amax(npa))  # np.where returns row positions first
([df.columns[c] for c in cols],[df.index[r] for r in rows])

Now I need a way to get the top n. One naive idea is to copy the array and iteratively replace the top values with NaN, grabbing the index as you go. That seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here, there is, through argpartition, but we have to use flattened indexing.

def topn(df,n):
    npa = df.to_numpy()  # as_matrix() was removed from pandas
    topn_ind = np.argpartition(npa,-n,None)[-n:] #flattened (row-major) ind, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1] #arg sort in descending order
    rows,cols = np.unravel_index(topn_ind,npa.shape) #unflatten in the same row-major order, so non-square frames work too
    return ([df.columns[c] for c in cols],[df.index[r] for r in rows])

Trying this on the example:

>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])

As desired. Mind you, the sorting was not originally asked for, but it adds little overhead if n is not large.
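
For the Series form requested in the question, the same argpartition idea can be wrapped slightly differently. This is a minimal sketch, assuming a pandas version that provides `to_numpy()`; `topn_series` is a hypothetical name, not part of pandas:

```python
import numpy as np
import pandas as pd

def topn_series(df, n):
    """Hypothetical helper: map column label -> row label for the n largest values."""
    values = df.to_numpy()
    # Flat (row-major) positions of the n largest values, in no particular order
    flat = np.argpartition(values, -n, axis=None)[-n:]
    # Order those n positions from largest to smallest value
    flat = flat[np.argsort(values.flat[flat])][::-1]
    # Convert flat positions back to (row, column) positions
    rows, cols = np.unravel_index(flat, values.shape)
    return pd.Series(df.index[rows], index=df.columns[cols])

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))
result = topn_series(df, 3)
print(result)
```

Because the flat positions are unraveled in the same row-major order in which they were produced, the sketch does not rely on the frame being square.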

I guess a DataFrame might not be the best choice for what you are trying to do, since the idea of the columns in a DataFrame is to hold independent data.

>>> def topn(df,n):
       # pull the data out of the DataFrame
       # and flatten it column by column to an array
       vals = df.values.flatten(order='F')
       # next we sort the array and store the sort order
       p = np.argsort(vals)
       # create two arrays with the column names and indexes
       # in the same order as vals
       cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
       # repeat the index once per column (not once per row), so non-square frames work
       idxs = np.array([list(df.index) for col in df.columns]).flatten()
       # take the last n entries in reverse order and return cols and idxs
       return cols[p][:-(n+1):-1],idxs[p][:-(n+1):-1]

>>> topn(df,3)
(array(['b', 'c', 'b'], 
      dtype='|S1'),
 array(['e', 'f', 'f'], 
      dtype='|S1'))


>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop

watsonics' solution takes slightly less time:

%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop

but both are way faster than stack:

def topStack(df,n):
    df = df.stack()
    df = df.sort_values(ascending=False)  # Series.sort() was removed from pandas
    return df.head(n)

 %timeit(topStack(df,3))
 1000 loops, best of 3: 1.91 ms per loop
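
Outside IPython, a comparison like this can be repeated with the standard `timeit` module. The following is only a sketch: `top_numpy` and `top_stack` are hypothetical names standing in for the approaches above, and the numbers will differ from those quoted depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]],
                  columns=list('abc'), index=list('def'))

def top_numpy(df, n):
    # argpartition on the flattened values, then sort only the n winners
    values = df.to_numpy()
    flat = np.argpartition(values, -n, axis=None)[-n:]
    flat = flat[np.argsort(values.flat[flat])][::-1]
    rows, cols = np.unravel_index(flat, values.shape)
    return list(df.columns[cols]), list(df.index[rows])

def top_stack(df, n):
    # stack into a Series, sort every value, keep the first n
    return df.stack().sort_values(ascending=False).head(n)

runs = 1000
for fn in (top_numpy, top_stack):
    seconds = timeit.timeit(lambda: fn(df, 3), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e6:.1f} us per call")
```

On a 3x3 frame the timings are dominated by fixed overhead, so a larger frame would be needed to say much about how either approach scales.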
