Get top biggest values from each column of the pandas.DataFrame

Question

Here is my pandas.DataFrame:

import pandas as pd
data = pd.DataFrame({
  'first': [40, 32, 56, 12, 89],
  'second': [13, 45, 76, 19, 45],
  'third': [98, 56, 87, 12, 67]
}, index = ['first', 'second', 'third', 'fourth', 'fifth'])

I want to create a new DataFrame that contains the top 3 values from each column of my data DataFrame.

Here is an expected output:

   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67

How can I do that?

Answer 1

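Create a function that returns the top num values of a Series (named top_n here so that it does not shadow the built-in sorted):

def top_n(s, num):
    # Sort descending and keep the first num values
    # (s.order(...) in very old pandas versions)
    tmp = s.sort_values(ascending=False)[:num]
    # Renumber the rows from 0 so the columns line up
    tmp.index = range(len(tmp))
    return tmp

Apply it to each column of your data set:

In [1]: data.apply(lambda x: top_n(x, 3))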
Out[1]:
   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67

Answer 2

With numpy you can get an array of the top 3 values of each column as follows:

>>> import numpy as np
>>> col_ind = np.argsort(data.values, axis=0)[::-1,:]
>>> ind_to_take = col_ind[:3,:] + np.arange(data.shape[1])*data.shape[0]
>>> np.take(data.values.T, ind_to_take)
array([[89, 76, 98],
       [56, 45, 87],
       [40, 45, 67]], dtype=int64)

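You can convert the result back to a DataFrame (in an interactive session, _ holds the last output; the row labels are simply the first three labels of data.index, reused as positions):

>>> pd.DataFrame(_, columns=data.columns, index=data.index[:3])
        first  second  third
first      89      76     98
second     56      45     87
third      40      45     67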
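
The flat-index arithmetic above can be hard to follow. As a hedged alternative sketch (not part of the original answer), np.take_along_axis, available in NumPy 1.15 and later, picks the same values directly along axis 0. The names values, top_rows, top_values and top3 are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

values = data.to_numpy()

# Row positions of each column's values, largest first; keep the top 3.
top_rows = np.argsort(values, axis=0)[::-1][:3]

# For every (i, j), pick values[top_rows[i, j], j]; no flat indices needed.
top_values = np.take_along_axis(values, top_rows, axis=0)

# A fresh 0..2 index avoids reusing row labels that no longer apply.
top3 = pd.DataFrame(top_values, columns=data.columns)
print(top3)
```

This still sorts every column in full, so it has the same complexity as the original; it only simplifies the indexing step.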

Answer 3

The other solutions (at the time of writing) sort each column, which is super-linear in the column length, but the top k values can be found in linear time per column.

numpy.partition places the k smallest elements in the first k positions, in no guaranteed order. To get the k largest elements (also unordered, so sort them afterwards if the order matters), negate the values before and after partitioning:

import numpy as np

-np.partition(-v, k)[: k]

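Combining this with a dictionary comprehension, and sorting only the 3 selected values so that each column comes out in descending order, we can use:

>>> pd.DataFrame({c: -np.sort(np.partition(-data[c].values, 3)[:3]) for c in data.columns})
   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67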
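
As a rough sketch of the same idea applied to the whole frame at once, the helper below partitions along axis 0 and then sorts only the k selected rows. The function name top_k_per_column is hypothetical, and the sketch assumes all columns are numeric and 1 <= k <= len(df). It partitions from the top end instead of negating, which also sidesteps negation problems with unsigned integer columns.

```python
import numpy as np
import pandas as pd


def top_k_per_column(df, k):
    """Return the k largest values of each column, largest first (hypothetical helper)."""
    values = df.to_numpy()
    n = values.shape[0]
    # After partitioning at position n - k, the last k rows of each column
    # hold that column's k largest values, in no particular order.
    largest = np.partition(values, n - k, axis=0)[n - k:]
    # Sorting just these k rows costs O(k log k) per column.
    largest = np.sort(largest, axis=0)[::-1]
    return pd.DataFrame(largest, columns=df.columns)


data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

top3 = top_k_per_column(data, 3)
print(top3)
```

For small frames like this one the difference from a full sort is negligible; the linear-time selection only pays off when columns are long and k is small.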

Answer 4

Alternative pandas solution:

In [6]: N = 3

In [7]: pd.DataFrame([data[c].nlargest(N).values.tolist() for c in data.columns],
   ...:              index=data.columns,
   ...:              columns=['{}_largest'.format(i) for i in range(1, N+1)]).T
   ...:
Out[7]:
           first  second  third
1_largest     89      76     98
2_largest     56      45     87
3_largest     40      45     67

Answer 5

Use nlargest:

In [1594]: pd.DataFrame({c: data[c].nlargest(3).values for c in data})
Out[1594]:
   first  second  third
0     89      76     98
1     56      45     87
2     40      45     67

where data is:

In [1603]: data
Out[1603]:
        first  second  third
first      40      13     98
second     32      45     56
third      56      76     87
fourth     12      19     12
fifth      89      45     67
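
The .values in the comprehension matters because nlargest keeps the original row labels. A small sketch of what happens with and without it follows; the variable names labelled, misaligned and top3 are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

# nlargest returns a Series that still carries the original row labels.
labelled = data['first'].nlargest(3)
print(labelled)

# Without stripping the labels, pandas aligns the columns on them,
# and rows missing from a column's top 3 are filled with NaN.
misaligned = pd.DataFrame({c: data[c].nlargest(3) for c in data})
print(misaligned)

# Converting to plain arrays drops the labels, so rows line up by position.
top3 = pd.DataFrame({c: data[c].nlargest(3).to_numpy() for c in data})
print(top3)
```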

solution1
9 ACCEPTED 2013-12-09 18:25:43

solution2
3 2013-12-09 18:14:42

solution3
1 2015-05-27 00:39:06

solution4
0 2016-10-16 19:21:55

solution5
0
