Here is my pandas.DataFrame:
import pandas as pd
data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67]
}, index=['first', 'second', 'third', 'fourth', 'fifth'])
I want to create a new DataFrame that contains the top 3 values from each column of my data DataFrame.
Here is an expected output:
first second third
0 89 76 98
1 56 45 87
2 40 45 67
How can I do that?
Create a function that returns the top values of a Series:
def top_n(s, num):
    # Sort descending and keep the first num values.
    tmp = s.sort_values(ascending=False)[:num]  # s.order(...) in older pandas versions
    # Replace the original labels with positions 0..num-1.
    tmp.index = range(num)
    return tmp
Apply it to your data set:
In [1]: data.apply(lambda x: top_n(x, 3))
Out[1]:
first second third
0 89 76 98
1 56 45 87
2 40 45 67
With numpy you can get an array of the top 3 values in each column as follows:
>>> import numpy as np
>>> col_ind = np.argsort(data.values, axis=0)[::-1,:]
>>> ind_to_take = col_ind[:3,:] + np.arange(data.shape[1])*data.shape[0]
>>> np.take(data.values.T, ind_to_take)
array([[89, 76, 98],
[56, 45, 87],
[40, 45, 67]], dtype=int64)
You can convert the result back to a DataFrame:
>>> pd.DataFrame(_, columns = data.columns, index=data.index[:3])
first second third
first 89 76 98
second 56 45 87
third 40 45 67
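The flat-index arithmetic above can be avoided. As a minimal sketch, assuming NumPy 1.15 or later (where np.take_along_axis is available), the same argsort result can be used to pick values column by column. The names values, order and top3 are only illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

values = data.to_numpy()
# Row positions of each column's values in descending order, truncated to 3.
order = np.argsort(values, axis=0)[::-1][:3]
# Pick the values at those positions, independently for each column.
top3 = np.take_along_axis(values, order, axis=0)
print(pd.DataFrame(top3, columns=data.columns))
```

This still sorts every column in full, so it only simplifies the indexing; it does not change the cost.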
The other solutions (at the time of writing) sort each column, which takes super-linear time per column, but the top values can be found in linear time per column.
numpy.partition places the k smallest elements in the first k positions, in no guaranteed order among themselves. To get the k largest elements (again in no guaranteed order), negate the values before and after partitioning:
import numpy as np
-np.partition(-v, k)[: k]
Combining this with a dictionary comprehension, we can use:
>>> pd.DataFrame({c: -np.partition(-data[c], 3)[: 3] for c in data.columns})
first second third
0 89 76 98
1 56 45 87
2 40 45 67
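Because np.partition does not order the k values it selects, the rows of the result above are not guaranteed to come out in descending order. A minimal sketch of one way to guarantee that, assuming all columns are numeric and k is between 1 and the number of rows: partition once along the rows, then sort only the k selected values. The helper name top_k_per_column is hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_per_column(frame, k):
    """Return the k largest values of each column, in descending order."""
    values = frame.to_numpy()
    # After partitioning, the last k rows hold each column's k largest
    # values, in no particular order.
    part = np.partition(values, values.shape[0] - k, axis=0)[-k:]
    # Sort just those k rows, then reverse to get descending order.
    top = np.sort(part, axis=0)[::-1]
    return pd.DataFrame(top, columns=frame.columns)

data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

result = top_k_per_column(data, 3)
print(result)
```

The extra sort costs O(k log k) per column, so the whole thing stays linear in the number of rows when k is small.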
Alternative pandas solution:
In [6]: N = 3
In [7]: pd.DataFrame([data[c].nlargest(N).values.tolist() for c in data.columns],
   ...:              index=data.columns,
   ...:              columns=['{}_largest'.format(i) for i in range(1, N+1)]).T
...:
Out[7]:
first second third
1_largest 89 76 98
2_largest 56 45 87
3_largest 40 45 67
Use nlargest, like this:
In [1594]: pd.DataFrame({c: data[c].nlargest(3).values for c in data})
Out[1594]:
first second third
0 89 76 98
1 56 45 87
2 40 45 67
where data is:
In [1603]: data
Out[1603]:
first second third
first 40 13 98
second 32 45 56
third 56 76 87
fourth 12 19 12
fifth 89 45 67
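The same nlargest idea can probably also be written with DataFrame.apply instead of a dictionary comprehension. This is only a sketch of that variant: reset_index(drop=True) discards the original row labels so that the columns line up on positions 0, 1, 2.

```python
import pandas as pd

data = pd.DataFrame({
    'first': [40, 32, 56, 12, 89],
    'second': [13, 45, 76, 19, 45],
    'third': [98, 56, 87, 12, 67],
}, index=['first', 'second', 'third', 'fourth', 'fifth'])

# nlargest returns each column's top values in descending order; dropping
# the original labels lets apply align the columns by position.
result = data.apply(lambda col: col.nlargest(3).reset_index(drop=True))
print(result)
```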