I have a Pandas DataFrame with columns like this:
col1  col2  col3   col4   col5
a     a1    foo1   foo2   foo3
b     b1    foo4   foo5   foo6
c     c1    foo7   foo8   foo9
a     a2    foo10  foo11  foo12
a     a3    foo13  foo14  foo15
b     b2    foo16  foo17  foo18
I would like to sort the rows (the entire rows) of this DataFrame by descending frequency of the values in col1, and then keep just one representative row per col1 value (similar to SQL's GROUP BY). How can I do that in Pandas? I believe this is some combination of groupby and sort_values, but I'm not exactly sure how to put it together.

In the example above, a is the most frequent value in col1, followed by b and then c. So I would like the first row of the resulting DataFrame to be one of the rows whose col1 value is a, the next row to be one of the two rows with value b, and the last row to be the only row with value c.
So this is one answer:
col1  col2  col3  col4  col5
a     a1    foo1  foo2  foo3
b     b1    foo4  foo5  foo6
c     c1    foo7  foo8  foo9
but so is this one:
col1  col2  col3   col4   col5
a     a3    foo13  foo14  foo15
b     b1    foo4   foo5   foo6
c     c1    foo7   foo8   foo9
And this one:
col1  col2  col3   col4   col5
a     a2    foo10  foo11  foo12
b     b2    foo16  foo17  foo18
c     c1    foo7   foo8   foo9
Any of these is fine as the result. To be clear, mixing values from different rows is not allowed; each row must be returned exactly as it appears in the original DataFrame.
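For reference, the sample DataFrame above can be rebuilt in code like this (a minimal sketch; the column and value names simply follow the example):

```python
import pandas as pd

# Reconstruct the example DataFrame from the question
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'a', 'a', 'b'],
    'col2': ['a1', 'b1', 'c1', 'a2', 'a3', 'b2'],
    'col3': ['foo1', 'foo4', 'foo7', 'foo10', 'foo13', 'foo16'],
    'col4': ['foo2', 'foo5', 'foo8', 'foo11', 'foo14', 'foo17'],
    'col5': ['foo3', 'foo6', 'foo9', 'foo12', 'foo15', 'foo18'],
})
```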
Here is how you could do this:

1) Create a helper Series with Series.value_counts to get the frequency order of col1.

2) Index your original df with this helper Series and drop duplicate col1 values.
s = df.col1.value_counts()
df.set_index('col1').loc[s.index].reset_index().drop_duplicates('col1')
or in one line:

df2 = (df.set_index('col1')
         .loc[df.col1.value_counts().index]
         .reset_index()
         .drop_duplicates('col1'))
Output:

  col1 col2   col3   col4   col5
0    a   a1   foo1   foo2   foo3
3    b   b1   foo4   foo5   foo6
5    c   c1   foo7   foo8   foo9
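The same two ingredients can also be combined in the opposite order (a sketch of my own, not part of the answer above): drop the duplicates first, then reindex by the value_counts order:

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'a', 'a', 'b'],
    'col2': ['a1', 'b1', 'c1', 'a2', 'a3', 'b2'],
    'col3': ['foo1', 'foo4', 'foo7', 'foo10', 'foo13', 'foo16'],
    'col4': ['foo2', 'foo5', 'foo8', 'foo11', 'foo14', 'foo17'],
    'col5': ['foo3', 'foo6', 'foo9', 'foo12', 'foo15', 'foo18'],
})

# Keep one representative row per col1 value, then reorder those
# rows by descending frequency of their col1 value
order = df['col1'].value_counts().index
df2 = (df.drop_duplicates('col1')
         .set_index('col1')
         .loc[order]
         .reset_index())
```

Because drop_duplicates keeps whole rows, no values are ever mixed across rows.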
Here is a pretty straightforward alternative: first sort by col1, then drop duplicates. (Note that this sorts col1 alphabetically, which happens to coincide with the descending-frequency order in this example but is not the same thing in general.)
import pandas as pd
df = pd.read_csv('funky.csv')
df.sort_values('col1', ascending=True, inplace=True)
df
Output for part 1:

  col1 col2   col3   col4   col5
0    a   a1   foo1   foo2   foo3
3    a   a2  foo10  foo11  foo12
4    a   a3  foo13  foo14  foo15
1    b   b1   foo4   foo5   foo6
5    b   b2  foo16  foo17  foo18
2    c   c1   foo7   foo8   foo9
then simply drop the duplicates in col1:
df2 = df.drop_duplicates(['col1'])
df2
Output:

  col1 col2  col3  col4  col5
0    a   a1  foo1  foo2  foo3
1    b   b1  foo4  foo5  foo6
2    c   c1  foo7  foo8  foo9
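Because the alphabetical order of col1 only coincidentally matches the frequency order here, a more robust variant of this sort-then-dedup approach sorts by frequency explicitly. This is my own sketch, assuming pandas 1.1+ (for the key argument of sort_values):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'a', 'a', 'b'],
    'col2': ['a1', 'b1', 'c1', 'a2', 'a3', 'b2'],
    'col3': ['foo1', 'foo4', 'foo7', 'foo10', 'foo13', 'foo16'],
    'col4': ['foo2', 'foo5', 'foo8', 'foo11', 'foo14', 'foo17'],
    'col5': ['foo3', 'foo6', 'foo9', 'foo12', 'foo15', 'foo18'],
})

# Sort rows by how often their col1 value occurs (descending);
# mergesort is stable, so ties keep their original order.
# Then keep the first row seen for each col1 value.
df2 = (df.sort_values('col1',
                      key=lambda s: s.map(s.value_counts()),
                      ascending=False,
                      kind='mergesort')
         .drop_duplicates('col1'))
```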