Sort rows of a Pandas DataFrame based on aggregated count and get one row randomly

I have a Pandas DataFrame that looks like this:

col1 col2 col3  col4  col5
a    a1   foo1  foo2  foo3
b    b1   foo4  foo5  foo6
c    c1   foo7  foo8  foo9
a    a2   foo10 foo11 foo12
a    a3   foo13 foo14 foo15
b    b2   foo16 foo17 foo18

I would like to sort the rows (the entire rows) of this dataframe by descending frequency of the values in col1, and then keep just one of the rows for each distinct col1 value (similar to a SQL GROUP BY). How can I do that in Pandas? I believe this involves some combination of groupby and sort_values, but I'm not exactly sure how to put them together.

For the above example, a is the most frequent value in col1, followed by b and c. So I would like the first row of the resulting dataframe to be one of the rows whose col1 value is a. The next row should be one of the two rows with value b, and the last row is the only row with value c.

So this is one answer:

col1 col2 col3 col4 col5
a    a1   foo1 foo2 foo3
b    b1   foo4 foo5 foo6
c    c1   foo7 foo8 foo9

but so is this one:

col1 col2 col3  col4  col5
a    a3   foo13 foo14 foo15
b    b1   foo4  foo5  foo6
c    c1   foo7  foo8  foo9

And this one:

col1 col2 col3  col4  col5
a    a2   foo10 foo11 foo12
b    b2   foo16 foo17 foo18
c    c1   foo7  foo8  foo9

Any of these is fine as the result. To be clear, mixing values from different rows is not allowed; each row must be returned exactly as it appears in the original dataframe.
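
For reproducibility, the example frame can be built like this (values transcribed from the table above):

import pandas as pd

# the example table as a DataFrame
df = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'a', 'a', 'b'],
    'col2': ['a1', 'b1', 'c1', 'a2', 'a3', 'b2'],
    'col3': ['foo1', 'foo4', 'foo7', 'foo10', 'foo13', 'foo16'],
    'col4': ['foo2', 'foo5', 'foo8', 'foo11', 'foo14', 'foo17'],
    'col5': ['foo3', 'foo6', 'foo9', 'foo12', 'foo15', 'foo18'],
})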

Here is how you could do this:

1) Create a helper series with Series.value_counts to get the frequency order of the col1 values.

2) Index your original df with the helper's index and drop duplicate col1 values.

# frequency-ordered col1 values: value_counts sorts descending by count
s = df.col1.value_counts()
# reorder the rows to match that order, then keep the first row per col1 value
df.set_index('col1').loc[s.index].reset_index().drop_duplicates('col1')

or as a single chained expression:

df2 = (df.set_index('col1')
       .loc[df.col1.value_counts().index]
       .reset_index()
       .drop_duplicates('col1'))

[Output]

    col1    col2    col3    col4    col5
0   a       a1      foo1    foo2    foo3
3   b       b1      foo4    foo5    foo6
5   c       c1      foo7    foo8    foo9
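
Note that drop_duplicates always keeps the first row of each group, not a random one. To actually get one row at random per col1 value, as the title asks, one option is GroupBy.sample (available in pandas >= 1.1). A minimal sketch, reusing the value_counts ordering and the df built in the question:

# pick one random row per col1 group (GroupBy.sample needs pandas >= 1.1)
picked = df.groupby('col1').sample(n=1)

# reorder the picked rows by descending col1 frequency in the original df
order = df.col1.value_counts().index
picked = picked.set_index('col1').loc[order].reset_index()
picked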

Here is a pretty straightforward way to do this: first sort by col1, then drop duplicates. (Note that an alphabetical sort of col1 only matches the frequency order in this example by coincidence; a frequency-based variant is sketched after the output below.)

import pandas as pd

# load the sample data (a local CSV copy of the table above)
df = pd.read_csv('funky.csv')
# sort so that rows sharing a col1 value are adjacent
df.sort_values('col1', ascending=True, inplace=True)
df

output for part 1:

  col1 col2   col3   col4   col5
0    a   a1   foo1   foo2   foo3
3    a   a2  foo10  foo11  foo12
4    a   a3  foo13  foo14  foo15
1    b   b1   foo4   foo5   foo6
5    b   b2  foo16  foo17  foo18
2    c   c1   foo7   foo8   foo9

then simply drop duplicates in col1:

# keep the first row for each distinct col1 value
df2 = df.drop_duplicates(['col1'])
df2

output:

  col1 col2  col3  col4  col5
0    a   a1  foo1  foo2  foo3
1    b   b1  foo4  foo5  foo6
2    c   c1  foo7  foo8  foo9
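
As flagged above, the alphabetical sort only lines up with the frequency order here because a happens to be both the most frequent value (3 rows, versus 2 for b and 1 for c) and the alphabetically smallest. A minimal sketch that sorts by the actual counts instead, using the key argument of sort_values (pandas >= 1.1) on the df built in the question:

# map each col1 value to its count and sort by that, most frequent first
counts = df.col1.value_counts()
df_sorted = df.sort_values('col1', key=lambda s: s.map(counts), ascending=False)

# then keep the first remaining row per distinct col1 value
df2 = df_sorted.drop_duplicates('col1')
df2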
