Finding the most frequent string using groupby in pandas

I'm trying to find the name of the person who submitted the most applications in any given year over a series of years.

Each application is its own row in the dataframe. It comes with the year it was submitted, and the applicant's name.

I tried using groupby to organize the data by year and name, then applied a variety of methods such as value_counts(), count(), and max().

This is the closest I've gotten:

df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)

It produces the following output:

app_year_start        name               total_apps
2015                  John Smith         622
2013                  John Smith         614
2014                  Jane Doe           611
2016                  Jon Snow           549

My desired output:

app_year_start        name                  total_apps
2015                  top_applicant         max_num
2014                  top_applicant         max_num
2013                  top_applicant         max_num
2012                  top_applicant         max_num

Some lines of dummy data:

app_year_start        name
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  John Smith
2012                  Jane Doe
2013                  Jane Doe
2012                  John Snow
2015                  John Snow
2014                  John Smith
2015                  John Snow
2012                  John Snow
2012                  John Smith
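
For anyone who wants to reproduce this, here is a minimal sketch that loads the dummy rows above into a DataFrame. The variable name df is an assumption chosen for illustration (my real frame is called df3), and the 22 rows are written as one 11-row block repeated twice, which is how they appear in the listing.

```python
import pandas as pd

# The first 11 dummy rows, in the order listed above.
block = [
    (2012, "John Smith"), (2012, "John Smith"), (2012, "John Smith"),
    (2012, "Jane Doe"), (2013, "Jane Doe"), (2012, "John Snow"),
    (2015, "John Snow"), (2014, "John Smith"), (2015, "John Snow"),
    (2012, "John Snow"), (2012, "John Smith"),
]

# The listing repeats the same 11 rows a second time, giving 22 rows in total.
df = pd.DataFrame(block * 2, columns=["app_year_start", "name"])

# Number of applications per (year, name) pair.
print(df.groupby(["app_year_start", "name"]).size())
```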

I've consulted the following SO posts:

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

Pandas groupby nlargest sum

Get max of count() function on pandas groupby objects

Some other attempts I've made:

df3.groupby(['app_year_start'])['name'].value_counts().sort_values(ascending=False)

df3.groupby(['app_year_start','name']).count()

Any help would be appreciated. I'm also open to entirely different solutions as well.

Cross-tabulate and find max values.

(
    # cross tabulate to get each applicant's number of applications
    pd.crosstab(df['app_year_start'], df['name'])
    # the applicant with most applications and their counts
    .agg(['idxmax', 'max'], axis=1)
    # change column names
    .set_axis(['name','total_apps'], axis=1)
    # flatten df
    .reset_index()
)
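
To make the steps easier to follow, here is a self-contained sketch of the same cross-tabulation idea, spelled out with separate idxmax and max calls instead of a single agg. The counts dictionary is only a compact way to rebuild the question's dummy data, and the names counts, ct and top are chosen for illustration.

```python
import pandas as pd

# Rebuild the question's dummy data from its per-(year, name) counts.
counts = {
    (2012, "John Smith"): 8, (2012, "Jane Doe"): 2, (2012, "John Snow"): 4,
    (2013, "Jane Doe"): 2, (2014, "John Smith"): 2, (2015, "John Snow"): 4,
}
rows = [key for key, n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["app_year_start", "name"])

# One row per year, one column per applicant, cells hold application counts.
ct = pd.crosstab(df["app_year_start"], df["name"])

top = (
    ct.idxmax(axis=1).rename("name").to_frame()   # applicant with the largest count per year
      .assign(total_apps=ct.max(axis=1))          # that largest count
      .reset_index()                              # turn the year index back into a column
)
print(top)
```

If two applicants tie within a year, idxmax returns the first matching column, so only one of them is reported.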

You can use mode per group:

df.groupby('app_year_start')['name'].agg(lambda x: x.mode().iloc[0])

Or, if you want all values joined as a single string in case of a tie:

df.groupby('app_year_start')['name'].agg(lambda x: ', '.join(x.mode()))

Output:

app_year_start
2012    John Smith
2013      Jane Doe
2014    John Smith
2015     John Snow
Name: name, dtype: object
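
The dummy data has no ties, so the two variants look identical there. A small sketch with a deliberately tied year shows the difference; the year 2020 and the names "Zed" and "Amy" are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two applicants with two applications each in the same year.
tied = pd.DataFrame({
    "app_year_start": [2020, 2020, 2020, 2020],
    "name": ["Zed", "Zed", "Amy", "Amy"],
})

grouped = tied.groupby("app_year_start")["name"]

# mode() returns every tied value in sorted order, so iloc[0] keeps only the first.
first = grouped.agg(lambda x: x.mode().iloc[0])
# Joining keeps all tied values in one string.
joined = grouped.agg(lambda x: ", ".join(x.mode()))

print(first.loc[2020])   # Amy
print(joined.loc[2020])  # Amy, Zed
```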

Variant of your initial code:

(df
 .groupby(['app_year_start', 'name'])['name']
 .agg(total_apps='count')
 .sort_values(by='total_apps', ascending=False)
 .reset_index()
 .groupby('app_year_start', as_index=False)
 .first()
 )

Output:

   app_year_start        name  total_apps
0            2012  John Smith           8
1            2013    Jane Doe           2
2            2014  John Smith           2
3            2015   John Snow           4
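
A possible alternative that skips the sort, offered as a sketch rather than as part of the approach above: count with size(), then use idxmax within each year to pick the row label of the largest count. The counts dictionary only rebuilds the question's dummy data, and the names sizes and top are chosen for illustration.

```python
import pandas as pd

# Rebuild the question's dummy data from its per-(year, name) counts.
counts = {
    (2012, "John Smith"): 8, (2012, "Jane Doe"): 2, (2012, "John Snow"): 4,
    (2013, "Jane Doe"): 2, (2014, "John Smith"): 2, (2015, "John Snow"): 4,
}
rows = [key for key, n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["app_year_start", "name"])

# One row per (year, name) pair with its number of applications.
sizes = (
    df.groupby(["app_year_start", "name"])
      .size()
      .reset_index(name="total_apps")
)

# For each year, idxmax gives the row label of the largest count; loc selects those rows.
top = sizes.loc[sizes.groupby("app_year_start")["total_apps"].idxmax()].reset_index(drop=True)
print(top)
```

On a tie, idxmax keeps the first matching row within the year, so only one applicant is reported.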

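With value_counts and a groupby:

dfc = (df.value_counts()                    # counts per (year, name), largest first
         .reset_index(name='total_apps')
         # first() keeps name and count from the same row;
         # max() would take each column's maximum independently
         .groupby('app_year_start', as_index=False).first()
         .sort_values('app_year_start', ascending=False, ignore_index=True)
      )

print(dfc)

Result

   app_year_start        name  total_apps
0            2015   John Snow           4
1            2014  John Smith           2
2            2013    Jane Doe           2
3            2012  John Smith           8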
