
pandas groupby count rate

I want to summarize a CSV table using the pandas package for Python.

The table has a schema like the one below:

name_id | visit_address_no
   0    |       230
   0    |       223
   0    |       230
   2    |       120
   2    |       120
   2    |       132
   2    |       110

I want to summarize this table as follows:

name_id | visit_address_no | visit_count | visit_rate
   0    |       230        |      2      |    0.666
   0    |       223        |      1      |    0.333
   2    |       120        |      2      |    0.5
   2    |       132        |      1      |    0.25
   2    |       110        |      1      |    0.25

How can I produce this summary of the CSV table using pandas?

I tried:

gb = df.groupby(['name_no', 'visit_address_no'])
gb.size()

but this does not give me the rate, and the result is not a pandas DataFrame.

df['name_count'] = df.groupby(['name_id'])['name_id'].transform(len)
df['visit_count'] = df.groupby(['name_id', 'visit_address_no'])['name_id'].transform(len)
summary_df = df.groupby(['name_id', 'visit_address_no']).agg('mean').reset_index()
summary_df['visit_rate'] = summary_df['visit_count']/summary_df['name_count']

This adds the extra column name_count, which you can drop with summary_df.drop(['name_count'], axis=1, inplace=True). It also strikes me as somewhat inelegant; I suspect the second and third lines could be combined.
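One way the counting and rate steps might be combined, sketched here against the sample data from the question: count each (name_id, visit_address_no) pair with size(), then divide by the per-name_id total. The variable name totals and the sort=False choice are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name_id": [0, 0, 0, 2, 2, 2, 2],
    "visit_address_no": [230, 223, 230, 120, 120, 132, 110],
})

# Count rows per (name_id, visit_address_no) pair.
# sort=False keeps the groups in order of first appearance.
summary_df = (
    df.groupby(["name_id", "visit_address_no"], sort=False)
      .size()
      .reset_index(name="visit_count")
)

# Total visits per name_id, broadcast back onto every summary row.
totals = summary_df.groupby("name_id")["visit_count"].transform("sum")
summary_df["visit_rate"] = summary_df["visit_count"] / totals

print(summary_df)
```

This avoids the helper column name_count entirely, so there is nothing to drop afterwards.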

EDIT -- ah, here's the cleverer way:

df['name_count'] = df.groupby(['name_id'])['name_id'].transform(len)
grps = df.groupby(['name_id', 'visit_address_no'])['name_count']
# name_count is constant within each group, so its mean is the per-name total
summary_df = grps.agg(visit_count='count',
                      visit_rate=lambda x: len(x) / x.mean()).reset_index()

# Count each address within one name_id group and convert the counts to rates
def f(s):
    count = s.value_counts()
    rate = count / count.sum()
    return pd.DataFrame({"count": count, "rate": rate})

df2 = df.groupby("name_id")["visit_address_no"].apply(f).reset_index()
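As a variation on f, the same counts and rates can probably be obtained without a helper function by calling value_counts directly on the grouped column, once plain and once with normalize=True. This is only a sketch on the question's sample data; the output column names are chosen here to match the desired table and are set explicitly rather than taken from the index level names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name_id": [0, 0, 0, 2, 2, 2, 2],
    "visit_address_no": [230, 223, 230, 120, 120, 132, 110],
})

grouped = df.groupby("name_id")["visit_address_no"]

# Absolute counts and within-group proportions, both indexed by
# (name_id, visit_address_no).
counts = grouped.value_counts().rename("visit_count")
rates = grouped.value_counts(normalize=True).rename("visit_rate")

# Align the two Series on their shared index and flatten to columns.
summary_df = pd.concat([counts, rates], axis=1).reset_index()
summary_df.columns = ["name_id", "visit_address_no", "visit_count", "visit_rate"]

print(summary_df)
```

Within each name_id the rows come out ordered by descending count; the order of addresses with equal counts should not be relied on.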

First of all, make sure you are referencing the column properly. In your code you say

gb = df.groupby(['name_no', 'visit_address_no'])

This should be name_id, as in your dataframe.

Also make sure name_id is not your index. When creating your df, pass index_col=False (pd.DataFrame.from_csv was removed in pandas 1.0, so use pd.read_csv):

df = pd.read_csv('Book1.csv', index_col=False)
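A small sketch of the loading step with pd.read_csv, checking that name_id ends up as a regular column before grouping. The in-memory string stands in for Book1.csv and assumes a comma-separated file with a header row, which the question does not actually confirm.

```python
import io
import pandas as pd

# Hypothetical file contents; the real Book1.csv may be laid out differently.
csv_text = """name_id,visit_address_no
0,230
0,223
0,230
2,120
2,120
2,132
2,110
"""

# index_col=False tells pandas not to use the first column as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col=False)

# name_id should be listed here as an ordinary column.
print(df.columns.tolist())

# Group on the corrected column name (name_id, not name_no).
gb = df.groupby(["name_id", "visit_address_no"])
print(gb.size())
```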


 