Suppose I have a dataframe that looks like this:
import pandas as pd

d = {'User': ['A', 'A', 'B', 'C', 'C', 'C'],
     'time': [1, 2, 3, 4, 4, 4],
     'state': ['CA', 'CA', 'ID', 'OR', 'OR', 'OR']}
df = pd.DataFrame(data=d)
Now suppose I want to create a new dataframe that takes the mean and median of time, grabs each user's state, and adds a new column that counts the number of times that user appears in the User column, i.e.
d = {'User': ['A', 'B', 'C'],
     'avg_time': [1.5, 3, 4],
     'median_time': [1.5, 3, 4],
     'state': ['CA', 'ID', 'OR'],
     'user_count': [2, 1, 3]}
df_res = pd.DataFrame(data=d)
I know that I can do a groupby mean like this:
df.groupby('User')['time'].mean()
This gives me a pandas Series, which I assume I could turn into a dataframe if I wanted. But how would I compute the rest of the columns I am interested in?
Try named aggregation (pd.NamedAgg):
df.groupby('User').agg(avg_time=('time', 'mean'),
                       median_time=('time', 'median'),
                       state=('state', 'first'),
                       user_count=('time', 'count')).reset_index()
Output:
  User  avg_time  median_time state  user_count
0    A       1.5          1.5    CA           2
1    B       3.0          3.0    ID           1
2    C       4.0          4.0    OR           3
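The (column, function) tuples in the call above are shorthand for pd.NamedAgg. A minimal sketch of the explicit form, assuming pandas 0.25 or later and the same sample data as in the question (the variable name res is only illustrative):

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'User': ['A', 'A', 'B', 'C', 'C', 'C'],
    'time': [1, 2, 3, 4, 4, 4],
    'state': ['CA', 'CA', 'ID', 'OR', 'OR', 'OR'],
})

# Each keyword names an output column; NamedAgg says which input
# column to aggregate and with which function.
res = (
    df.groupby('User')
      .agg(
          avg_time=pd.NamedAgg(column='time', aggfunc='mean'),
          median_time=pd.NamedAgg(column='time', aggfunc='median'),
          state=pd.NamedAgg(column='state', aggfunc='first'),
          user_count=pd.NamedAgg(column='time', aggfunc='count'),
      )
      .reset_index()
)
print(res)
```

Note that 'count' skips missing values in time, so user_count equals the number of rows per user only when time has no NaN entries.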
You can also pass several aggregate functions per column as a dictionary, something like this:
out = df.groupby('User').agg({'time': ['mean', 'median'], 'state': ['first']})
      time        state
      mean median first
User
A      1.5    1.5    CA
B      3.0    3.0    ID
C      4.0    4.0    OR
This gives multi-level columns. You can either drop a level or join the levels:
>>> out.columns = ['_'.join(col) for col in out.columns]
>>> out
      time_mean  time_median state_first
User
A           1.5          1.5          CA
B           3.0          3.0          ID
C           4.0          4.0          OR
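The other option mentioned, dropping a level, could look like the sketch below. It assumes the same sample data and rebuilds the multi-level result first; the name flat is only illustrative. Dropping the top level keeps just the function names, which is only safe when those names are unique across columns.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'User': ['A', 'A', 'B', 'C', 'C', 'C'],
    'time': [1, 2, 3, 4, 4, 4],
    'state': ['CA', 'CA', 'ID', 'OR', 'OR', 'OR'],
})

# Dictionary-style aggregation produces two-level column labels
# such as ('time', 'mean') and ('state', 'first').
out = df.groupby('User').agg({'time': ['mean', 'median'], 'state': ['first']})

# Drop the top level (the original column names) and keep only
# the function names as flat column labels.
flat = out.copy()
flat.columns = flat.columns.droplevel(0)
print(flat)
```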