Iterating through pandas groupby groups
I have a pandas DataFrame, school_df, that looks like this:
school_id date_posted date_completed
0 A 2014-01-01 2014-01-01
1 A 2014-01-01 2014-01-08
2 A 2014-04-29 2014-05-01
3 B 2014-01-01 2014-01-01
4 B 2014-01-20 2014-02-23
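For anyone who wants to reproduce the examples, the sample frame can be rebuilt roughly as follows. This is only a sketch: the question does not state the column dtypes, so treating both date columns as datetime64 is an assumption.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: both date columns are parsed as datetime64 values.
school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]
    ),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]
    ),
})
```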
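Each row represents one project for that school. I want to add two columns: for each unique school_id, the number of projects posted before that date and the number of projects completed before that date.
The code below works, but I have about 300,000 unique schools, so it takes a very long time to run. Is there a faster way to get what I'm looking for? Thanks for your help!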
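import pandas as pd

groups = school_df.groupby("school_id")
blank_df = pd.DataFrame()
for g, df in groups:
    # For each row, count this school's projects posted / completed
    # strictly before the row's posting date
    df['school_previous_projects'] = df.date_posted.map(lambda x: len(df[df.date_posted < x]))
    df['school_previous_completed'] = df.date_posted.map(lambda x: len(df[df.date_completed < x]))
    # Growing blank_df with concat on every iteration copies it each time
    blank_df = pd.concat([blank_df, df])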
Try this. It should be faster than your for loop and two maps. Start with your frame:
school_id date_posted date_completed
0 A 2014-01-01 2014-01-01
1 A 2014-01-01 2014-01-08
2 A 2014-04-29 2014-05-01
3 B 2014-01-01 2014-01-01
4 B 2014-01-20 2014-02-23
Then define a function. getProjectCounts() uses boolean indexing and a simple count():
def getProjectCounts(row, df):
    # Same school, posted strictly before this row's posting date
    mask = (df["school_id"] == row["school_id"]) & (df["date_posted"] < row["date_posted"])
    dp_count = df[mask]["date_posted"].count()
    # Same school, completed strictly before this row's completion date
    mask = (df["school_id"] == row["school_id"]) & (df["date_completed"] < row["date_completed"])
    dc_count = df[mask]["date_completed"].count()
    return pd.Series([dp_count, dc_count])
Then apply the function row by row:
school_df[["school_previous_projects", "school_previous_completed"]] = school_df.apply(lambda x: getProjectCounts(x, school_df), axis=1)
school_id date_posted date_completed school_previous_projects \
0 A 2014-01-01 2014-01-01 0
1 A 2014-01-01 2014-01-08 0
2 A 2014-04-29 2014-05-01 2
3 B 2014-01-01 2014-01-01 0
4 B 2014-01-20 2014-02-23 1
school_previous_completed
0 0
1 1
2 2
3 0
4 1
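The row-wise apply above still scans the whole frame once per row. As a possible alternative, the same strict "earlier than this row" counts can be obtained per school with a grouped rank: with method="min", a value's rank is one plus the number of strictly smaller values in its group. The sketch below is not from the original answer. The helper name add_previous_counts is hypothetical, and it assumes neither date column contains missing values.

```python
import pandas as pd

school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]
    ),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]
    ),
})

def add_previous_counts(df):
    """Hypothetical helper: count strictly earlier dates within each school."""
    out = df.copy()
    grouped = out.groupby("school_id")
    # rank(method="min") gives 1 + (number of strictly smaller values in the group),
    # so subtracting 1 yields the count of earlier dates; ties share the same count.
    # Assumption: no missing dates, otherwise astype(int) would fail on NaN ranks.
    out["school_previous_projects"] = (
        grouped["date_posted"].rank(method="min").astype(int) - 1
    )
    out["school_previous_completed"] = (
        grouped["date_completed"].rank(method="min").astype(int) - 1
    )
    return out

result = add_previous_counts(school_df)
print(result)
```

On the five sample rows this should give the same two columns as the apply() output above, while avoiding a per-row Python function call.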
Here is a version that uses cumcount (I simplified the dates, but it should still work):
import pandas as pd

df = pd.DataFrame({'school_id': ['A', 'A', 'A', 'B', 'B'],
                   'date_posted': pd.date_range('2014-01-01', '2014-01-05'),
                   'date_completed': pd.date_range('2014-01-01', '2014-01-05')})
# cumcount numbers the rows of each group in their existing order,
# so this relies on the rows already being sorted by date within each school
posted = df.set_index('date_posted').groupby('school_id').cumcount()
comp = df.set_index('date_completed').groupby('school_id').cumcount()
df['posted'] = posted.values
df['comp'] = comp.values
print(df)
The result is:
date_completed date_posted school_id posted comp
0 2014-01-01 2014-01-01 A 0 0
1 2014-01-02 2014-01-02 A 1 1
2 2014-01-03 2014-01-03 A 2 2
3 2014-01-04 2014-01-04 B 0 0
4 2014-01-05 2014-01-05 B 1 1
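One caveat: the loop in the question counts projects completed before each row's posting date, which compares two different columns. A cumcount over a single column does not reproduce that, and cumcount also gives tied dates different numbers. A rough sketch of one way to get the cross-column count is to sort each school's completion dates and look up every posting date with numpy's searchsorted. The function name count_completed_before_posted is hypothetical, and the sample frame is rebuilt from the question's table with datetime64 columns assumed.

```python
import numpy as np
import pandas as pd

school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]
    ),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]
    ),
})

def count_completed_before_posted(df):
    """Hypothetical helper: per school, completions strictly before each posting date."""
    posted = df["date_posted"].to_numpy()
    completed = df["date_completed"].to_numpy()
    counts = np.zeros(len(df), dtype=np.int64)
    # GroupBy.indices maps each school_id to the integer positions of its rows
    for positions in df.groupby("school_id").indices.values():
        done = np.sort(completed[positions])
        # side="left" returns how many sorted completion dates are strictly
        # smaller than each posting date
        counts[positions] = np.searchsorted(done, posted[positions], side="left")
    return pd.Series(counts, index=df.index, name="school_previous_completed")

prev_completed = count_completed_before_posted(school_df)
print(prev_completed)
```

This still loops over schools in Python, but each iteration is a sort plus a binary search rather than a repeated concat, so it is likely to scale better to a few hundred thousand schools. Timing it on the real data would be needed to confirm.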