Most efficient way to work with a string column in a pandas DataFrame
I have a DataFrame with football results:
           home_team        away_team  home_team_goal_timings  away_team_goal_timings
0  Tottenham Hotspur  Manchester City                   24,56                   77,88
1         Sunderland  Birmingham City                15,40,66             16,38,43,75
2        Aston Villa  West Ham United                      14        6,44,55,63,68,90
3            Chelsea          Everton                   37,39       12,32,39,49,58,83
4            Arsenal       Stoke City        6,44,55,63,68,90                   57,71
To create the DataFrame:
import pandas as pd

data = {
    'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
    'away_team': ['Manchester City', 'Birmingham City', 'West Ham United', 'Everton', 'Stoke City'],
    'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
    'away_team_goal_timings': ['77,88', '16,38,43,75', '6,44,55,63,68,90', '12,32,39,49,58,83', '57,71'],
}
test = pd.DataFrame(data)
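I want to select from the original DataFrame all matches in which the home team scored before the 20th minute. Is it possible to filter on the column in its current format?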
You can do this with .loc and .apply. The lambda splits the string on ',' and takes the first element; it returns True if that element is below 20 and False otherwise.
print(test.loc[test.home_team_goal_timings.apply(lambda x: int(x.split(',')[0]) < 20 if x else False)])
     home_team        away_team  home_team_goal_timings  away_team_goal_timings
1   Sunderland  Birmingham City                15,40,66             16,38,43,75
2  Aston Villa  West Ham United                      14        6,44,55,63,68,90
4      Arsenal       Stoke City        6,44,55,63,68,90                   57,71
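Note: this does assume that home_team_goal_timings is sorted in ascending order. The if x check in the lambda handles the case where there are no goals (an empty string).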
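If the timings cannot be trusted to be sorted, one option is to test every goal minute rather than only the first. The following is a minimal sketch under that assumption; scored_before is a hypothetical helper name, not part of pandas, and the frame is a trimmed copy of the sample data.

```python
import pandas as pd

test = pd.DataFrame({
    'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
    'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
})

def scored_before(timings, minute=20):
    """Return True if any goal minute in the comma-separated string is below `minute`."""
    # Treat an empty string or a missing value (None/NaN) as "no goals".
    if not isinstance(timings, str) or not timings:
        return False
    # Check every minute, so the order of the values does not matter.
    return any(int(t) < minute for t in timings.split(','))

mask = test['home_team_goal_timings'].apply(scored_before)
early = test.loc[mask]
print(early)
```

On the sample data this selects the same rows as the lambda above, since those strings happen to be sorted already.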
We can use Series.str.split to split on the commas and Series.str[0] to take the first element, then check whether the integer is < 20:
m = test['home_team_goal_timings'].str.split(',').str[0].astype(int) < 20
test[m]
     home_team        away_team  home_team_goal_timings  away_team_goal_timings
1   Sunderland  Birmingham City                15,40,66             16,38,43,75
2  Aston Villa  West Ham United                      14        6,44,55,63,68,90
4      Arsenal       Stoke City        6,44,55,63,68,90                   57,71
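The astype(int) step in the str.split approach raises an error if a match has no goals, because an empty string or a missing value cannot be cast to int. A possible workaround, sketched here on a hypothetical Series that includes such rows, is to convert with pd.to_numeric and errors='coerce', so that unparseable entries become NaN and compare as False.

```python
import pandas as pd

# Hypothetical column with an empty string and a missing value mixed in.
timings = pd.Series(['24,56', '15,40,66', '', None, '6,44'])

# First goal minute per row; entries that cannot be parsed become NaN.
first_goal = pd.to_numeric(timings.str.split(',').str[0], errors='coerce')

# NaN < 20 evaluates to False, so goalless rows are excluded.
m = first_goal < 20
print(m)
```

Like the original, this only looks at the first value, so it still assumes the timings are in ascending order.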
Here is another variation:
import numpy as np
test.loc[np.vectorize(lambda r: int(r.split(',')[0]) < 20)(test.home_team_goal_timings.values)]
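Since the question asks about efficiency, it may be worth timing the variants on your own data instead of assuming one is fastest. The NumPy documentation notes that np.vectorize is provided for convenience rather than performance and is essentially a for loop. The sketch below is only an illustration: the benchmark frame (the five sample strings repeated) and the function names are assumptions, and a plain list comprehension is included as a fourth candidate that does not appear in the answers above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: the five sample strings repeated many times.
sample = ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90']
big = pd.DataFrame({'home_team_goal_timings': sample * 2_000})
col = big['home_team_goal_timings']

def with_apply():
    return col.apply(lambda x: int(x.split(',')[0]) < 20 if x else False).to_numpy()

def with_str_accessor():
    return (col.str.split(',').str[0].astype(int) < 20).to_numpy()

def with_vectorize():
    return np.vectorize(lambda r: int(r.split(',')[0]) < 20)(col.to_numpy())

def with_list_comprehension():
    # Split at most once, since only the first value is needed.
    return np.array([int(s.split(',', 1)[0]) < 20 for s in col])

candidates = [with_apply, with_str_accessor, with_vectorize, with_list_comprehension]
reference = with_apply()
for func in candidates:
    # Every variant must produce the same boolean mask before timing it.
    assert np.array_equal(func(), reference)
    seconds = timeit.timeit(func, number=3) / 3
    print(f'{func.__name__:>25}: {seconds:.4f} s per run')
```

The absolute numbers depend on the pandas and NumPy versions and on the length of the strings, so treat the output as a relative comparison only.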