Most efficient way to work with a string column in a pandas DataFrame
I have a DataFrame with soccer results in it:
home_team away_team home_team_goal_timings away_team_goal_timings
0 Tottenham Hotspur Manchester City 24,56 77,88
1 Sunderland Birmingham City 15,40,66 16,38,43,75
2 Aston Villa West Ham United 14 6,44,55,63,68,90
3 Chelsea Everton 37,39 12,32,39,49,58,83
4 Arsenal Stoke City 6,44,55,63,68,90 57,71
To create the DataFrame:
import pandas as pd

data = {'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
        'away_team': ['Manchester City', 'Birmingham City', 'West Ham United', 'Everton', 'Stoke City'],
        'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
        'away_team_goal_timings': ['77,88', '16,38,43,75', '6,44,55,63,68,90', '12,32,39,49,58,83',
                                   '57,71']}
test = pd.DataFrame(data)
I would like to select from the original DataFrame all games in which the home team scored before the 20th minute. Is it possible to filter on the column in its current format?
You could do so using .loc and .apply. The lambda splits the string on ',' and takes the first element. If that is lower than 20 it returns True, else False.
print(test.loc[test.home_team_goal_timings.apply(lambda x: int(x.split(',')[0]) < 20 if x else False)])
home_team away_team home_team_goal_timings away_team_goal_timings
1 Sunderland Birmingham City 15,40,66 16,38,43,75
2 Aston Villa West Ham United 14 6,44,55,63,68,90
4 Arsenal Stoke City 6,44,55,63,68,90 57,71
Note: this assumes the home_team_goal_timings are in ascending order. The if x check in the lambda handles the case of no goals (an empty string or None).
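If the ascending-order assumption cannot be relied on, one possible variation is to compare the smallest minute in each string rather than the first one. The sketch below is only an illustration: the helper name scored_before and its minute parameter are made up for this example, and the sample data is shortened and deliberately contains an unsorted entry and an empty string.

```python
import pandas as pd

def scored_before(timings, minute=20):
    """Return True if any comma-separated goal minute is below `minute`.

    Hypothetical helper: treats anything that is not a non-empty string
    (empty string, None, NaN) as "no goals".
    """
    if not isinstance(timings, str) or not timings.strip():
        return False
    # min() over all goals, so the order of the timings does not matter
    return min(int(t) for t in timings.split(',')) < minute

# Shortened sample data; '40,15,66' is unsorted and '' means no goals.
test = pd.DataFrame({
    'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
    'home_team_goal_timings': ['24,56', '40,15,66', '14', '', '6,44,55,63,68,90'],
})

mask = test['home_team_goal_timings'].apply(scored_before)
result = test.loc[mask]
print(result)
```

This still runs a Python function per row, so it trades some speed for robustness.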
We can use Series.str.split to split on the commas and grab the first element with Series.str[0], then check whether the integer is < 20:
m = test['home_team_goal_timings'].str.split(',').str[0].astype(int) < 20
test[m]
home_team away_team home_team_goal_timings away_team_goal_timings
1 Sunderland Birmingham City 15,40,66 16,38,43,75
2 Aston Villa West Ham United 14 6,44,55,63,68,90
4 Arsenal Stoke City 6,44,55,63,68,90 57,71
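A related sketch, assuming the timings might not be sorted: split with expand=True to get one column per goal, convert the columns to numbers, and take the row-wise minimum. The variable names minutes and first_goal are illustrative, and the sample data is a shortened copy of the question's data with one entry reordered.

```python
import pandas as pd

test = pd.DataFrame({
    'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
    'home_team_goal_timings': ['24,56', '40,15,66', '14', '37,39', '6,44,55,63,68,90'],
})

minutes = (
    test['home_team_goal_timings']
    .str.split(',', expand=True)            # one column per goal, missing where a row has fewer goals
    .apply(pd.to_numeric, errors='coerce')  # strings -> numbers, missing -> NaN
)
first_goal = minutes.min(axis=1)  # earliest goal per match; NaN values are skipped
m = first_goal < 20               # a comparison against NaN evaluates to False
print(test[m])
```

The intermediate frame has as many columns as the highest-scoring row has goals, which is fine for match data but worth keeping in mind for much longer lists.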
Here is one more variation:
import numpy as np
test.loc[np.vectorize(lambda r: int(r.split(',')[0]) < 20)(test.home_team_goal_timings.values)]
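Since the question asks about efficiency, it may help to confirm that the variants agree before timing them on real data. The sketch below builds the same boolean mask four ways on the question's home-team column; the plain list comprehension is an extra variant added here for comparison, not one of the original answers. Note that np.vectorize is documented as a convenience wrapper around a loop, so it should not be expected to outperform the others by itself.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
})
col = test['home_team_goal_timings']

# The three approaches from the answers above
mask_apply = col.apply(lambda x: int(x.split(',')[0]) < 20 if x else False)
mask_str = col.str.split(',').str[0].astype(int) < 20
mask_vec = np.vectorize(lambda r: int(r.split(',')[0]) < 20)(col.values)

# Extra variant (an addition for comparison): a plain list comprehension
mask_list = np.array([int(r.split(',', 1)[0]) < 20 for r in col])

# All four should select the same rows
assert mask_apply.tolist() == mask_str.tolist() == mask_vec.tolist() == mask_list.tolist()
print(mask_list.tolist())  # [False, True, True, False, True]
```

Which variant is fastest depends on the data size and pandas version, so measuring with timeit on the actual DataFrame is the safest way to decide.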