
Most efficient way to work with a string column in a pandas Dataframe

I have a DataFrame with soccer results in it:

   home_team             away_team         home_team_goal_timings   away_team_goal_timings
0  Tottenham Hotspur     Manchester City   24,56                    77,88
1  Sunderland            Birmingham City   15,40,66                 16,38,43,75
2  Aston Villa           West Ham United   14                       6,44,55,63,68,90
3  Chelsea               Everton           37,39                    12,32,39,49,58,83  
4  Arsenal               Stoke City        6,44,55,63,68,90         57,71

To create the DataFrame:

import pandas as pd

data = {'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
        'away_team': ['Manchester City', 'Birmingham City', 'West Ham United', 'Everton', 'Stoke City'],
        'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
        'away_team_goal_timings': ['77,88', '16,38,43,75', '6,44,55,63,68,90', '12,32,39,49,58,83',
                                   '57,71']}

test = pd.DataFrame(data)

I would like to select from the original DataFrame all games in which the home team scored before the 20th minute. Is it possible to filter on the column in its current format?

You could do so using .loc and .apply. The lambda splits the string on ',' and takes the first element. If that value is lower than 20 it returns True, otherwise False.

print(test.loc[test.home_team_goal_timings.apply(lambda x: int(x.split(',')[0]) < 20 if x else False)])


     home_team        away_team home_team_goal_timings away_team_goal_timings
1   Sunderland  Birmingham City               15,40,66            16,38,43,75
2  Aston Villa  West Ham United                     14       6,44,55,63,68,90
4      Arsenal       Stoke City       6,44,55,63,68,90                  57,71

Note: this assumes the values in home_team_goal_timings are in ascending order, so that the first element is the earliest goal. The if x check in the lambda handles the case of no goals (an empty string).
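If the timings cannot be assumed to be sorted, one option is to compare the minimum of all values instead of the first element. The following is only a sketch: the games frame and the scored_before helper are hypothetical names made up for illustration, and it assumes every cell is a string (possibly empty), not NaN.

```python
import pandas as pd

# Hypothetical sample: row 0 is deliberately unsorted, row 2 has no goals.
games = pd.DataFrame({
    'home_team': ['A', 'B', 'C'],
    'home_team_goal_timings': ['56,12', '24,56', ''],
})

def scored_before(timings, minute=20):
    """Return True if any comma-separated minute in `timings` is below `minute`."""
    if not timings:  # empty string means no goals
        return False
    return min(int(t) for t in timings.split(',')) < minute

mask = games['home_team_goal_timings'].apply(scored_before)
early = games.loc[mask]
```

Here only team A is kept, because its earliest goal (minute 12) is not the first element of the string.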

We can use Series.str.split to split on the commas and take the first element with Series.str[0], then check whether that integer is < 20:

m = test['home_team_goal_timings'].str.split(',').str[0].astype(int) < 20
test[m]

     home_team        away_team home_team_goal_timings away_team_goal_timings
1   Sunderland  Birmingham City               15,40,66            16,38,43,75
2  Aston Villa  West Ham United                     14       6,44,55,63,68,90
4      Arsenal       Stoke City       6,44,55,63,68,90                  57,71
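One caveat with this version: astype(int) raises a ValueError on an empty string, so it fails if a game has no home goals. A possible workaround, sketched here on a hypothetical timings Series, is to convert with pd.to_numeric and errors='coerce', which turns unparseable values into NaN. Comparing NaN with < 20 yields False, so those rows are simply excluded.

```python
import pandas as pd

# Hypothetical sample with one empty entry (a game without home goals).
timings = pd.Series(['24,56', '15,40,66', '', '6,44'])

# First element of each split; empty or invalid strings become NaN.
first_goal = pd.to_numeric(timings.str.split(',').str[0], errors='coerce')

# NaN < 20 evaluates to False, so goalless games drop out of the mask.
m = first_goal < 20
```

Like the original, this still assumes the timings within each string are in ascending order.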

Here is one more variation, using np.vectorize:

import numpy as np

test.loc[np.vectorize(lambda r: int(r.split(',')[0]) < 20)(test.home_team_goal_timings.values)]
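Keep in mind that the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so it should not be expected to behave like a truly vectorized operation. A plain list comprehension expresses the same logic; this sketch rebuilds only the two columns it needs from the question's data and makes no claim about relative speed.

```python
import pandas as pd

test = pd.DataFrame({
    'home_team': ['Tottenham Hotspur', 'Sunderland', 'Aston Villa', 'Chelsea', 'Arsenal'],
    'home_team_goal_timings': ['24,56', '15,40,66', '14', '37,39', '6,44,55,63,68,90'],
})

# Build a boolean list in pure Python; .loc accepts it as a row mask.
mask = [int(s.split(',')[0]) < 20 for s in test['home_team_goal_timings']]
result = test.loc[mask]
```

Which variant is fastest depends on the data size and pandas version, so it is worth timing them on your own DataFrame.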
