简体   繁体   中英

What is the optimal way to create a new column in Pandas dataframe based on conditions from another row?

I have a Pandas dataframe, week1_plays in the following format:

在此处输入图像描述

What I want to do is add a column week1_plays['distance_from_receiver'] such that for each row in the dataframe, we grab the keys of gameId, playId, frameId and find the x and y position of the player with those keys and position == 'WR' . Then I'll calculate the distance from the receiver with the following function:

def get_distance(rec_x, rec_y, def_x, def_y):
    distance = np.sqrt( ((def_x - rec_x)**2) + ((def_y - rec_y)**2) )
    return distance

For example using the sample provided, the row 0 input to the function would be

get_distance(91.35, 44.16, 88.89, 36.47)

The current solution I have is to use a lambda function on the dataframe as such:

week1_topReceivers['distance_from_receiver'] = week1_topReceivers.apply(lambda row: get_distance(week1_wr_position.loc[np.where((week1_topReceivers['playId'] == row['playId']) & (week1_topReceivers['frameId'] == row['frameId']) & (week1_topReceivers['gameId'] == row['frameId']))]['x'],
week1_topReceivers.loc[np.where((week1_topReceivers['playId'] == row['playId']) & (week1_topReceivers['frameId'] == row['frameId']) & (week1_topReceivers['gameId'] == row['frameId']))]['y'], row['x'], row['y']), axis = 1)

but querying the dataframe for the first two inputs takes a very long time with a large dataframe. I know there has to be a more optimal solution to this but my searches online aren't turning up any better options.

EDIT: Here is a larger sample and the expected output:

SAMPLE

x   y   o   dir event   position    frameId team    gameId  playId  playDirection   route
88.89   36.47   105.63  66.66   None    SS  1   home    2018090600  75  left    NaN
91.35   44.16   290.45  16.86   None    WR  1   away    2018090600  75  left    HITCH
86.31   22.01   70.12   168.91  None    FS  1   home    2018090600  75  left    NaN
73.64   28.70   103.05  219.41  None    FS  1   home    2018090600  75  left    NaN
86.48   31.12   95.90   33.36   None    MLB 1   home    2018090600  75  left    NaN
82.67   20.53   81.14   174.57  None    CB  1   home    2018090600  75  left    NaN
84.00   43.49   108.23  110.32  None    CB  1   home    2018090600  75  left    NaN
85.63   26.59   87.69   38.80   None    LB  1   home    2018090600  75  left    NaN
88.89   36.47   105.63  68.49   None    SS  2   home    2018090600  75  left    NaN
91.37   44.17   290.45  29.61   None    WR  2   away    2018090600  75  left    HITCH
86.32   22.00   70.88   119.04  None    FS  2   home    2018090600  75  left    NaN
73.64   28.70   104.57  228.17  None    FS  2   home    2018090600  75  left    NaN
86.48   31.11   101.10  30.26   None    MLB 2   home    2018090600  75  left    NaN
82.68   20.53   82.24   147.46  None    CB  2   home    2018090600  75  left    NaN
84.02   43.49   107.33  106.73  None    CB  2   home    2018090600  75  left    NaN
85.64   26.61   87.69   37.51   None    LB  2   home    2018090600  75  left    NaN
88.88   36.47   107.02  57.53   None    SS  3   home    2018090600  75  left    NaN
91.37   44.17   290.45  32.20   None    WR  3   away    2018090600  75  left    HITCH
86.33   22.00   71.88   93.49   None    FS  3   home    2018090600  75  left    NaN
73.63   28.69   104.57  227.74  None    FS  3   home    2018090600  75  left    NaN

EXPECTED OUTPUT:

x   y   o   dir event   position    frameId team    gameId  playId  playDirection   route   distance_from_receiver
88.89   36.47   105.63  66.66   None    SS  1   home    2018090600  75  left    NaN 8.07
91.35   44.16   290.45  16.86   None    WR  1   away    2018090600  75  left    HITCH   0.00
86.31   22.01   70.12   168.91  None    FS  1   home    2018090600  75  left    NaN 22.72
73.64   28.70   103.05  219.41  None    FS  1   home    2018090600  75  left    NaN 23.51
86.48   31.12   95.90   33.36   None    MLB 1   home    2018090600  75  left    NaN 13.92
82.67   20.53   81.14   174.57  None    CB  1   home    2018090600  75  left    NaN 25.17
84.00   43.49   108.23  110.32  None    CB  1   home    2018090600  75  left    NaN 7.38
85.63   26.59   87.69   38.80   None    LB  1   home    2018090600  75  left    NaN 18.48
88.89   36.47   105.63  68.49   None    SS  2   home    2018090600  75  left    NaN 8.09
91.37   44.17   290.45  29.61   None    WR  2   away    2018090600  75  left    HITCH   0.00
86.32   22.00   70.88   119.04  None    FS  2   home    2018090600  75  left    NaN 22.74
73.64   28.70   104.57  228.17  None    FS  2   home    2018090600  75  left    NaN 23.53
86.48   31.11   101.10  30.26   None    MLB 2   home    2018090600  75  left    NaN 13.95
82.68   20.53   82.24   147.46  None    CB  2   home    2018090600  75  left    NaN 25.19
84.02   43.49   107.33  106.73  None    CB  2   home    2018090600  75  left    NaN 7.39
85.64   26.61   87.69   37.51   None    LB  2   home    2018090600  75  left    NaN 18.47
88.88   36.47   107.02  57.53   None    SS  3   home    2018090600  75  left    NaN 8.09
91.37   44.17   290.45  32.20   None    WR  3   away    2018090600  75  left    HITCH   0.00
86.33   22.00   71.88   93.49   None    FS  3   home    2018090600  75  left    NaN 22.74
73.63   28.69   104.57  227.74  None    FS  3   home    2018090600  75  left    NaN 23.54

You are looking for a merge or join operation. Try something like this:

df = pd.DataFrame({'gameId':[1,1,1,1,1,1],'playId':[1,1,1,1,1,1],
                   'frameId':[1,1,1,2,2,2], 'position':['A','B','WR','C','WR','D'],
                   'x':[87,56,45,34,45,67], 'y':[25,36,47,365,25,36]})

# create a table with just the wide receiver positions:
wr = df.loc[df.position=='WR'].drop(columns='position')

# merge the wide receiver x,y values into the original table based on the keys:
df = df.merge(wr, how='outer', on=['gameId', 'playId', 'frameId'], suffixes=['', '_wr'])

# apply your function to calculate the column (avoid using apply because it's super slow)
df['dist_from_wr'] = [get_distance(x, y, x_wr, y_wr) for x, y, x_wr, y_wr
                      in zip(df.x, df.y, df.x_wr, df.y_wr)]

Note as well, that you're lucky here because your function is already vectorized (which is not always the case) so you can actually do this even more efficiently by passing entire columns as input arguments as follows:

df['dist_from_wr'] = get_distance(df.x, df.y, df.x_wr, df.y_wr)

Result:

| gameId | playId | frameId | position |   x |   y | x_wr | y_wr | dist_from_wr |
|-------:|-------:|--------:|:---------|----:|----:|-----:|-----:|-------------:|
|      1 |      1 |       1 | A        |  87 |  25 |   45 |   47 |      47.4131 |
|      1 |      1 |       1 | B        |  56 |  36 |   45 |   47 |      15.5563 |
|      1 |      1 |       1 | WR       |  45 |  47 |   45 |   47 |       0      |
|      1 |      1 |       2 | C        |  34 | 365 |   45 |   25 |     340.178  |
|      1 |      1 |       2 | WR       |  45 |  25 |   45 |   25 |       0      |
|      1 |      1 |       2 | D        |  67 |  36 |   45 |   25 |      24.5967 |

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM