
Calculate the percentage of values that meet multiple conditions in DataFrame

I have a DataFrame with information from every single March Madness game since 1985. Now I am trying to calculate the percentage of wins by the higher seed by round. The main DataFrame looks like this:

[image: enter image description here]

I thought the best way to do this was to create separate functions. The first one returns Team when Score is higher than Score.1, and Team.1 when Score.1 is higher than Score, then appends those at the end of the function. The next one does the same for seeds: it returns Team when Seed.1 is higher than Seed, and Team.1 when Seed is higher than Seed.1, then appends. The last function handles the case where those are equal.

def func1(x):
    if tourney.loc[tourney['Score']] > tourney.loc[tourney['Score.1']]:
        return tourney.loc[tourney['Team']]
    elif tourney.loc[tourney['Score.1']] > tourney.loc[tourney['Score']]:
        return tourney.loc[tourney['Team.1']]

func1(tourney.loc[tourney['Score']])

You can apply a row-wise function by applying a lambda function to the entire DataFrame with axis=1. This gives you a True/False column, 'low_seed_wins'.

With the new True/False column you can take the count and the sum (the count being the number of games, and the sum being the number of lower-seed victories). Dividing the sum by the count gives the win ratio.

This only works because your lower-seed teams are always on the left. If they are not, it will be a little more complex.

import pandas as pd
df = pd.DataFrame([[1987,3,1,74,68,5],[1987,3,2,87,81,6],[1987,4,1,84,81,2],[1987,4,1,75,79,2]], columns=['Year','Round','Seed','Score','Score.1','Seed.1'])

df['low_seed_wins'] = df.apply(lambda row: row['Score'] > row['Score.1'], axis=1)

df = df.groupby(['Year','Round'])['low_seed_wins'].agg(['count','sum']).reset_index()

df['ratio'] = df['sum'] / df['count']

df.head()


   Year  Round  count  sum  ratio
0  1987      3      2  2.0    1.0
1  1987      4      2  1.0    0.5
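
If the lower seed is not always in the left-hand columns, one possible approach is to compare "the left team has the lower seed" with "the left team won". This is only a sketch under assumptions: the toy rows below are made up for illustration, the column names follow the example above, and games between equal seeds are dropped because "lower seed" is undefined for them.

```python
import pandas as pd

# Made-up toy data; the lower seed is on the right in rows 1 and 3.
games = pd.DataFrame(
    [[1987, 3, 1, 74, 68, 5],
     [1987, 3, 6, 81, 87, 2],
     [1987, 4, 1, 84, 81, 2],
     [1987, 4, 2, 79, 75, 1]],
    columns=['Year', 'Round', 'Seed', 'Score', 'Score.1', 'Seed.1'],
)

# Drop games between equal seeds (assumption: they are excluded from the ratio).
games = games[games['Seed'] != games['Seed.1']].copy()

left_is_low = games['Seed'] < games['Seed.1']   # lower seed is the left team
left_won = games['Score'] > games['Score.1']    # left team won the game

# The lower seed won when both flags agree: it was on the left and the left won,
# or it was on the right and the right won.
games['low_seed_wins'] = left_is_low == left_won

# The mean of a boolean column is the share of True values.
ratio = games.groupby(['Year', 'Round'])['low_seed_wins'].mean()
print(ratio)
```

Taking the mean of the boolean column is equivalent to dividing the sum by the count, so the separate 'count' and 'sum' columns are optional here.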

You should be able to calculate this by checking both conditions, for both the first and the second team. This returns a boolean Series, the sum of which is the number of cases where it is true. Then just divide by the length of the whole DataFrame to get the percentage. Without test data it is hard to check exactly.

(
    ((tourney['Seed'] > tourney['Seed.1']) &
     (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) &
     (tourney['Score.1'] > tourney['Score']))
).sum() / len(tourney)
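
Since the question asks for the percentage by round, the same boolean mask could be grouped by the Round column rather than divided by the length of the whole DataFrame. The following is a minimal sketch: the rows in `tourney` are invented for illustration, and the existence of a 'Round' column is an assumption based on the example in the other answer.

```python
import pandas as pd

# Invented sample rows; the real tourney DataFrame has more columns.
tourney = pd.DataFrame({
    'Round':   [1, 1, 1, 2],
    'Seed':    [1, 9, 5, 1],
    'Score':   [80, 70, 60, 65],
    'Seed.1':  [16, 8, 12, 9],
    'Score.1': [60, 65, 72, 70],
})

# True where the team with the larger seed number also had the larger score.
higher_seed_number_won = (
    ((tourney['Seed'] > tourney['Seed.1']) &
     (tourney['Score'] > tourney['Score.1'])) |
    ((tourney['Seed.1'] > tourney['Seed']) &
     (tourney['Score.1'] > tourney['Score']))
)

# Group the mask by round; the mean of booleans is the fraction of True values.
pct_by_round = higher_seed_number_won.groupby(tourney['Round']).mean() * 100
print(pct_by_round)
```

Note that pandas combines boolean Series element-wise with `&` and `|`, and each comparison needs its own parentheses because those operators bind more tightly than `>`.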

