How to calculate the ratio of values in a pandas dataframe column?
I'm new to pandas and decided to learn it by playing around with some data I pulled from my favorite game's API. I have a dataframe with two columns, "playerId" and "winner", like so:
playerStatus:
______________________
playerId winner
0 1848 True
1 1988 False
2 3543 True
3 1848 False
4 1988 False
...
Each row represents a match the player participated in. My goal is to either transform this dataframe or create a new one such that the win percentage for each playerId is calculated. For example, the above dataframe would become:
playerWinsAndTotals
_________________________________________
playerId wins totalPlayed winPct
0 1848 1 2 50.0000
1 1988 0 2 0.0000
2 3543 1 1 100.0000
...
It took quite a while of reading the pandas docs, but I managed to achieve this by creating two different tables (one with the number of wins for each player, one with the total games for each player), merging them, and then taking the ratio of wins to games played.
Creating the "wins" dataframe:
temp_df = playerStatus[['playerId', 'winner']].value_counts().reset_index(name='wins')
onlyWins = temp_df[temp_df['winner'] == True][['playerId', 'wins']]
onlyWins
_________________________
playerId wins
1 1670 483
3 1748 474
4 2179 468
6 4006 434
8 1668 392
...
Creating the "totals" dataframe:
totalPlayed = playerStatus['playerId'].value_counts().reset_index(name='totalCount').rename(columns={'index': 'playerId'})
totalPlayed
____________________
playerId totalCount
0 1670 961
1 1748 919
2 1872 877
3 4006 839
4 2179 837
...
Finally, merging them and adding the "winPct" column:
playerWinsAndTotals = onlyWins.merge(totalPlayed, on='playerId', how='left')
playerWinsAndTotals['winPct'] = playerWinsAndTotals['wins']/playerWinsAndTotals['totalCount'] * 100
playerWinsAndTotals
_____________________________________________
playerId wins totalCount winPct
0 1670 483 961 50.260146
1 1748 474 919 51.577802
2 2179 468 837 55.913978
3 4006 434 839 51.728248
4 1668 392 712 55.056180
...
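One thing worth checking in the merge approach above: because "onlyWins" keeps only rows where winner is True, and the left merge starts from "onlyWins", a player with no wins at all appears to drop out of the result. The sketch below reproduces this on the five-row sample from the top of the question. The `fixed` frame and the use of `rename_axis` are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Sample data from the question (player 1988 never wins).
playerStatus = pd.DataFrame({
    'playerId': [1848, 1988, 3543, 1848, 1988],
    'winner': [True, False, True, False, False],
})

# Wins per player, as in the question: rows with winner == False are filtered out.
temp_df = playerStatus[['playerId', 'winner']].value_counts().reset_index(name='wins')
onlyWins = temp_df[temp_df['winner'].astype(bool)][['playerId', 'wins']]

# Total games per player; rename_axis keeps the column name stable across pandas versions.
totalPlayed = (
    playerStatus['playerId'].value_counts()
    .rename_axis('playerId')
    .reset_index(name='totalCount')
)

# Left merge starting from onlyWins, as in the question: player 1988 is missing.
merged = onlyWins.merge(totalPlayed, on='playerId', how='left')
print(sorted(merged['playerId']))

# Hypothetical fix: start from the totals and fill missing win counts with 0.
fixed = totalPlayed.merge(onlyWins, on='playerId', how='left').fillna({'wins': 0})
fixed['winPct'] = fixed['wins'] / fixed['totalCount'] * 100
print(fixed.sort_values('playerId'))
```

Starting the merge from the totals keeps every player, so a 0% win rate is reported instead of being silently omitted.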
Now, the reason I am posting this here is that I know I'm not taking full advantage of what pandas has to offer. Creating and merging two different dataframes just to find the ratio of player wins seems unnecessary. I feel like I took the "scenic" route on this one.
To anyone more experienced than me, how would you tackle this problem?
We can take advantage of the way that Boolean values are handled mathematically (True being 1 and False being 0) and use 3 aggregation functions, sum, count and mean, per group (groupby aggregate). We can also take advantage of Named Aggregation to both create and rename the columns in one step:
df = (
    df.groupby('playerId', as_index=False)
    .agg(wins=('winner', 'sum'),
         totalCount=('winner', 'count'),
         winPct=('winner', 'mean'))
)
# Scale up winPct
df['winPct'] *= 100
df
playerId wins totalCount winPct
0 1848 1 2 50.0
1 1988 0 2 0.0
2 3543 1 1 100.0
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
'playerId': [1848, 1988, 3543, 1848, 1988],
'winner': [True, False, True, False, False]
})
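For reference, the pieces of this answer can be put together into one self-contained script. The sketch below also folds the scaling step into the aggregation by using a lambda for winPct. That variation and the name `stats` are assumptions for illustration, not part of the answer as posted.

```python
import pandas as pd

df = pd.DataFrame({
    'playerId': [1848, 1988, 3543, 1848, 1988],
    'winner': [True, False, True, False, False],
})

# True counts as 1 and False as 0, so:
#   sum   -> number of wins
#   count -> number of matches played
#   mean  -> fraction of matches won (scaled to a percentage here)
stats = (
    df.groupby('playerId', as_index=False)
    .agg(wins=('winner', 'sum'),
         totalCount=('winner', 'count'),
         winPct=('winner', lambda s: s.mean() * 100))
)
print(stats)
```

A lambda is typically slower than the built-in 'mean' string on large data, so multiplying the column by 100 afterwards, as in the answer, is a reasonable choice when performance matters.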
You can try something like this:
import pandas as pd
df = pd.read_csv('data.csv')
# If for any reason winner column is a string and not a boolean try
# import numpy as np
# df['winner'] = np.where(df['winner'] == 'True', 1, 0)
df = df.groupby('playerId')['winner'].agg(['count', 'sum'])
df['percentage'] = 100 * df['sum'] / df['count']
df = df.rename(columns={'count': 'total', 'sum': 'wins'})
print(df)
prints
total wins percentage
playerId
1848 2 1 50.0
1988 2 0 0.0
3543 1 1 100.0
Data I used:
playerId,winner
1848,True
1988,False
3543,True
1848,False
1988,False
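The commented-out hint about a string-typed winner column can be tried without a file on disk. The sketch below reads the same data from an in-memory string and forces the column to strings with `dtype` to simulate that situation (by default `read_csv` would parse True/False as booleans here). The variable names `csv_text` and `stats` are assumptions for illustration.

```python
import io

import pandas as pd

csv_text = """playerId,winner
1848,True
1988,False
3543,True
1848,False
1988,False
"""

# Force 'winner' to be read as text to simulate a string column.
df = pd.read_csv(io.StringIO(csv_text), dtype={'winner': str})

# Convert the strings to real booleans; anything other than 'True' becomes False.
df['winner'] = df['winner'].eq('True')

stats = df.groupby('playerId')['winner'].agg(['count', 'sum'])
stats['percentage'] = 100 * stats['sum'] / stats['count']
stats = stats.rename(columns={'count': 'total', 'sum': 'wins'})
print(stats)
```

Comparing against the literal 'True' treats any unexpected value as a loss, so it is worth inspecting `df['winner'].unique()` first if the data source is not fully trusted.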
In your case, the mean alone already gives the win rate (as a fraction between 0 and 1):
out = df.groupby('playerId')['winner'].agg(['sum','count','mean'])
Out[22]:
sum count mean
playerId
1848 1 2 0.5
1988 0 2 0.0
3543 1 1 1.0
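To get from this output to the column names and percentage scale used in the question, one possible follow-up is sketched below. It assumes keyword-style named aggregation on the grouped column and an `assign` step for the scaling, neither of which is in the answer as posted.

```python
import pandas as pd

df = pd.DataFrame({
    'playerId': [1848, 1988, 3543, 1848, 1988],
    'winner': [True, False, True, False, False],
})

out = (
    df.groupby('playerId')['winner']
    # Keyword names become the output column names.
    .agg(wins='sum', totalPlayed='count', winPct='mean')
    # Convert the fraction of wins into a percentage.
    .assign(winPct=lambda d: d['winPct'] * 100)
    # Turn the playerId index back into a regular column.
    .reset_index()
)
print(out)
```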
Try:
import pandas as pd
import numpy as np
df = pd.DataFrame({'playerId': {0: 1848, 1: 1988, 2: 3543, 3: 1848, 4: 1988},
                   'winner': {0: True, 1: False, 2: True, 3: False, 4: False}})
s = df.groupby('playerId')['winner'].apply(lambda x: (np.sum(x)/len(x)*100))
df = (df.groupby('playerId')
        .agg({'playerId':'count', 'winner': 'sum'})
        .rename(columns={'winner':'wins','playerId':'totalPlayed'})
        .reset_index()
     )
df['winPct'] = df['playerId'].map(s)
df = df[['playerId', 'wins', 'totalPlayed', 'winPct']]
print(df)
playerId wins totalPlayed winPct
0 1848 1 2 50.0
1 1988 0 2 0.0
2 3543 1 1 100.0
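Notice: Technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.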