[英]How do I use Pandas to get the sum of 2 columns based on the value of another column in a dataframe
I have a dataset with a lot of columns and I already have filtered the ones that I need:我有一个包含很多列的数据集,并且我已经过滤了我需要的那些:
import pandas as pd
data = pd.read_csv("./data/GL1871.txt", header=None, usecols=[3,6,9,10])
new_data = data.rename(columns={3: 'Away', 6: 'Home', 9: 'Away runs', 10: 'Home runs'})
What I want to get out of this dataframe is the sum of Column 'Away runs' and 'Home runs' for each team.我想从这个数据框中得到的是每个团队的“客场跑”和“本垒打”列的总和。 The output should look something like this:
输出应如下所示:
0 CL1 364
1 BS1 254
...
9 CH1 190
So far I have tried the groupby() method and the output is not what I really need:到目前为止,我已经尝试了 groupby() 方法,但输出并不是我真正需要的:
runs_away = new_data.groupby( 'Home')['Away runs'].sum()
runs_away = new_data.groupby( 'Home')['Home runs'].sum()
Home
BS1 165
CH1 139
CL1 162
FW1 112
NY2 127
PH1 124
RC1 66
TRO 231
WS3 120
Name: Away runs, dtype: int64
Home
BS1 223
CH1 197
CL1 119
FW1 78
NY2 178
PH1 180
RC1 72
TRO 200
WS3 166
Name: Home runs, dtype: int64
Is there any smart way to do this and getting both values at the same time?有没有什么聪明的方法可以做到这一点并同时获得两个值? Maybe a comprehension would be better, but I don't know how to iterate a Dataframe.
也许理解会更好,但我不知道如何迭代 Dataframe。 Thank you in advance.
先感谢您。
Also, the expected result is to get the sum of the runs in Away runs and Home runs for the same team.此外,预期结果是获得同一团队的客场跑和本垒打的总和。 So sum Away runs where the Away is team x + Home runs where Home is team x as well
所以总和 客场跑,其中客队是 x 队 + 本垒打,主场也是 x 队
Adding sample from the dataset:从数据集中添加样本:
Away Home Away runs Home runs
0 CL1 FW1 0 2
1 BS1 WS3 20 18
2 CL1 RC1 12 4
3 CL1 CH1 12 14
4 BS1 TRO 9 5
Desired output:期望的输出:
0 CL1 364
1 BS1 254
...
9 CH1 190
#Where 364 is the sum of all runs of the team CL1 wether it was away or home #其中 364 是 CL1 团队所有运行的总和,无论是客场还是主场
Away Home Away runs Home runs
0 CL1 FW1 0 2
1 BS1 WS3 20 18
2 CL1 RC1 12 4
3 CL1 CH1 12 14
4 BS1 TRO 9 5
5 CH1 CL1 18 10
6 WS3 CL1 12 8
7 CH1 FW1 14 5
8 WS3 FW1 6 12
9 TRO BS1 29 14
10 WS3 CH1 4 14
It depends on how the dataframe is structured,but if the run_aways are pandas series, you can simply add them.这取决于数据框的结构,但如果 run_aways 是熊猫系列,您可以简单地添加它们。 You could also do it in one line like :
您也可以在一行中完成,例如:
runs = new_data.groupby( 'Home')['Away runs'].sum() + new_data.groupby( 'Home')['Home runs'].sum()
You could also use a mask and the .apply method but I find it les readable.你也可以使用掩码和 .apply 方法,但我觉得它不太可读。
Pretty sure I understand what you're trying to do.我很确定我明白你想要做什么。 You'll need to do some reshaping of your data to stack the teams and runs so that they're in 2 long columns.
您需要对数据进行一些重塑以堆叠团队并运行,以便它们位于 2 个长列中。 Then you can perform a groupby operation to get the total number of runs per team.
然后您可以执行 groupby 操作以获取每个团队的总运行次数。
away = df[["Away", "Away runs"]]
home = df[["Home", "Home runs"]]
new_df = pd.DataFrame(np.vstack([home, away]), columns=["team", "runs"])
print(new_df)
team runs
0 FW1 2
1 WS3 18
2 RC1 4
3 CH1 14
4 TRO 5
5 CL1 10
6 CL1 8
7 FW1 5
8 FW1 12
9 BS1 14
10 CH1 14
11 CL1 0
12 BS1 20
13 CL1 12
14 CL1 12
15 BS1 9
16 CH1 18
17 WS3 12
18 CH1 14
19 WS3 6
20 TRO 29
21 WS3 4
Now that the teams are all in a single column, and their corresponding runs in the other.现在团队都在一个列中,而他们相应的运行在另一个列中。 We can perform a simple
groupby
operation to calculate the total number of runs per team, ignoring whether they were "home"
or "away"
我们可以执行一个简单的
groupby
操作来计算每支球队的总跑步次数,忽略他们是"home"
还是"away"
team_runs = new_df.groupby("team", as_index=False)["runs"].sum()
print(team_runs)
team runs
0 BS1 43
1 CH1 60
2 CL1 42
3 FW1 19
4 RC1 4
5 TRO 34
6 WS3 40
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.