Is there a way to go through rows in a pandas data frame more efficiently?
I have a huge pandas data frame where each row corresponds to a single sports match. It looks like the following:

**EDIT: I changed the example code to better reflect the actual data. This made me realize that the presence of values other than 'lost' or 'won' makes this a lot more difficult.**
import pandas as pd

d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
              '26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'],
     'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
     'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
     'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
     }
df = pd.DataFrame(data=d)
For each matchup I want to calculate the previous wins/losses in a new variable. In the example case, the 'prev_wins' variable would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. I did manage to write working code for this, which looks like this:
data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0
data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)
data['matchups'] = data['challenger'] + '-' + data['opponent']

# create list of matchups with unique pairings
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
    if match[::-1] in matchups:
        pass
    else:
        matchups.append(match)

prev_wins = {}
for i in matchups:
    prev_wins[i] = 0

prev_losses = {}
for i in matchups:
    prev_losses[i] = 0

# go through data set for each matchup and calculate variables
for i in range(0, len(matchups)):
    match = matchups[i].split('-')
    challenger = match[0]
    opponent = match[1]
    for index, row in data.iterrows():
        if row['challenger'] == challenger and row['opponent'] == opponent:
            if row['outcome'] == 'won':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
        elif row['challenger'] == opponent and row['opponent'] == challenger:
            if row['outcome'] == 'won':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1
The problem is that this takes incredibly long, because there are about 65,000 different matchups and the data frame has about 170,000 rows. On my laptop it would take around 180 hours to run, which is not acceptable.

I am sure there is a better solution for this, but even after searching the internet for a whole day I was not able to find one. How can I make this code faster?
IIUC, `groupby` and `cumsum`:
df['outcome'] = df.outcome.map({'win': 1, 'lost': 0})  # other values, e.g. 'declined', become NaN
Then:
df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)
Of course, you don't need to overwrite the values in `outcome` (you can create a new column and work with that instead). But pandas operations are usually much faster on `int`s than on `string`s. So from a performance point of view, it is preferable to have `1` and `0` representing wins and losses rather than the actual words `win` and `lost`.

Only in the last layer, when you are presenting the information, do you map back to human-understandable words. The processing itself doesn't usually need strings.
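A minimal sketch of that encode, process, decode round trip, assuming the sample outcomes from the question. The column names `outcome_code` and `outcome_label` and the code `-1` for `declined` are hypothetical choices made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    'outcome': ['win', 'lost', 'declined', 'win', 'declined',
                'win', 'declined', 'declined', 'lost', 'lost'],
})

# Hypothetical integer encoding; -1 for 'declined' is an assumption.
codes = {'win': 1, 'lost': 0, 'declined': -1}
labels = {code: word for word, code in codes.items()}

# Store the codes in a new column so 'outcome' is not overwritten.
df['outcome_code'] = df['outcome'].map(codes)

# Processing works on the integer column only,
# e.g. the total number of wins per challenger.
wins_per_challenger = (
    df['outcome_code'].eq(1).astype(int).groupby(df['challenger']).sum()
)

# Presentation layer: map the codes back to readable words.
df['outcome_label'] = df['outcome_code'].map(labels)
```

Giving `declined` its own code keeps the column free of `NaN`, so it stays an integer column instead of being upcast to float.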
IIUC, you can do something like this, using `shift()` to look at the previous outcomes and taking the cumulative sum of the boolean mask of where the outcome equals `win`:
data['previous_wins'] = data.groupby('challenger').outcome.transform(lambda x: x.shift().eq('win').cumsum())
>>> data
   challenger      date  opponent outcome  previous_wins
0           5  21.01.96         6     win              0
1           4  22.02.96         3    loss              0
2           5  23.02.96         6     win              1
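With tens of thousands of groups, the Python-level `lambda` is called once per group, which may become noticeable. The following is a sketch of a lambda-free variant of the same idea, run on the question's sample data. It is an alternative formulation, not the code from the answer, and it assumes that anything other than `win` (including `declined`) simply does not count as a win.

```python
import pandas as pd

data = pd.DataFrame({
    'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    'outcome': ['win', 'lost', 'declined', 'win', 'declined',
                'win', 'declined', 'declined', 'lost', 'lost'],
})

# Previous outcome of the same challenger (NaN for their first match),
# turned into a 0/1 flag for "the previous match was a win".
prev_was_win = (
    data.groupby('challenger')['outcome'].shift().eq('win').astype(int)
)

# Running total of those flags per challenger = wins before each row.
data['previous_wins'] = prev_was_win.groupby(data['challenger']).cumsum()
```

Both `shift` and `cumsum` here are built-in groupby operations, so no user function has to be evaluated per group.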
If you're looking to count how many wins a challenger had against a specific opponent, you can just group by both the challenger and the opponent:
data['previous_wins'] = data.groupby(['opponent','challenger']).outcome.transform(lambda x: x.shift().eq('win').cumsum())
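Note that grouping by `['opponent', 'challenger']` treats "5 vs 4" and "4 vs 5" as two different groups, while the loop in the question counts both directions as one matchup. A rough sketch of an order-independent version follows. It assumes that a `lost` for the challenger counts as a win for the opponent, that `declined` counts for nobody, and that a player never faces themselves. Unlike the question's loop, it fills in the counts on every row, including `declined` ones. The helper names `lo`, `hi`, `lo_won` and `hi_won` are made up for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    'outcome': ['win', 'lost', 'declined', 'win', 'declined',
                'win', 'declined', 'declined', 'lost', 'lost'],
})

# Order-independent pairing key: (smaller id, larger id).
lo = data[['challenger', 'opponent']].min(axis=1)
hi = data[['challenger', 'opponent']].max(axis=1)

is_win = data['outcome'].eq('win')
is_lost = data['outcome'].eq('lost')
chal_is_lo = data['challenger'].eq(lo)

# 0/1 flags: did the lower-id / higher-id player win this match?
lo_won = ((chal_is_lo & is_win) | (~chal_is_lo & is_lost)).astype(int)
hi_won = ((~chal_is_lo & is_win) | (chal_is_lo & is_lost)).astype(int)

# Wins of each side within the pairing, strictly before the current row.
lo_prev = lo_won.groupby([lo, hi]).cumsum() - lo_won
hi_prev = hi_won.groupby([lo, hi]).cumsum() - hi_won

# Report the counts from the current challenger's point of view.
data['prev_wins_spec_challenger'] = np.where(chal_is_lo, lo_prev, hi_prev)
data['prev_losses_spec_challenger'] = np.where(chal_is_lo, hi_prev, lo_prev)
```

The rows are assumed to be in chronological order already; if they are not, sort by date before computing the cumulative sums.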