Is there a way to go through the rows of a pandas data frame more efficiently?

Question

I have a huge pandas data frame where each row corresponds to a single sports match. It looks like the following:

**EDIT:** I've changed the example code to better reflect the actual data. This made me realize that the presence of values other than 'lost' or 'won' makes this a lot more difficult.

import pandas as pd

d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
          '26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'],
     'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
     'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
     'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
     }
df = pd.DataFrame(data=d)

For each matchup I want to calculate the previous wins/losses in a new variable. In the example case, the 'prev_wins' variable would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. I did manage to write working code for this, which looks like this:

data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0               

data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)

data['matchups'] = data['challenger'] + '-' + data['opponent']

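# create list of matchups with unique pairings (A-B and B-A count as one)
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
    # swap the two IDs; match[::-1] would also reverse the digits of multi-digit IDs
    reversed_match = '-'.join(match.split('-')[::-1])
    if reversed_match not in matchups:
        matchups.append(match)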

prev_wins = {}
for i in matchups:
    prev_wins[i] = 0

prev_losses = {}
for i in matchups:
    prev_losses[i] = 0

# go through data set for each matchup and calculate variables
for i in range(0, len(matchups)):
    match = matchups[i].split('-')
    challenger = match[0]
    opponent = match[1]
    for index, row in data.iterrows():
        # use .loc for assignment; chained indexing (data[col][index] = ...) may write to a copy
        if row['challenger'] == challenger and row['opponent'] == opponent:
            if row['outcome'] == 'won':
                data.loc[index, 'prev_wins_spec_challenger'] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data.loc[index, 'prev_losses_spec_challenger'] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
        elif row['challenger'] == opponent and row['opponent'] == challenger:
            if row['outcome'] == 'won':
                data.loc[index, 'prev_losses_spec_challenger'] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data.loc[index, 'prev_wins_spec_challenger'] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1

The problem with this is that it takes incredibly long, because there are a total of ~65,000 different matchups and the data frame has ~170,000 rows. On my laptop this would take around 180 hours to run, which is not acceptable.

I am sure there is a better solution for this, but even after searching the internet for a whole day I was not able to find one. How can I make this code faster?

Answer 1

IIUC, you can use groupby and cumsum:

df['outcome'] = df.outcome.map({'win':1, 'loss':0})

Then

df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)

Of course, you don't need to overwrite the values in outcome (you can create a new column and work with that instead). But pandas operations are usually much faster on ints than on strings, so from a performance point of view it is preferable to have 1 and 0 represent wins and losses rather than the actual words win and loss.

Map back to human-readable words only in the last layer, when you are presenting the information. The processing itself doesn't usually need strings.
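As a rough sketch of how the integer-indicator idea could be applied to the question's edited sample (where outcome is 'win', 'lost', or 'declined'), the snippet below builds 0/1 indicator Series instead of overwriting outcome, then subtracts the current row from the per-challenger running total so that each row only counts earlier matches. The names is_win, is_loss, prev_wins, and prev_losses are illustrative assumptions, not part of the answer above, and the rows are assumed to already be in chronological order.

```python
import pandas as pd

df = pd.DataFrame({
    "challenger": [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    "opponent": [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    "outcome": ["win", "lost", "declined", "win", "declined",
                "win", "declined", "declined", "lost", "lost"],
})

# 0/1 indicators; any other value (e.g. 'declined') is 0 in both.
is_win = df["outcome"].eq("win").astype(int)
is_loss = df["outcome"].eq("lost").astype(int)

# Running total per challenger, minus the current row, so that only
# earlier matches of the same challenger are counted.
df["prev_wins"] = is_win.groupby(df["challenger"]).cumsum() - is_win
df["prev_losses"] = is_loss.groupby(df["challenger"]).cumsum() - is_loss

print(df)
```

On this sample, challenger 5 appears in rows 0, 1, 3, and 5 with outcomes win, lost, win, win, so those rows get prev_wins of 0, 1, 1, 2. Note that this groups by challenger only, so it counts wins against any opponent.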

Answer 2

IIUC, you can do something like this, using shift() to look at the previous outcomes and taking the cumulative sum of the boolean mask of where they are equal to win:

data['previous_wins'] = data.groupby('challenger').outcome.transform(lambda x: x.shift().eq('win').cumsum())

>>> data
   challenger      date  opponent outcome  previous_wins
0           5  21.01.96         6     win              0
1           4  22.02.96         3    loss              0
2           5  23.02.96         6     win              1

If you're looking to count how many wins a challenger had against a specific opponent, you can just group by both the challenger and the opponent:

data['previous_wins'] = data.groupby(['opponent','challenger']).outcome.transform(lambda x: x.shift().eq('win').cumsum())
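Grouping by ['opponent', 'challenger'] treats 5 vs 4 and 4 vs 5 as different groups, whereas the loop in the question treats them as the same matchup. The following is a minimal sketch of one way to handle that without iterrows, under some assumptions that are mine rather than the question's: every row gets a value, counts are taken from the point of view of the current row's challenger, IDs are numeric, a player never faces themselves, and rows are already in chronological order. All helper names (lo, hi, chal_is_lo, lo_won, hi_won) are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "challenger": [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    "opponent": [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    "outcome": ["win", "lost", "declined", "win", "declined",
                "win", "declined", "declined", "lost", "lost"],
})

# Order-independent pair key: (smaller ID, larger ID).
pair = df[["challenger", "opponent"]]
lo, hi = pair.min(axis=1), pair.max(axis=1)
chal_is_lo = df["challenger"].eq(lo)

won, lost = df["outcome"].eq("win"), df["outcome"].eq("lost")
# 1 where the lower-ID (resp. higher-ID) player won the match, else 0.
lo_won = ((won & chal_is_lo) | (lost & ~chal_is_lo)).astype(int)
hi_won = ((won & ~chal_is_lo) | (lost & chal_is_lo)).astype(int)

# Running totals per pair, excluding the current row.
lo_prev = lo_won.groupby([lo, hi]).cumsum() - lo_won
hi_prev = hi_won.groupby([lo, hi]).cumsum() - hi_won

# Pick the totals that belong to the current row's challenger.
df["prev_wins"] = np.where(chal_is_lo, lo_prev, hi_prev)
df["prev_losses"] = np.where(chal_is_lo, hi_prev, lo_prev)

print(df)
```

Under these assumptions, row 4 (4 challenging 5) gets prev_wins 1 and prev_losses 1, because 4 beat 5 in row 1 and lost to 5 in row 3. Everything here is a column-wise operation, so the cost grows with the number of rows rather than with rows times matchups.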

Answer 1: score 2, accepted, 2018-12-07 20:10:07
Answer 2: score 0, 2018-12-07 19:56:30
