Is there a way to go through rows in a pandas data frame more efficiently?

I have a huge pandas data frame where each row corresponds to a single sports match. It looks like the following:

**EDIT:** I'll change the example code to better reflect the actual data. This made me realize that the presence of values other than 'lost' or 'won' makes this a lot more difficult.

import pandas as pd

d = {'date': ['21.01.96', '22.02.96', '23.02.96', '24.02.96', '25.02.96',
          '26.02.96', '27.02.96', '28.02.96', '29.02.96', '30.02.96'],
     'challenger': [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
     'opponent': [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
     'outcome': ['win', 'lost', 'declined', 'win', 'declined', 'win', 'declined', 'declined', 'lost', 'lost']
     }
df = pd.DataFrame(data=d)

For each matchup I want to calculate the previous wins/losses in a new variable. In the example case, the 'prev_wins' variable would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. I did manage to create working code for this, which looks like this:

data['prev_wins_spec_challenger'] = 0
data['prev_losses_spec_challenger'] = 0               

data['challenger'] = data['challenger'].astype(str)
data['opponent'] = data['opponent'].astype(str)

data['matchups'] = data['challenger'] + '-' + data['opponent']

# create list of matchups with unique pairings
matchups_temp = list(data['matchups'].unique())
matchups = []
for match in matchups_temp:
    if match[::-1] in matchups:
        pass
    else:
        matchups.append(match)

prev_wins = {}
for i in matchups:
    prev_wins[i] = 0

prev_losses = {}
for i in matchups:
    prev_losses[i] = 0

# go through data set for each matchup and calculate variables
for i in range(0, len(matchups)):
    match = matchups[i].split('-')
    challenger = match[0]
    opponent = match[1]
    for index, row in data.iterrows():
        if row['challenger'] == challenger and row['opponent'] == opponent:
            if row['outcome'] == 'won':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
        elif row['challenger'] == opponent and row['opponent'] == challenger:
            if row['outcome'] == 'won':
                data['prev_losses_spec_challenger'][index] = prev_losses[matchups[i]]
                prev_losses[matchups[i]] += 1
            elif row['outcome'] == 'lost':
                data['prev_wins_spec_challenger'][index] = prev_wins[matchups[i]]
                prev_wins[matchups[i]] += 1

The problem with this is that it takes incredibly long, because there are a total of ~65,000 different matchups and the data frame has ~170,000 rows. On my laptop this would take around 180 hours to run, which is not acceptable.

I am sure there is a better solution for this, but even after searching the internet the whole day I was not able to find one. How can I make this code faster?

IIUC, you can use groupby and cumsum:

df['outcome'] = df.outcome.map({'win':1, 'loss':0})

Then

df.groupby('challenger').outcome.cumsum().sub(1).clip(lower=0)

Of course, you don't need to overwrite the values in outcome (you can create a new column and work with it). But pandas operations are usually much faster on ints than on strings. So from a performance point of view, it is preferable to have 1 and 0 representing wins and losses rather than the actual words win and loss.

Only in the last layer, when you are presenting the information, do you map back to human-understandable words. The processing itself doesn't usually need strings.
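A minimal sketch of the integer-encoding idea, using the first six rows of the question's sample data. The column names is_win, wins_so_far and prev_wins are assumptions made for illustration. Instead of subtracting a constant 1, this variant subtracts the row's own result, so rows that are not wins still report the wins that came before them, and it treats both 'lost' and 'declined' as 0 rather than letting a partial map() turn them into NaN.

```python
import pandas as pd

# First six rows of the sample data from the question.
df = pd.DataFrame({
    "challenger": [5, 5, 10, 5, 4, 5],
    "opponent": [2, 4, 5, 4, 5, 10],
    "outcome": ["win", "lost", "declined", "win", "declined", "win"],
})

# Assumption: keep the original strings and add a separate 0/1 column.
# Anything that is not 'win' ('lost', 'declined') becomes 0.
df["is_win"] = df["outcome"].eq("win").astype(int)

# Running total of wins per challenger, including the current row.
df["wins_so_far"] = df.groupby("challenger")["is_win"].cumsum()

# Wins strictly before the current row: remove the row's own contribution.
df["prev_wins"] = df["wins_so_far"] - df["is_win"]

print(df[["challenger", "outcome", "wins_so_far", "prev_wins"]])
```

This assumes the rows are already sorted chronologically; if they are not, the frame would need to be sorted by a parsed date column first.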

IIUC, you can do something like this, using shift() to look at the previous outcomes and taking the cumulative sum of the boolean mask that marks where the outcome equals win:

data['previous_wins'] = data.groupby('challenger').outcome.transform(lambda x: x.shift().eq('win').cumsum())

>>> data
   challenger      date  opponent outcome  previous_wins
0           5  21.01.96         6     win              0
1           4  22.02.96         3    loss              0
2           5  23.02.96         6     win              1
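As a rough check, the same shift/eq/cumsum pattern can be applied to the ten-row sample from the question, which uses 'lost' and 'declined' in addition to 'win'. The previous_losses column is an assumption added for illustration, and the sketch assumes the rows are already in chronological order.

```python
import pandas as pd

# Sample data from the question (date column omitted).
data = pd.DataFrame({
    "challenger": [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    "opponent": [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    "outcome": ["win", "lost", "declined", "win", "declined",
                "win", "declined", "declined", "lost", "lost"],
})

grouped = data.groupby("challenger")["outcome"]

# shift() moves each outcome down one row within its group, so every row
# only sees earlier results; eq() builds a boolean mask and cumsum() counts it.
data["previous_wins"] = grouped.transform(
    lambda s: s.shift().eq("win").cumsum()
).astype(int)

# Same idea for losses; 'declined' rows match neither mask and add nothing.
data["previous_losses"] = grouped.transform(
    lambda s: s.shift().eq("lost").cumsum()
).astype(int)

print(data)
```

Because 'declined' is simply a value that equals neither 'win' nor 'lost', the extra outcome category needs no special handling here.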

If you're looking to count how many wins a challenger had against a specific opponent, you can just group by both the challenger and the opponent:

data['previous_wins'] = data.groupby(['opponent','challenger']).outcome.transform(lambda x: x.shift().eq('win').cumsum())
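The question's loop also treats a pairing as the same matchup when the two players swap roles, which grouping by ['opponent', 'challenger'] does not do. One possible way to cover that case, sketched below under stated assumptions, is to group on an order-independent key and count wins per side. The helper names (lo, hi, winner, lo_prev, hi_prev) are made up for this sketch, -1 is assumed never to be a real player id, and 'lost' is read as a win for the opponent.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "challenger": [5, 5, 10, 5, 4, 5, 8, 8, 10, 8],
    "opponent": [2, 4, 5, 4, 5, 10, 5, 2, 4, 10],
    "outcome": ["win", "lost", "declined", "win", "declined",
                "win", "declined", "declined", "lost", "lost"],
})

# Order-independent pairing key: (smaller id, larger id).
lo = data[["challenger", "opponent"]].min(axis=1)
hi = data[["challenger", "opponent"]].max(axis=1)

# Winner of each row; 'declined' rows get the placeholder -1 (no winner).
winner = np.select(
    [data["outcome"].eq("win"), data["outcome"].eq("lost")],
    [data["challenger"], data["opponent"]],
    default=-1,
)

# 0/1 flags: did the smaller-id or the larger-id player win this row?
lo_won = pd.Series(winner == lo.to_numpy(), index=data.index).astype(int)
hi_won = pd.Series(winner == hi.to_numpy(), index=data.index).astype(int)

# Wins by each side in earlier rows of the same pairing
# (running total minus the current row's own contribution).
lo_prev = lo_won.groupby([lo, hi]).cumsum() - lo_won
hi_prev = hi_won.groupby([lo, hi]).cumsum() - hi_won

# Report the counts from the point of view of the current challenger.
challenger_is_lo = data["challenger"].eq(lo)
data["prev_wins"] = lo_prev.where(challenger_is_lo, hi_prev)
data["prev_losses"] = hi_prev.where(challenger_is_lo, lo_prev)

print(data)
```

Everything here is a column-wise operation, so no Python-level loop over rows or matchups is needed. Whether these semantics match what the original loop intended is an assumption worth checking against a few hand-computed rows.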
