
Efficiently finding consecutive streaks in a pandas DataFrame column?

I have a DataFrame similar to the one below, and I want to add a Streak column to it (see the example below):

Date         Home_Team    Away_Team    Winner      Streak

2005-08-06       A            G           A           0
2005-08-06       B            H           H           0
2005-08-06       C            I           C           0
2005-08-06       D            J           J           0
2005-08-06       E            K           K           0
2005-08-06       F            L           F           0
2005-08-13       A            B           A           1           
2005-08-13       C            D           D           1           
2005-08-13       E            F           F           0        
2005-08-13       G            H           H           0
2005-08-13       I            J           J           0
2005-08-13       K            L           K           1
2005-08-20       B            C           B           0
2005-08-20       A            D           A           2
2005-08-20       G            K           K           0
2005-08-20       I            E           E           0
2005-08-20       F            H           F           2
2005-08-20       J            L           J           2
2005-08-27       A            H           A           3
2005-08-27       B            F           B           1
2005-08-27       J            C           C           3           
2005-08-27       D            E           D           0
2005-08-27       I            K           K           0
2005-08-27       L            G           G           0
2005-09-05       B            A           A           2
2005-09-05       D            C           D           1
2005-09-05       F            E           F           0
2005-09-05       H            G           H           0
2005-09-05       J            I           I           0
2005-09-05       K            L           K           4

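The DataFrame has roughly 200,000 rows, spanning 2005 to 2020.

Now, what I am trying to do is find the number of consecutive games the home team has won prior to the date in the Date column of the DataFrame. I have a solution, but it is too slow, see below: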

df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak
        return streak # won every previous match, so the loop never hit a loss

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)

How can I speed this up?

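I'm going to present a numpy-based solution here. Firstly because I'm not very familiar with pandas and don't feel like doing the research, and secondly because a numpy solution should work just fine regardless.

Let's first look at what happens for one given team. Your goal is to find the number of consecutive wins for a team based on the sequence of games it took part in. I will drop the date column and turn your data into a numpy array for starters: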

import numpy as np

x = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
    ['C', 'D', 'D'],
    ['E', 'F', 'F'],
    ['G', 'H', 'H'],
    ['I', 'J', 'J'],
    ['K', 'L', 'K'],
    ['B', 'C', 'B'],
    ['A', 'D', 'A'],
    ['G', 'K', 'K'],
    ['I', 'E', 'E'],
    ['F', 'H', 'F'],
    ['J', 'L', 'J']])

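You don't need the date because you only care about who played, even if they played multiple times in one day. So let's take a look at team A: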

A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'

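A_played is an index array holding the row indices in x of the games A took part in. A_won is a boolean mask with as many elements as A_played, i.e. the number of games A participated in.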
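To make the two arrays concrete, here is a small sketch on the first seven games of the sample. The name `games` is only an illustrative stand-in for `x`.

```python
import numpy as np

# First seven games of the sample: home team, away team, winner.
games = np.array([
    ['A', 'G', 'A'],
    ['B', 'H', 'H'],
    ['C', 'I', 'C'],
    ['D', 'J', 'J'],
    ['E', 'K', 'K'],
    ['F', 'L', 'F'],
    ['A', 'B', 'A'],
])

# Row indices of the games in which A was either the home or the away team.
A_played = np.flatnonzero((games[:, :2] == 'A').any(axis=1))
# One boolean per game A played: did A win it?
A_won = games[A_played, -1] == 'A'

print(A_played.tolist())  # [0, 6]
print(A_won.tolist())     # [True, True]
```

Both arrays have one entry per game A played, which is what lets `A_won` be indexed back into the full table through `A_played`.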

Finding the sizes of the streaks is a fairly well-hashed-out problem:

streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]
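As an illustration, the one-liner can be unpacked step by step on a made-up win/loss record. The array `won` and the intermediate names are hypothetical.

```python
import numpy as np

# Hypothetical win/loss record for one team, in playing order.
won = np.array([True, True, False, True, False, False, True, True, True])

# Pad with False so every run of wins has both a start and an end switch.
padded = np.r_[False, won, False]
# For boolean input, np.diff marks the positions where the value changes.
switches = np.flatnonzero(np.diff(padded))
# Switches alternate start, end, start, end, ... so every other gap is a run of wins.
streaks = np.diff(switches)[::2]

print(streaks.tolist())  # [2, 1, 3]
```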

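You compute the differences between each pair of indices where the value of the mask switches. The extra padding with False ensures that you know which way the mask is switching. What you are looking for is based on this computation but needs a bit more detail, since you want the cumulative sum, but reset after each run. You can do that by setting the value of the data to the negated run length immediately after the run: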

wins = np.r_[0, A_won, 0]  # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks

Now you have an array that can be trimmed, and whose cumulative sum can be assigned directly to the output columns:

streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)

# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]

# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]
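The reset trick can be checked in isolation. The sketch below wraps it in a helper and compares it with a plain loop. Both function names are made up for this example, and the input is assumed to be one team's games in playing order.

```python
import numpy as np

def prior_streaks(won):
    """Winning streak a team carries into each of its games (games in order)."""
    wins = np.r_[0, won, 0]                       # padded, int dtype
    switch = np.flatnonzero(np.diff(wins)) + 1    # run starts and one-past-run ends
    wins[switch[1::2]] = -np.diff(switch)[::2]    # cancel each run right after it ends
    return np.cumsum(wins[:-2])                   # the leading 0 shifts everything by one game

def prior_streaks_loop(won):
    """Plain-Python reference, used only to check the vectorized version."""
    out, streak = [], 0
    for w in won:
        out.append(streak)
        streak = streak + 1 if w else 0
    return out

won = np.array([True, True, False, True, False, False, True, True, True])
print(prior_streaks(won).tolist())  # [0, 1, 2, 0, 1, 0, 0, 1, 2]
```

Because of the leading padding element, position i of the result only reflects games before game i, which is the "streak going into the game" that the question asks for.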

Now you can loop over all the teams (which should be a fairly small number compared to the total number of games):

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])

    home_mask = data[played, 0] == team
    away_mask = ~home_mask

    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.zeros((x.shape[0], 2), dtype=int)

# np.unique(x[:, :2]) also picks up teams that only ever play away (H and L
# in the sample above); it copies the data, as x[:, :2].ravel() would.
# If every team has been the home team at least once, set(x[:, 0]) is enough.
for team in np.unique(x[:, :2]):
    process_team(x, team, output)
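To connect this back to the DataFrame, a minimal sketch might look as follows. It assumes the rows are already sorted by date and that the columns are named as in the question. The helper name `home_streaks` and the compact encoding of the sample are choices made for this example.

```python
import numpy as np
import pandas as pd

def home_streaks(df):
    """Home team's winning streak going into each match (rows sorted by date)."""
    data = df[["Home_Team", "Away_Team", "Winner"]].to_numpy()
    out = np.zeros(len(data), dtype=int)
    for team in np.unique(data[:, :2]):
        played = np.flatnonzero((data[:, :2] == team).any(axis=1))
        wins = np.r_[0, data[played, -1] == team, 0]
        switch = np.flatnonzero(np.diff(wins)) + 1
        wins[switch[1::2]] = -np.diff(switch)[::2]   # reset after every run of wins
        streak_counts = np.cumsum(wins[:-2])
        home = data[played, 0] == team
        out[played[home]] = streak_counts[home]      # keep only the home-side values
    return out

# The 30 sample matches from the question, encoded as home/away/winner letters.
rows = ("AGA BHH CIC DJJ EKK FLF ABA CDD EFF GHH IJJ KLK BCB ADA GKK IEE FHF JLJ "
        "AHA BFB JCC DED IKK LGG BAA DCD FEF HGH JII KLK").split()
dates = np.repeat(["2005-08-06", "2005-08-13", "2005-08-20", "2005-08-27", "2005-09-05"], 6)
df = pd.DataFrame([list(r) for r in rows], columns=["Home_Team", "Away_Team", "Winner"])
df.insert(0, "Date", pd.to_datetime(dates))

df["Streak"] = home_streaks(df)
print(df["Streak"].tolist())
```

On the sample this reproduces the Streak column shown in the question.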

An elegant way:

new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
print(new_df)

variable  Streak_Away_Team  Streak_Home_Team
index                                       
0                      0.0               0.0
1                      0.0               0.0
2                      0.0               0.0
3                      0.0               0.0
4                      0.0               0.0
5                      0.0               0.0
6                      0.0               1.0
7                      0.0               1.0
8                      1.0               0.0
9                      1.0               0.0
10                     1.0               0.0
11                     0.0               1.0
12                     1.0               0.0
13                     1.0               2.0
14                     2.0               0.0
15                     0.0               0.0
16                     2.0               2.0
17                     0.0               2.0

#new_df = df.assign(**new_df) #you could use join or assign 
new_df = df.join(new_df) 
print(new_df)



          Date Home_Team Away_Team Winner  Streak_Away_Team  Streak_Home_Team
0   2005-08-06         A         G      A               0.0               0.0
1   2005-08-06         B         H      H               0.0               0.0
2   2005-08-06         C         I      C               0.0               0.0
3   2005-08-06         D         J      J               0.0               0.0
4   2005-08-06         E         K      K               0.0               0.0
5   2005-08-06         F         L      F               0.0               0.0
6   2005-08-13         A         B      A               0.0               1.0
7   2005-08-13         C         D      D               0.0               1.0
8   2005-08-13         E         F      F               1.0               0.0
9   2005-08-13         G         H      H               1.0               0.0
10  2005-08-13         I         J      J               1.0               0.0
11  2005-08-13         K         L      K               0.0               1.0
12  2005-08-20         B         C      B               1.0               0.0
13  2005-08-20         A         D      A               1.0               2.0
14  2005-08-20         G         K      K               2.0               0.0
15  2005-08-20         I         E      E               0.0               0.0
16  2005-08-20         F         H      F               2.0               2.0
17  2005-08-20         J         L      J               0.0               2.0

It is assumed that a team plays no more than once per day.
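The `cum_wins` column above is a running total per team, so it keeps growing after a loss. If the count should restart at 0 after every loss, one possible variation starts a new block at each loss and counts wins inside each block. This is a sketch rather than the answer's original code, and the column names `side`, `team`, `block`, `streak_after` and `streak_before` are made up for illustration.

```python
import numpy as np
import pandas as pd

# The 30 sample matches from the question, encoded as home/away/winner letters.
rows = ("AGA BHH CIC DJJ EKK FLF ABA CDD EFF GHH IJJ KLK BCB ADA GKK IEE FHF JLJ "
        "AHA BFB JCC DED IKK LGG BAA DCD FEF HGH JII KLK").split()
dates = np.repeat(["2005-08-06", "2005-08-13", "2005-08-20", "2005-08-27", "2005-09-05"], 6)
df = pd.DataFrame([list(r) for r in rows], columns=["Home_Team", "Away_Team", "Winner"])
df.insert(0, "Date", pd.to_datetime(dates))

# One row per (match, team), ordered by date.
long = (df.reset_index()
          .melt(id_vars=["index", "Date", "Winner"], var_name="side", value_name="team")
          .sort_values(["Date", "index"]))
long["win"] = long["team"].eq(long["Winner"]).astype(int)
# A team's block number goes up by one every time it loses.
long["block"] = (1 - long["win"]).groupby(long["team"]).cumsum()
# Wins so far inside the current block = streak after this match.
long["streak_after"] = long.groupby(["team", "block"])["win"].cumsum()
# Shift by one match per team to get the streak going into the match.
long["streak_before"] = long.groupby("team")["streak_after"].shift(fill_value=0)

wide = (long.pivot(index="index", columns="side", values="streak_before")
            .add_prefix("Streak_"))
result = df.join(wide)
print(result["Streak_Home_Team"].tolist())
```

On the sample, `Streak_Home_Team` reproduces the Streak column from the question. Like the code above, it assumes a team plays at most once per day.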

%%timeit
df["Streak"] = 0
def home_streak(x): # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    
    # all previous matches for the home team 
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] <  date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0: # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team: # lost the last match
        return 0
    else: # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else: # they lost, return the streak
                return streak
        return streak # won every previous match, so the loop never hit a loss

df["Streak"] = df.apply(lambda x: home_streak(x), axis = 1)

66.2 ms ± 9.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit

new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
new_df=df.assign(**new_df)

29.5 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

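I couldn't think of a pure pandas solution, but you can use ngroup to assign a group number to each date, and then use a defaultdict to build up the groups so that you can look up the cumulative results: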

from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))

df["group"] = df.groupby("Date").ngroup()

for a, b in zip(df["Winner"], df["group"]):
    d[b][a] = 1+d.get(b-1,{}).get(a, 0)

df["Streak"] = [d.get(y-1, {}).get(x, 0) for x, y in zip(df["Home_Team"], df["group"])]

print(df.drop(columns="group"))

          Date Home_Team Away_Team Winner  Streak
0   2005-08-06         A         G      A       0
1   2005-08-06         B         H      H       0
2   2005-08-06         C         I      C       0
3   2005-08-06         D         J      J       0
4   2005-08-06         E         K      K       0
5   2005-08-06         F         L      F       0
6   2005-08-13         A         B      A       1
7   2005-08-13         C         D      D       1
8   2005-08-13         E         F      F       0
9   2005-08-13         G         H      H       0
10  2005-08-13         I         J      J       0
11  2005-08-13         K         L      K       1
12  2005-08-20         B         C      B       0
13  2005-08-20         A         D      A       2
14  2005-08-20         G         K      K       0
15  2005-08-20         I         E      E       0
16  2005-08-20         F         H      F       2
17  2005-08-20         J         L      J       2
18  2005-08-27         A         H      A       3
19  2005-08-27         B         F      B       1
20  2005-08-27         J         C      C       3
21  2005-08-27         D         E      D       0
22  2005-08-27         I         K      K       0
23  2005-08-27         L         G      G       0
24  2005-09-05         B         A      A       2
25  2005-09-05         D         C      D       1
26  2005-09-05         F         E      F       0
27  2005-09-05         H         G      H       0
28  2005-09-05         J         I      I       0
29  2005-09-05         K         L      K       4
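The lookup above consults only the immediately preceding date group, so a team that sits out a date starts again from 0. If a skipped date should not break a streak, a possible variation keeps one running streak per team in a plain dict. This sketch assumes the rows are sorted by date and that a team plays at most once per date. The schedule below is hypothetical, and `current` is just an illustrative name.

```python
import pandas as pd

# Hypothetical schedule in which team A sits out the second date.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2005-08-06", "2005-08-06", "2005-08-13",
                            "2005-08-20", "2005-08-20"]),
    "Home_Team": ["A", "C", "C", "A", "B"],
    "Away_Team": ["B", "D", "D", "C", "D"],
    "Winner":    ["A", "C", "C", "A", "D"],
})

current = {}   # team -> winning streak after its most recent game
streaks = []
for home, away, winner in zip(df["Home_Team"], df["Away_Team"], df["Winner"]):
    streaks.append(current.get(home, 0))   # streak the home team carries into this match
    for team in (home, away):
        # The winner extends its streak, the loser drops back to 0.
        current[team] = current.get(team, 0) + 1 if team == winner else 0

df["Streak"] = streaks
print(df["Streak"].tolist())  # [0, 0, 1, 1, 0]
```

This is a single pass over the rows, so the cost grows linearly with the number of matches.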

Being fixed!

This is probably the simplest way to do it -

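import numpy as np

def get_streak(l, m, n):
    # Running total of team n's wins, shifted one match later so that each
    # position only counts wins from earlier matches.
    wins = np.roll(np.cumsum([1 if i == n else 0 for i in l]), 1)
    wins[0] = 0
    # 1 where team n is the home team, 0 elsewhere.
    filts = np.array([1 if i == n else 0 for i in m])
    mul = np.multiply(wins, filts)
    return mul


l = list(df['Winner'])
m = list(df['Home_Team'])
streaks = np.zeros(len(df), dtype=int)  # one slot per match

for i in df['Winner'].unique():
    streaks += get_streak(l, m, i)

df['streaks'] = streaks
print(df)
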
          Date Home_Team Away_Team Winner  streaks
0   2005-08-06         A         G      A        0
1   2005-08-06         B         H      H        0
2   2005-08-06         C         I      C        0
3   2005-08-06         D         J      J        0
4   2005-08-06         E         K      K        0
5   2005-08-06         F         L      F        0
6   2005-08-13         A         B      A        1
7   2005-08-13         C         D      D        1
8   2005-08-13         E         F      F        0
9   2005-08-13         G         H      H        0
10  2005-08-13         I         J      J        0
11  2005-08-13         K         L      K        1
12  2005-08-20         B         C      B        0
13  2005-08-20         A         D      A        2
14  2005-08-20         G         K      K        0
15  2005-08-20         I         E      E        0
16  2005-08-20         F         H      F        2
17  2005-08-20         J         L      J        2
18  2005-08-27         A         H      A        3
19  2005-08-27         B         F      B        1
20  2005-08-27         J         C      C        3
21  2005-08-27         D         E      D        1
22  2005-08-27         I         K      K        0
23  2005-08-27         L         G      G        0
24  2005-09-05         B         A      A        2
25  2005-09-05         D         C      D        2
26  2005-09-05         F         E      F        3
27  2005-09-05         H         G      H        2
28  2005-09-05         J         I      I        3
29  2005-09-05         K         L      K        4

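It's pretty simple -

  1. You take the cumulative sum of wins for a given team and shift it by 1.
  2. Then you multiply that by the instances where they are the home team. Save the result in a vector called streak.
  3. You iterate over all the unique winning teams and sum up their streak vectors.
  4. Done!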
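Steps 1 and 2 can also be written with pandas operations instead of list comprehensions. The sketch below uses a hypothetical five-match schedule, and the names `wins_before`, `at_home` and `result` are only illustrative. Note that a running total keeps counting across losses, so it equals the current streak only while the team has not lost.

```python
import pandas as pd

# Hypothetical mini-schedule; team A loses its second game.
df = pd.DataFrame({
    "Home_Team": ["A", "B", "A", "C", "A"],
    "Away_Team": ["B", "A", "C", "B", "B"],
    "Winner":    ["A", "B", "A", "C", "A"],
})

team = "A"
# Step 1: running total of the team's wins, shifted so each row sees earlier rows only.
wins_before = df["Winner"].eq(team).cumsum().shift(fill_value=0)
# Step 2: keep the value only on rows where the team is at home.
at_home = df["Home_Team"].eq(team).astype(int)
result = wins_before * at_home

print(result.tolist())  # [0, 0, 1, 0, 2]
```

Here A lost the second match, so its streak going into the third match is 0 while the running total reports 1.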

A few print statements give a more intuitive view of how the function works -

def get_streak(l,m,n):
    wins = np.roll(np.cumsum([1 if i==n else 0 for i in l]),1)
    wins[0]=0
    print('wins:',wins)
    filts = np.array([1 if i==n else 0 for i in m])
    print('home:',filts)
    mul = np.multiply(wins, filts)
    print('strk:', mul)
    return mul

streak_A = get_streak(l,m,'A')
wins: [0 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5]
home: [1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
strk: [0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0]

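The element-wise sum of all the streak vectors is what you are looking for.

Benchmark (seems to be the fastest among all the other answers) -

529 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)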
