Efficiently finding consecutive streaks in a pandas DataFrame column?
I have a DataFrame similar to the one below, and I want to add a Streak column to it (see the example below):
Date Home_Team Away_Team Winner Streak
2005-08-06 A G A 0
2005-08-06 B H H 0
2005-08-06 C I C 0
2005-08-06 D J J 0
2005-08-06 E K K 0
2005-08-06 F L F 0
2005-08-13 A B A 1
2005-08-13 C D D 1
2005-08-13 E F F 0
2005-08-13 G H H 0
2005-08-13 I J J 0
2005-08-13 K L K 1
2005-08-20 B C B 0
2005-08-20 A D A 2
2005-08-20 G K K 0
2005-08-20 I E E 0
2005-08-20 F H F 2
2005-08-20 J L J 2
2005-08-27 A H A 3
2005-08-27 B F B 1
2005-08-27 J C C 3
2005-08-27 D E D 0
2005-08-27 I K K 0
2005-08-27 L G G 0
2005-09-05 B A A 2
2005-09-05 D C D 1
2005-09-05 F E F 0
2005-09-05 H G H 0
2005-09-05 J I I 0
2005-09-05 K L K 4
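The DataFrame has roughly 200,000 rows, covering 2005 to 2020.
What I am trying to do is find the number of games the home team has won in a row prior to the date in the Date column of the DataFrame. I have a solution, but it is too slow, see below: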
df["Streak"] = 0

def home_streak(x):  # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    # all previous matches for the home team
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] < date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0:  # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team:  # lost the last match
        return 0
    else:  # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else:  # they lost, return the streak
                return streak
        return streak  # the team never lost, so every previous match counts

df["Streak"] = df.apply(lambda x: home_streak(x), axis=1)
How can I speed this up?
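I am going to present a numpy-based solution here. First, because I am not very familiar with pandas and don't feel like doing the research, and second, because a numpy solution should work just fine regardless.
Let's first look at what happens for a single team. Your goal is to find the number of consecutive wins for a team, based on the sequence of games it took part in. For starters, I will drop the date column and turn your data into a numpy array: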
import numpy as np

x = np.array([
['A', 'G', 'A'],
['B', 'H', 'H'],
['C', 'I', 'C'],
['D', 'J', 'J'],
['E', 'K', 'K'],
['F', 'L', 'F'],
['A', 'B', 'A'],
['C', 'D', 'D'],
['E', 'F', 'F'],
['G', 'H', 'H'],
['I', 'J', 'J'],
['K', 'L', 'K'],
['B', 'C', 'B'],
['A', 'D', 'A'],
['G', 'K', 'K'],
['I', 'E', 'E'],
['F', 'H', 'F'],
['J', 'L', 'J']])
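You don't need the date, because all you care about is who played, even if a team played several times in one day. So let's take a look at team A: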
A_played = np.flatnonzero((x[:, :2] == 'A').any(axis=1))
A_won = x[A_played, -1] == 'A'
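A_played is an index array with one entry per game that A took part in (the matching row numbers of x). A_won is a boolean mask with the same number of elements as A_played, i.e. the number of games A took part in; it is True where A won.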
Finding the sizes of the streaks is a fairly well-studied problem:
streaks = np.diff(np.flatnonzero(np.diff(np.r_[False, A_won, False])))[::2]
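You compute the difference between each pair of indices at which the mask switches value. The extra padding with False ensures that you know which way the mask is switching. What you are looking for is based on this computation but needs a little more detail: you want a cumulative sum, but one that resets after each run. You can get that by setting the value right after each run to the negated run length: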
wins = np.r_[0, A_won, 0] # Notice the int dtype here
switch_indices = np.flatnonzero(np.diff(wins)) + 1
streaks = np.diff(switch_indices)[::2]
wins[switch_indices[1::2]] = -streaks
Now you have an array that you can trim, and whose cumulative sum can be assigned directly to the output columns:
streak_counts = np.cumsum(wins[:-2])
output = np.zeros((x.shape[0], 2), dtype=int)
# Home streak
home_mask = x[A_played, 0] == 'A'
output[A_played[home_mask], 0] = streak_counts[home_mask]
# Away streak
away_mask = ~home_mask
output[A_played[away_mask], 1] = streak_counts[away_mask]
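To make the reset trick concrete, here is a small sketch that runs the same steps on a made-up win/loss sequence. The function name streak_before_each_game and the sample mask are illustrative only and are not part of the answer above.

```python
import numpy as np

def streak_before_each_game(won):
    """Win streak a team carries INTO each game, given its results in order."""
    won = np.asarray(won, dtype=bool)
    wins = np.r_[0, won, 0]                       # padded, int dtype
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]        # length of each run of wins
    wins[switch_indices[1::2]] = -streaks         # cancel the run right after it ends
    return np.cumsum(wins[:-2])                   # shifted by one game via the left pad

# won, won, lost, won
result = streak_before_each_game([True, True, False, True])
print(result)  # [0 1 2 0]
```

The team enters its third game on a streak of 2, loses it, and so enters the fourth game with a streak of 0.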
Now you can loop over all the teams (which should be a fairly small number compared to the total number of games):
def process_team(data, team, output):
    # rows in which the team took part, and whether it won each of them
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])
    home_mask = data[played, 0] == team
    away_mask = ~home_mask
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[away_mask], 1] = streak_counts[away_mask]

output = np.zeros((x.shape[0], 2), dtype=int)
# np.unique(x[:, :2]) copies the data, but it also covers teams that never
# played at home (H and L in the sample above), which set(x[:, 0]) would miss.
for team in np.unique(x[:, :2]):
    process_team(x, team, output)
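To connect this back to the DataFrame in the question, here is one possible wrapper, offered as a sketch only. The function add_streak_columns and the column names Home_Streak and Away_Streak are assumptions of mine; the sketch sorts by Date first and returns a new, re-indexed frame.

```python
import numpy as np
import pandas as pd

def process_team(data, team, output):
    played = np.flatnonzero((data[:, :2] == team).any(axis=1))
    won = data[played, -1] == team
    wins = np.r_[0, won, 0]
    switch_indices = np.flatnonzero(np.diff(wins)) + 1
    streaks = np.diff(switch_indices)[::2]
    wins[switch_indices[1::2]] = -streaks
    streak_counts = np.cumsum(wins[:-2])
    home_mask = data[played, 0] == team
    output[played[home_mask], 0] = streak_counts[home_mask]
    output[played[~home_mask], 1] = streak_counts[~home_mask]

def add_streak_columns(df):
    # Stable sort keeps same-day games in their original relative order.
    df = df.sort_values("Date", kind="mergesort").reset_index(drop=True)
    data = df[["Home_Team", "Away_Team", "Winner"]].to_numpy()
    output = np.zeros((len(df), 2), dtype=int)
    for team in pd.unique(data[:, :2].ravel()):
        process_team(data, team, output)
    df["Home_Streak"] = output[:, 0]   # hypothetical column names
    df["Away_Streak"] = output[:, 1]
    return df

demo = pd.DataFrame({
    "Date": ["2005-08-06", "2005-08-13", "2005-08-20", "2005-08-27"],
    "Home_Team": ["A", "B", "A", "A"],
    "Away_Team": ["B", "A", "C", "B"],
    "Winner": ["A", "A", "C", "A"],
})
result = add_streak_columns(demo)
print(result["Home_Streak"].tolist())  # [0, 0, 2, 0]
print(result["Away_Streak"].tolist())  # [0, 1, 0, 0]
```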
An elegant way:
new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
print(new_df)
variable Streak_Away_Team Streak_Home_Team
index
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
5 0.0 0.0
6 0.0 1.0
7 0.0 1.0
8 1.0 0.0
9 1.0 0.0
10 1.0 0.0
11 0.0 1.0
12 1.0 0.0
13 1.0 2.0
14 2.0 0.0
15 0.0 0.0
16 2.0 2.0
17 0.0 2.0
#new_df = df.assign(**new_df) #you could use join or assign
new_df = df.join(new_df)
print(new_df)
Date Home_Team Away_Team Winner Streak_Away_Team Streak_Home_Team
0 2005-08-06 A G A 0.0 0.0
1 2005-08-06 B H H 0.0 0.0
2 2005-08-06 C I C 0.0 0.0
3 2005-08-06 D J J 0.0 0.0
4 2005-08-06 E K K 0.0 0.0
5 2005-08-06 F L F 0.0 0.0
6 2005-08-13 A B A 0.0 1.0
7 2005-08-13 C D D 0.0 1.0
8 2005-08-13 E F F 1.0 0.0
9 2005-08-13 G H H 1.0 0.0
10 2005-08-13 I J J 1.0 0.0
11 2005-08-13 K L K 0.0 1.0
12 2005-08-20 B C B 1.0 0.0
13 2005-08-20 A D A 1.0 2.0
14 2005-08-20 G K K 2.0 0.0
15 2005-08-20 I E E 0.0 0.0
16 2005-08-20 F H F 2.0 2.0
17 2005-08-20 J L J 0.0 2.0
This assumes that a team plays at most once per day.
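One thing to watch: as far as I can tell, cum_wins_prev above is a running total of earlier wins and does not reset after a loss (row 12 shows 1.0 for away team C, although C lost on 2005-08-13). If a streak that resets is what you need, the following sketch shows one way to get it in pandas, by grouping on each team's running loss count so that every run of wins gets its own group. The function name and the Home_Streak / Away_Streak column names are hypothetical, and the sketch assumes the DataFrame index is unique.

```python
import pandas as pd

def add_streaks_pandas(df):
    games = (df.reset_index()
               .melt(id_vars=["index", "Date", "Winner"],
                     value_vars=["Home_Team", "Away_Team"],
                     var_name="side", value_name="team")
               .sort_values(["Date", "index"], kind="mergesort"))
    games["won"] = games["team"].eq(games["Winner"])
    # Number of losses so far per team: constant within one run of wins.
    losses_so_far = (~games["won"]).astype(int).groupby(games["team"]).cumsum()
    # Wins accumulated inside the current run, including the current game.
    run_total = (games["won"].astype(int)
                 .groupby([games["team"], losses_so_far]).cumsum())
    # Shift by one game per team so each row only sees earlier games.
    games["streak"] = run_total.groupby(games["team"]).shift(fill_value=0)
    wide = games.pivot(index="index", columns="side", values="streak")
    out = df.copy()
    out["Home_Streak"] = wide["Home_Team"]
    out["Away_Streak"] = wide["Away_Team"]
    return out

demo = pd.DataFrame({
    "Date": ["2005-08-06", "2005-08-13", "2005-08-20", "2005-08-27"],
    "Home_Team": ["A", "B", "A", "A"],
    "Away_Team": ["B", "A", "C", "B"],
    "Winner": ["A", "A", "C", "A"],
})
result = add_streaks_pandas(demo)
print(result)
```

In the demo, A wins twice, loses to C, and therefore starts its last home game with a streak of 0 rather than 2.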
Timings
%%timeit
df["Streak"] = 0

def home_streak(x):  # x is a row of the DataFrame
    """Keep track of a team's winstreak"""
    home_team = x["Home_Team"]
    date = x["Date"]
    # all previous matches for the home team
    home_df = df[(df["Home_Team"] == home_team) | (df["Away_Team"] == home_team)]
    home_df = home_df[home_df["Date"] < date].sort_values(by="Date", ascending=False).reset_index()
    if len(home_df.index) == 0:  # no previous matches for that team, so start streak at 0
        return 0
    elif home_df.iloc[0]["Winner"] != home_team:  # lost the last match
        return 0
    else:  # they won the last game
        winners = home_df["Winner"]
        streak = 0
        for i in winners.index:
            if home_df.iloc[i]["Winner"] == home_team:
                streak += 1
            else:  # they lost, return the streak
                return streak
        return streak  # the team never lost, so every previous match counts

df["Streak"] = df.apply(lambda x: home_streak(x), axis=1)
66.2 ms ± 9.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_df = (df.reset_index()
            .melt(['index', 'Date', 'Winner'])
            .assign(win=lambda x: x['value'].eq(x.Winner))
            .sort_values('Date')
            .assign(cum_wins=lambda x: x.groupby('value')['win'].cumsum())
            .assign(cum_wins_prev=lambda x: x.groupby('value')['cum_wins'].shift(fill_value=0))
            .pivot_table(index='index', values='cum_wins_prev', columns='variable')
            .add_prefix('Streak_')
         )
new_df=df.assign(**new_df)
29.5 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
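I couldn't think of a pure pandas solution, but you can assign a group number to each date with ngroup and then build the groups with a defaultdict, so that you can look up the cumulative results: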
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
df["group"] = df.groupby("Date").ngroup()
# d[group][team] = win streak of team up to and including that date
for a, b in zip(df["Winner"], df["group"]):
    d[b][a] = 1 + d.get(b-1, {}).get(a, 0)
# look up the home team's streak as of the previous date
df["Streak"] = [d.get(y-1, {}).get(x, 0) for x, y in zip(df["Home_Team"], df["group"])]
print(df.drop(columns="group"))
Date Home_Team Away_Team Winner Streak
0 2005-08-06 A G A 0
1 2005-08-06 B H H 0
2 2005-08-06 C I C 0
3 2005-08-06 D J J 0
4 2005-08-06 E K K 0
5 2005-08-06 F L F 0
6 2005-08-13 A B A 1
7 2005-08-13 C D D 1
8 2005-08-13 E F F 0
9 2005-08-13 G H H 0
10 2005-08-13 I J J 0
11 2005-08-13 K L K 1
12 2005-08-20 B C B 0
13 2005-08-20 A D A 2
14 2005-08-20 G K K 0
15 2005-08-20 I E E 0
16 2005-08-20 F H F 2
17 2005-08-20 J L J 2
18 2005-08-27 A H A 3
19 2005-08-27 B F B 1
20 2005-08-27 J C C 3
21 2005-08-27 D E D 0
22 2005-08-27 I K K 0
23 2005-08-27 L G G 0
24 2005-09-05 B A A 2
25 2005-09-05 D C D 1
26 2005-09-05 F E F 0
27 2005-09-05 H G H 0
28 2005-09-05 J I I 0
29 2005-09-05 K L K 4
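As written, the lookup above only consults the previous date group, so it appears to rely on every team playing on every date. For comparison, here is a minimal single-pass sketch that keeps one running streak per team instead. It is a different technique from the answer above; it assumes the rows are already in chronological order and that Winner is always one of the two teams (no draws). The function name home_streaks is hypothetical.

```python
def home_streaks(home, away, winner):
    """Return, for each game, the home team's win streak before that game."""
    current = {}   # team -> current run of consecutive wins
    out = []
    for h, a, w in zip(home, away, winner):
        out.append(current.get(h, 0))        # streak the home team brings in
        loser = a if w == h else h
        current[w] = current.get(w, 0) + 1   # winner extends its run
        current[loser] = 0                   # loser's run is reset
    return out

result = home_streaks(
    ["A", "B", "A", "A"],   # Home_Team
    ["B", "A", "C", "B"],   # Away_Team
    ["A", "A", "C", "A"],   # Winner
)
print(result)  # [0, 0, 2, 0]
```

With the DataFrame from the question this could be called as home_streaks(df["Home_Team"], df["Away_Team"], df["Winner"]) and the returned list assigned to df["Streak"].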
Fix in progress!
This may be the simplest way -
import numpy as np

def get_streak(l, m, n):
    # running count of n's wins, shifted one row so each row only sees earlier games
    wins = np.roll(np.cumsum([1 if i == n else 0 for i in l]), 1)
    wins[0] = 0
    # 1 where n is the home team, 0 elsewhere
    filts = np.array([1 if i == n else 0 for i in m])
    mul = np.multiply(wins, filts)
    return mul

streaks = np.zeros(len(df), dtype=int)
l = list(df['Winner'])
m = list(df['Home_Team'])
for i in df['Winner'].unique():
    streaks += get_streak(l, m, i)
df['streaks'] = streaks
Date Home_Team Away_Team Winner streaks
0 2005-08-06 A G A 0
1 2005-08-06 B H H 0
2 2005-08-06 C I C 0
3 2005-08-06 D J J 0
4 2005-08-06 E K K 0
5 2005-08-06 F L F 0
6 2005-08-13 A B A 1
7 2005-08-13 C D D 1
8 2005-08-13 E F F 0
9 2005-08-13 G H H 0
10 2005-08-13 I J J 0
11 2005-08-13 K L K 1
12 2005-08-20 B C B 0
13 2005-08-20 A D A 2
14 2005-08-20 G K K 0
15 2005-08-20 I E E 0
16 2005-08-20 F H F 2
17 2005-08-20 J L J 2
18 2005-08-27 A H A 3
19 2005-08-27 B F B 1
20 2005-08-27 J C C 3
21 2005-08-27 D E D 1
22 2005-08-27 I K K 0
23 2005-08-27 L G G 0
24 2005-09-05 B A A 2
25 2005-09-05 D C D 2
26 2005-09-05 F E F 3
27 2005-09-05 H G H 2
28 2005-09-05 J I I 3
29 2005-09-05 K L K 4
It is quite simple -
A few print statements make it easier to see how the function works -
def get_streak(l, m, n):
    wins = np.roll(np.cumsum([1 if i == n else 0 for i in l]), 1)
    wins[0] = 0
    print('wins:', wins)
    filts = np.array([1 if i == n else 0 for i in m])
    print('home:', filts)
    mul = np.multiply(wins, filts)
    print('strk:', mul)
    return mul

streak_A = get_streak(l, m, 'A')
wins: [0 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5]
home: [1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
strk: [0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0]
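The element-wise sum of all the per-team streak arrays is what you are looking for.
Benchmark (appears to be the fastest of all the answers) -
529 µs ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)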