Get Trend/Streak in Each Row of Pandas DataFrame
I have a Pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                  ], columns=['Group', '1', '2', '3', '4', '5'])
  Group    1    2    3    4    5
0     A  0.1  2.0  1.0  0.5  0.3
1     B -0.3 -0.4  0.1  0.2 -1.0
2     C  0.1 -1.0  4.0 -3.3  1.0
3     D -0.1 -1.0 -4.0 -3.3 -1.0
4     E  NaN  NaN  NaN  NaN  NaN
5     F  4.0  NaN  NaN  NaN  NaN
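For each row, I want to return the trend/streak of consecutive positive/negative values, going from left to right. So the final DataFrame should be: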
  Group    1    2    3    4    5  Streak
0     A  0.1  2.0  1.0  0.5  0.3       5
1     B -0.3 -0.4  0.1  0.2 -1.0      -2
2     C  0.1 -1.0  4.0 -3.3  1.0       1
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5
4     E  NaN  NaN  NaN  NaN  NaN       0
5     F  4.0  NaN  NaN  NaN  NaN       1
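The first row (A) has a streak of +5 because the values are all positive going from left to right. The second row (B) has a streak of -2 because the first two columns are negative and the streak ends at column 3 with a positive value. The third row (C) has a streak of +1 because the second column has the opposite sign to the first column. Row E is all NaN, so its streak is zero.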
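To make the rule concrete, here is a minimal plain-Python sketch of the streak definition for a single row. The helper name `leading_streak` is hypothetical, and treating zero and NaN as "no sign" (which ends the streak) is an assumption read off the expected output, not something stated explicitly above.

```python
import math

def leading_streak(values):
    """Signed length of the run of same-sign values at the start of `values`."""
    def sign(x):
        # Assumption: NaN and zero carry no sign and therefore end a streak
        if math.isnan(x) or x == 0:
            return 0
        return 1 if x > 0 else -1

    if not values:
        return 0
    first = sign(values[0])
    if first == 0:
        return 0
    length = 0
    for v in values:
        if sign(v) != first:
            break
        length += 1
    return first * length

nan = float("nan")
print(leading_streak([0.1, 2.0, 1.0, 0.5, 0.3]))     # 5
print(leading_streak([-0.3, -0.4, 0.1, 0.2, -1.0]))  # -2
print(leading_streak([4.0, nan, nan, nan, nan]))     # 1
```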
This is a bit clunky, but it seems to do everything you need:
import pandas as pd

def streak(row):
    """Return the longest same-sign run in a row (negative runs as negative numbers)."""
    n_cols = len(row)
    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols  # start index of the longest negative run
    i_pos_streak = n_cols  # start index of the longest positive run
    # range(n_cols), not range(n_cols - 1), so a run starting in the last column is not missed
    for icol_1 in range(n_cols):
        for icol_2 in range(icol_1, n_cols):
            # .iloc replaces .ix, which was removed in pandas 1.0
            window = row.iloc[icol_1: icol_2 + 1]
            if (window < 0).all():
                cur = icol_1 - icol_2 - 1
                if cur < neg_streak:
                    neg_streak = cur
                    i_neg_streak = icol_1
            elif (window > 0).all():
                cur = 1 + icol_2 - icol_1
                if cur > pos_streak:
                    pos_streak = cur
                    i_pos_streak = icol_1
    if pos_streak == abs(neg_streak):
        # Tie: return whichever run starts first
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0]
                  ], columns=['Group', '1', '2', '3', '4', '5'])
df = df.set_index('Group')
df['Streak'] = df.apply(streak, axis=1)
df = df.reset_index()
print(df)
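I've assumed that you want the longest streak. I can't promise anything about ties...
This answer uses itertools.groupby. First, a look behind the scenes at what groupby is doing: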
In [4]: from itertools import groupby
        b = [-0.3, -0.4, 0.1, 0.2, -1.0]
        for k, g in groupby(b, key=lambda x: x > 0.0):
            print(k, list(g))
False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]
Now, wrap this in a function that makes use of the grouping:
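import numpy as np
from itertools import groupby

def streak(dfrow):
    longest = 0
    # Key is True for positive values, False for negative values, NaN for zeros and NaNs
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):
            continue  # skip groups of zeros/NaNs
        if k:  # group is positive
            if abs(longest) < cur_streak:
                longest = cur_streak
        else:  # group is negative
            if abs(longest) < cur_streak:
                longest = -1 * cur_streak  # multiply by -1
    return longest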
Use df.apply to apply the function to each row:
In [6]: df.set_index('Group',inplace=True)
df['LongestStreak'] = df.apply(streak, axis=1)
Result:
In [281]: df
Out[281]:
         1    2    3    4    5  LongestStreak
Group
A      0.1  2.0  1.0  0.5  0.3              5
B     -0.3 -0.4  0.1  0.2 -1.0             -2
C      0.1 -1.0  4.0 -3.3  1.0              1
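Note that the longest streak and the streak at the start of the row can differ. For a row such as [0.1, -1.0, -2.0, -3.0, 1.0], the streak function above returns -3, while the streak counted from the left is 1. As a hedged sketch, if only the leading streak is wanted, the same groupby idea can stop at the first group. The names `sign_key` and `first_streak` are hypothetical, and returning 0 for a row that starts with zero or NaN is an assumption.

```python
from itertools import groupby
import math

def sign_key(x):
    # True for positive, False for negative, None for zero/NaN (assumed to have no sign)
    if math.isnan(x) or x == 0:
        return None
    return x > 0

def first_streak(row):
    # Only the first group matters: return its signed length straight away
    for k, g in groupby(row, key=sign_key):
        n = len(list(g))
        if k is None:
            return 0
        return n if k else -n
    return 0  # empty row

print(first_streak([0.1, -1.0, -2.0, -3.0, 1.0]))  # 1
```

It can be applied row-wise in the same way, for example with df.apply(first_streak, axis=1).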
Edit
Updated to address the new DataFrame and added benchmarks. Yours may scale better, but I didn't know how to modify the code to generate a result.
Results:
%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)
1000 loops, best of 3: 473 µs per loop
%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
100 loops, best of 3: 2.94 ms per loop
This does the trick and is more intuitive/vectorized:
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:] # Compare values from neighboring columns
So diff looks like this:
[[ True True True True]
[ True False True False]
[False False False False]
[ True True True True]]
Then,
false_col = np.zeros((a.shape[0], 1), dtype=bool) # Create a column of False
diff = np.concatenate((diff, false_col), axis=1) # Add False column to end of diff
[[ True True True True False]
[ True False True False False]
[False False False False False]
[ True True True True False]]
Next, we find the streak of True values by looking for the first occurrence of False:
df['Streak'] = np.argmin(diff, axis=1) + 1 # Add 1 to the index to get the streak
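As a small illustration of why this works: on a boolean array, False sorts below True, and np.argmin returns the index of the first minimum, so it points at the first False in each row. The array below is a hand-written example of the padded diff, not output taken from the DataFrame.

```python
import numpy as np

# Hand-written example of diff after the False column has been appended
diff = np.array([
    [True, True, True, True, False],
    [True, False, True, False, False],
    [False, False, False, False, False],
])

# Index of the first False in each row
first_false = np.argmin(diff, axis=1)
print(first_false)      # [4 1 0]
print(first_false + 1)  # [5 2 1]
```

The appended False column guarantees that every row contains at least one False, so a row of all True values yields the full row length instead of 0.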
Finally, we adjust the sign of the streak value according to the sign of the first column:
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
The final DataFrame looks like this:
  Group    1    2    3    4    5  Streak
0     A  0.1  2.0  1.0  0.5  0.3       5
1     B -0.3 -0.4  0.1  0.2 -1.0      -2
2     C  0.1 -1.0  4.0 -3.3  1.0       1
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5
4     E  NaN  NaN  NaN  NaN  NaN       0
5     F  4.0  NaN  NaN  NaN  NaN       1
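For comparison, here is a compact variant of the same vectorized idea. It is a sketch that swaps in np.sign and np.cumprod for the argmin steps above, so it is not the code from this answer. The function name `add_streak` is hypothetical, and it assumes that NaN and zero carry no sign, so a NaN always ends a streak (a row that starts with a negative value followed only by NaN would get -1).

```python
import numpy as np
import pandas as pd

def add_streak(df, cols):
    vals = df[cols].to_numpy(dtype=float)
    # Sign of each value; NaN is mapped to 0 first, so it counts as "no sign"
    signs = np.sign(np.nan_to_num(vals))
    # True where a value has the same sign as the first column of its row
    same = signs == signs[:, [0]]
    # The cumulative product stays 1 until the first mismatch, so its sum is the run length
    run = np.cumprod(same, axis=1).sum(axis=1)
    out = df.copy()
    out['Streak'] = (run * signs[:, 0]).astype(int)
    return out

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                  ], columns=['Group', '1', '2', '3', '4', '5'])

result = add_streak(df, ['1', '2', '3', '4', '5'])
print(result['Streak'].tolist())  # [5, -2, 1, -5, 0, 1]
```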