
Get Trend/Streak in Each Row of Pandas DataFrame

I have a Pandas DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0],
                   ['D', -0.1, -1.0, -4.0, -3.3, -1.0],
                   ['E', np.nan, np.nan, np.nan, np.nan, np.nan],
                   ['F', 4.0, np.nan, np.nan, np.nan, np.nan]
                  ], columns=['Group', '1', '2', '3', '4', '5'])


  Group    1    2    3    4    5  
0     A  0.1  2.0  1.0  0.5  0.3  
1     B -0.3 -0.4  0.1  0.2 -1.0  
2     C  0.1 -1.0  4.0 -3.3  1.0  
3     D -0.1 -1.0 -4.0 -3.3 -1.0  
4     E  NaN  NaN  NaN  NaN  NaN  
5     F  4.0  NaN  NaN  NaN  NaN  

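For each row, I would like to return the trend/streak of consecutive positive/negative values going from left to right. So the final DataFrame should be: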

  Group    1    2    3    4    5  Streak  
0     A  0.1  2.0  1.0  0.5  0.3       5   
1     B -0.3 -0.4  0.1  0.2 -1.0      -2   
2     C  0.1 -1.0  4.0 -3.3  1.0       1   
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5   
4     E  NaN  NaN  NaN  NaN  NaN       0    
5     F  4.0  NaN  NaN  NaN  NaN       1 

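The first row has a streak of +5 because every value is positive going from left to right. The second row has a streak of -2 because the first two columns are negative and the streak ends at column 3 with a positive value. The third row has a streak of +1 because the second column has the opposite sign to the first column. The fifth row is all NaN, so its streak is zero.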
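To pin the rule down, here is a minimal pure-Python sketch of the "leading streak" as I read it from the description above. The helper name leading_streak is made up for illustration, and treating a zero or NaN in the first column as "no streak" is my assumption, chosen so that rows E and F come out as in the expected table.

```python
def leading_streak(values):
    """Signed length of the run of same-sign values that starts at the first element.

    Assumption: a zero or NaN in the first position means no streak (0),
    and a zero or NaN later in the row ends the streak.
    """
    def sign(x):
        if x > 0:
            return 1
        if x < 0:
            return -1
        return 0  # zero or NaN (every comparison with NaN is False)

    values = list(values)
    if not values:
        return 0
    first = sign(values[0])
    if first == 0:
        return 0
    length = 0
    for v in values:
        if sign(v) != first:
            break
        length += 1
    return first * length


nan = float("nan")
rows = {
    "A": [0.1, 2.0, 1.0, 0.5, 0.3],
    "B": [-0.3, -0.4, 0.1, 0.2, -1.0],
    "C": [0.1, -1.0, 4.0, -3.3, 1.0],
    "D": [-0.1, -1.0, -4.0, -3.3, -1.0],
    "E": [nan, nan, nan, nan, nan],
    "F": [4.0, nan, nan, nan, nan],
}
streaks = {group: leading_streak(vals) for group, vals in rows.items()}
print(streaks)  # {'A': 5, 'B': -2, 'C': 1, 'D': -5, 'E': 0, 'F': 1}
```

This is only meant as a readable reference to check faster solutions against, not as something to run on a large frame.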

This is a bit cumbersome, but it seems to do everything you need:

import pandas as pd

def streak(row):
    # Brute force: test every contiguous slice of the row and keep the longest
    # all-negative and all-positive slices, remembering where each one starts.
    n_cols = len(row)

    neg_streak = 0
    pos_streak = 0
    i_neg_streak = n_cols
    i_pos_streak = n_cols

    for icol_1 in range(n_cols):
        for icol_2 in range(icol_1, n_cols):
            # .iloc replaces .ix, which was removed in pandas 1.0
            window = row.iloc[icol_1: icol_2 + 1]
            if (window < 0).all():
                length = icol_1 - icol_2 - 1  # negative length for a negative run
                if length < neg_streak:
                    neg_streak = length
                    i_neg_streak = icol_1
            elif (window > 0).all():
                length = 1 + icol_2 - icol_1
                if length > pos_streak:
                    pos_streak = length
                    i_pos_streak = icol_1

    if pos_streak == abs(neg_streak):
        if i_pos_streak < i_neg_streak:
            return pos_streak
        else:
            return neg_streak
    elif pos_streak > abs(neg_streak):
        return pos_streak
    else:
        return neg_streak

df = pd.DataFrame([['A', 0.1, 2.0, 1.0, 0.5, 0.3],
                   ['B', -0.3, -0.4, 0.1, 0.2, -1.0],
                   ['C', 0.1, -1.0, 4.0, -3.3, 1.0]
                   ], columns=['Group', '1', '2', '3', '4', '5'])

df = df.set_index('Group')
df['Streak'] = df.apply(streak, axis=1)
df = df.reset_index()

print(df)

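I am assuming you want the longest streak. I can't make any promises about ties... This answer uses itertools.groupby. First, to see what happens behind the scenes, here is what groupby does: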

In [4]: from itertools import groupby
        b = [-0.3, -0.4, 0.1, 0.2, -1.0]
        for k, g in groupby(b, key=lambda x: x > 0.0):
            print(k, list(g))

False [-0.3, -0.4]
True [0.1, 0.2]
False [-1.0]

Now wrap that in a function that makes use of the grouping:

def streak(dfrow):
    longest = 0
    # Key is True for positive values, False for negative, NaN for zero/missing
    for k, g in groupby(dfrow, key=lambda x: False if x < 0 else True if x > 0 else np.nan):
        cur_streak = len(list(g))
        if np.isnan(k):  # skip runs of zero/NaN values
            continue
        if abs(longest) < cur_streak:
            # positive runs are stored as +length, negative runs as -length
            longest = cur_streak if k else -cur_streak
    return longest

Apply the function to each row using df.apply:

In [6]: df.set_index('Group',inplace=True)
        df['LongestStreak'] = df.apply(streak, axis=1)

Result:

In [281]: df
Out[281]:
         1    2    3    4    5  LongestStreak
Group
A      0.1  2.0  1.0  0.5  0.3              5
B     -0.3 -0.4  0.1  0.2 -1.0             -2
C      0.1 -1.0  4.0 -3.3  1.0              1
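Note that the function above returns the longest run anywhere in the row, which happens to agree with the expected output for these three rows. If what is wanted is only the run that starts at the first column, a small variation on the same groupby idea might look like the sketch below. The names sign_key and first_streak are hypothetical, and mapping zero/NaN to None is my own choice rather than anything from the answer.

```python
from itertools import groupby


def sign_key(x):
    """True for positive, False for negative, None for zero or NaN."""
    if x > 0:
        return True
    if x < 0:
        return False
    return None


def first_streak(values):
    """Signed length of the first run only (hypothetical helper)."""
    for key, group in groupby(values, key=sign_key):
        if key is None:          # row starts with zero/NaN: no streak
            return 0
        n = len(list(group))
        return n if key else -n  # only the first group is ever inspected
    return 0                     # empty input


nan = float("nan")
rows = [
    [0.1, 2.0, 1.0, 0.5, 0.3],
    [-0.3, -0.4, 0.1, 0.2, -1.0],
    [0.1, -1.0, 4.0, -3.3, 1.0],
    [-0.1, -1.0, -4.0, -3.3, -1.0],
    [nan, nan, nan, nan, nan],
    [4.0, nan, nan, nan, nan],
]
result = [first_streak(r) for r in rows]
print(result)  # [5, -2, 1, -5, 0, 1]
```

With the DataFrame indexed by Group as above, it should be usable the same way, e.g. df.apply(first_streak, axis=1).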

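Edit

Updated to handle the new DataFrame and added a benchmark. Your approach may scale better, but I did not know how to modify the code to produce the result.

Results: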

%%timeit
df['LongestStreak'] = df.apply(streak, axis=1)

1000 loops, best of 3: 473 µs per loop


%%timeit
a = (df[['1', '2', '3', '4', '5']] >= 0).values # Get True/False values
diff = a[:, :-1] == a[:, 1:]
false_col = np.zeros((a.shape[0], 1), dtype=bool)  # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)
df['Streak'] = np.argmin(diff, axis=1) + 1
df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)

100 loops, best of 3: 2.94 ms per loop

This does the trick and is more intuitive/vectorized:

a = (df[['1', '2', '3', '4', '5']] >= 0).values  # Get True/False values
diff = a[:, :-1] == a[:, 1:]  # Compare values from neighboring columns

So diff looks like this (rows A to D shown):

[[ True  True  True  True]
 [ True False  True False]
 [False False False False]
 [ True  True  True  True]]

Then we append a column of False so that every row contains at least one False:

false_col = np.zeros((a.shape[0], 1), dtype=bool)  # Create a column of False
diff = np.concatenate((diff, false_col), axis=1)  # Add False column to end of diff

[[ True  True  True  True False]
 [ True False  True False False]
 [False False False False False]
 [ True  True  True  True False]]

Next, we find the streak of True values by looking for the first occurrence of False:

df['Streak'] = np.argmin(diff, axis=1) + 1  # Add 1 to the index to get the streak
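To see why the extra False column matters before taking argmin, here is a small standalone sketch. The array same_sign is made-up toy data standing in for diff; np.argmin returns the index of the first minimum, and on a boolean array the minimum is False whenever one is present.

```python
import numpy as np

same_sign = np.array([
    [True, True, True, True],      # the whole row keeps one sign
    [True, False, True, False],    # sign flips after the second column
    [False, False, False, False],  # sign flips immediately
])

# Without a sentinel, an all-True row has no False, so argmin falls back to index 0
without_sentinel = np.argmin(same_sign, axis=1)
print(without_sentinel)  # [0 1 0]

# With a trailing False column, the all-True row correctly reports the full length
sentinel = np.zeros((same_sign.shape[0], 1), dtype=bool)
padded = np.concatenate((same_sign, sentinel), axis=1)
with_sentinel = np.argmin(padded, axis=1) + 1
print(with_sentinel)     # [5 2 1]
```

Without the sentinel, the first and third rows would be indistinguishable.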

Finally, we adjust the sign of the streak value according to the sign of the first column:

df['Sign'] = df['1']
df['Sign'] = np.where(df['Sign'] > 0, 1, df['Sign'])
df['Sign'] = np.where(df['Sign'] < 0, -1, df['Sign'])
df['Sign'] = np.where(df['Sign'].isnull(), 0, df['Sign'])
df['Streak'] = df['Streak'] * df['Sign']
df['Streak'] = df['Streak'].astype(int)
df.drop('Sign', axis=1, inplace=True)
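As an aside, the three np.where calls can probably be collapsed into np.sign followed by np.nan_to_num. The sketch below compares both on a plain array holding the values of column '1'; the variable names are mine, and np.isnan stands in for the pandas .isnull() call.

```python
import numpy as np

first_col = np.array([0.1, -0.3, 0.1, -0.1, np.nan, 4.0])  # column '1' of the example

# Step-by-step version, mirroring the three np.where calls
sign = np.where(first_col > 0, 1, first_col)
sign = np.where(sign < 0, -1, sign)
sign = np.where(np.isnan(sign), 0, sign)

# Compact version: np.sign keeps NaN as NaN, nan_to_num then turns it into 0
compact = np.nan_to_num(np.sign(first_col))

print(np.array_equal(sign, compact))  # True
```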

The final DataFrame looks like this:

  Group    1    2    3    4    5  Streak  
0     A  0.1  2.0  1.0  0.5  0.3       5  
1     B -0.3 -0.4  0.1  0.2 -1.0      -2  
2     C  0.1 -1.0  4.0 -3.3  1.0       1  
3     D -0.1 -1.0 -4.0 -3.3 -1.0      -5  
4     E  NaN  NaN  NaN  NaN  NaN       0  
5     F  4.0  NaN  NaN  NaN  NaN       1  
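One caveat, as far as I can tell: because the comparison is >= 0, a NaN is mapped to False and therefore behaves like a negative value. That is harmless for the six rows here, but a row such as -1.0 followed by four NaN would presumably come out as -5 rather than -1. A possible NaN-aware variant, sketched on a plain NumPy array, compares np.sign of neighbouring columns instead. The function name leading_streak_vectorized is hypothetical, and treating zero as its own sign (streak 0 when the first column is zero) is my assumption.

```python
import numpy as np


def leading_streak_vectorized(values):
    """Signed length of the leading same-sign run in each row of a 2-D float array."""
    values = np.asarray(values, dtype=float)
    signs = np.sign(values)                  # -1, 0, 1, or NaN
    # NaN never compares equal to anything, so a missing value always ends the streak
    same = signs[:, 1:] == signs[:, :-1]
    sentinel = np.zeros((values.shape[0], 1), dtype=bool)
    same = np.concatenate((same, sentinel), axis=1)
    length = np.argmin(same, axis=1) + 1     # position of the first False
    first_sign = np.nan_to_num(signs[:, 0])  # NaN in the first column -> 0
    return (length * first_sign).astype(int)


nan = np.nan
data = [
    [0.1, 2.0, 1.0, 0.5, 0.3],
    [-0.3, -0.4, 0.1, 0.2, -1.0],
    [0.1, -1.0, 4.0, -3.3, 1.0],
    [-0.1, -1.0, -4.0, -3.3, -1.0],
    [nan, nan, nan, nan, nan],
    [4.0, nan, nan, nan, nan],
    [-1.0, nan, nan, nan, nan],  # extra row, not in the question
]
result = leading_streak_vectorized(data)
print(result.tolist())  # [5, -2, 1, -5, 0, 1, -1]
```

On the DataFrame it could be applied as df['Streak'] = leading_streak_vectorized(df[['1', '2', '3', '4', '5']].to_numpy()).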

