Count consecutive rows whose value is greater than the current row's value but less than the value in another column

Question

Suppose I have the following sample DataFrame (the real one has about 25k rows):

import pandas as pd

df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
   A  B
0  0  9
1  3  8
2  2  3
3  9  5
4  1  5
5  0  5
6  4  5
7  7  8
8  3  0
9  2  4

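For column A, I need to know how many of the immediately following rows, and how many of the immediately preceding rows, have an A value greater than the current row's A value but less than their own B value. Counting stops at the first row that fails the condition.

So my expected output is: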

A   B next count  previous count
0   9     2          0
3   8     0          0
2   3     0          1
9   5     0          0
1   5     0          0
0   5     2          1
4   5     1          0
7   8     0          0
3   0     0          2
2   4     0          0

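Explanation:

First row: next count is 2, because 3 and 2 are greater than 0 but less than their corresponding B values, 8 and 3.
Second row: next count is 0, because the next value, 2, is not greater than 3.
Third row: next count is 0, because 9 is greater than 2 but not less than its corresponding B value, 5.

The previous count is computed the same way, looking backwards.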
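The rule described above can be written down as a plain-Python reference implementation. This is only a sketch for checking results against the expected output, not a fast solution; the helper name `count_run` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 3, 2, 9, 1, 0, 4, 7, 3, 2],
                   'B': [9, 8, 3, 5, 5, 5, 5, 8, 0, 4]})

def count_run(a, b, start, step):
    """Walk from `start` in direction `step` (+1 or -1) and count the
    consecutive rows j for which a[start] < a[j] < b[j]."""
    current = a[start]
    j = start + step
    n = 0
    while 0 <= j < len(a) and current < a[j] < b[j]:
        n += 1
        j += step
    return n

a = df['A'].tolist()
b = df['B'].tolist()

# step=+1 looks at the following rows, step=-1 at the preceding rows
next_count = [count_run(a, b, i, +1) for i in range(len(a))]
prev_count = [count_run(a, b, i, -1) for i in range(len(a))]
```

Working on plain lists avoids the per-row pandas slicing, which is where most of the time tends to go in an apply-based version.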

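Note: I know how to solve this with a loop, a list comprehension, or the pandas apply method, and I would still welcome a clean, concise apply approach. What I am really looking for is a more idiomatic pandas way.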

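My solution

Here is an apply-based solution, which I think is inefficient. Also, as people have pointed out, there may be no vectorized solution to this problem. As mentioned, a more efficient apply solution would be accepted for this question.

Here is what I tried.

This function gets the number of previous/next rows that satisfy the condition.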

def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:,['A', 'B']]
    prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())

Generating the output:

df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")

Output:

This gives the expected output:

df
   A  B  next count  previous count
0  0  9           2               0
1  3  8           0               0
2  2  3           0               1
3  9  5           0               0
4  1  5           0               0
5  0  5           2               1
6  4  5           1               0
7  7  8           0               0
8  3  0           0               2
9  2  4           0               0

Answer 1

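I made a few optimizations:

You don't need reset_index(); you can access the index label through .name.
Passing only df[['A']] instead of the whole frame may help.
prev_nrow.empty is equivalent to (prev_nrow.size == 0).
Getting the required value through different logic, in first_false, sped things up dramatically.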
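The first and third points can be checked in isolation. The snippet below is a small illustration on a toy frame (the frame and its index labels are invented for the example), separate from the answer's actual code:

```python
import pandas as pd

# Toy frame with non-default index labels so that .name is easy to see
toy = pd.DataFrame({'A': [0, 3, 2]}, index=[10, 11, 12])

# Inside apply(..., axis=1) each row is a Series whose .name is its index label,
# so no reset_index() is needed to know which row is being processed
labels = toy.apply(lambda row: row.name, axis=1).tolist()

# An empty slice reports both .empty == True and .size == 0
empty_slice = toy.iloc[0:0]
is_empty = empty_slice.empty
size_is_zero = empty_slice.size == 0
```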

def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:,['A', 'B']]
    prev_nrow = df2.loc[row.name-1:,['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~

df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)

Output:

   A  B  next count  previous count
0  0  9           2               0
1  3  8           0               0
2  2  3           0               1
3  9  5           0               0
4  1  5           0               0
5  0  5           2               1
6  4  5           1               0
7  7  8           0               0
8  3  0           0               2
9  2  4           0               0

Timing

Expanded data:

df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)

Original method:

Gave up after 15 minutes.

New method:

1m 20s

Throwing pandarallel at it:

from pandarallel import pandarallel
pandarallel.initialize()

df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')

26 seconds

Answer 2

Although not all of this computation can be vectorized, some precomputed values can be used to speed it up:

df['in_range'] = df['A'].between(df.index, df['B'])

def count_in_range(row):
    # check how many consecutive values in df['in_range']
    # exist starting from the desired row
    # and ending with the first row where the df['A']
    # value is too big or too small (where df['in_range'])
    # is False
    pass

df.apply(count_in_range, axis=1)
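One possible way to fill in the idea is sketched below. It rests on an assumption that is not stated in the answer: that the row-independent part worth precomputing is the mask `A < B`, which is not the same as the `between` call shown above. The function names `run_length` and `counts_with_precomputed_mask` are hypothetical.

```python
import numpy as np
import pandas as pd

def run_length(mask):
    """Number of leading True values in a boolean array."""
    stops = np.flatnonzero(~mask)
    return int(stops[0]) if stops.size else int(mask.size)

def counts_with_precomputed_mask(df):
    a = df['A'].to_numpy()
    # Assumed precomputation: whether each row's A is below its own B.
    # This part does not depend on the current row, so it is done once.
    below_b = a < df['B'].to_numpy()
    n = len(a)
    nxt = np.zeros(n, dtype=int)
    prv = np.zeros(n, dtype=int)
    for i in range(n):
        # Only the comparison against the current row's A is redone per row
        fwd = below_b[i + 1:] & (a[i + 1:] > a[i])
        bwd = below_b[:i][::-1] & (a[:i][::-1] > a[i])
        nxt[i] = run_length(fwd)
        prv[i] = run_length(bwd)
    return nxt, prv

df = pd.DataFrame({'A': [0, 3, 2, 9, 1, 0, 4, 7, 3, 2],
                   'B': [9, 8, 3, 5, 5, 5, 5, 8, 0, 4]})
nxt, prv = counts_with_precomputed_mask(df)
```

Each row still scans the remaining rows, so the worst case stays quadratic; the gain comes from doing the comparisons on NumPy arrays instead of slicing the DataFrame for every row.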

Solution 1
1 2022-07-20 05:50:04

Solution 2
0 2022-07-19 16:40:53
