Looking for a faster way to shift values in a pandas dataframe column based on values in another column

Question

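I am trying to find a better/faster way to "advance" the values in one column once a threshold is reached in another column. Example dataframe, where 'Col1' and 'Col2' are the inputs and 'Col4' is the desired output: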

	Col1	Col2	Col3	Col4
0	0.001	0.046667	13	NaN
1	0.002	0.051667	12	NaN
2	0.002	0.056667	11	NaN
3	0.003	0.061667	11	NaN
4	0.004	0.066667	10	NaN
5	0.005	0.073333	10	NaN
6	0.006	0.078333	10	NaN
7	0.007	0.083333	9	NaN
8	0.008	0.086667	9	NaN
9	0.009	0.091667	8	NaN
10	0.009	0.096667	8	NaN
11	0.009	0.100000	8	NaN
12	0.011	0.105000	7	NaN
13	0.012	0.110000	7	0.002
14	0.013	0.116667	6	0.004
15	0.012	0.121667	5	0.005
16	0.011	0.128333	4	0.007
17	0.010	0.136667	3	0.009
18	0.009	0.143333	2	0.009
19	0.008	0.150000	1	0.011

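For each row, I want to find the row at which the cumulative sum of 'Col2', starting from the current row, reaches 1 (this threshold will be a variable). I then want to shift, or copy/paste, the value from 'Col1' into that row in a new column, 'Col4'.

'Col3' is the number of rows each value needs to be shifted, which may explain better what I am trying to achieve.

The following for loop works, but it is very slow on large datasets, taking about 5 seconds to process 10,000 rows: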

   for x in range(len(df)):
       ind = df.loc[x::, 'Col2'].cumsum().searchsorted(1)
       df.loc[x + ind, 'Col4'] = df.loc[x, 'Col1']

I would greatly appreciate any help speeding this up.

Answer 1

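I investigated several approaches:
1. A custom cumsum() that stops once the limit is reached
2. Using numba to speed up 1
3. A numpy solution that creates new arrays
The shift/offset is better expressed as an index, so that it can be used with iloc.
If you want the value of Col1 where cumsum() > n, your sample looks wrong.
Performance timings are below...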

import numpy as np
import pandas as pd
from numba import jit

NaN = np.nan
df = pd.DataFrame({"Col1": [0.001, 0.002, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.009, 0.009, 0.011, 0.012, 0.013, 0.012, 0.011, 0.01, 0.009, 0.008], 
 "Col2": [0.046667, 0.051667, 0.056667, 0.061667, 0.066667, 0.073333, 0.078333, 0.083333, 0.086667, 0.091667, 0.096667, 0.1, 0.105, 0.11, 0.116667, 0.121667, 0.128333, 0.136667, 0.143333, 0.15], 
 "Col3": [13, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 7, 7, 6, 5, 4, 3, 2, 1], 
 "Col4": [NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 0.002, 0.004, 0.005, 0.007, 0.009, 0.009, 0.011]})

@jit(nopython=True)
def minidxjit(s, n=1):
    t=0
    i=0
    for v in s:
        t+=v
        if t>n: break
        i += 1
    return i if t>n else -1

def minidx(s, n=1):
    t=0
    for i,v in enumerate(s):
        t+=v
        if t>n: break
    return i if t>n else -1


def lkup(dfa, idx=False, n=1, jit=True):
    i = minidxjit(dfa.loc[:,"Col2"].values, n=n) if jit else minidx(dfa.loc[:,"Col2"].values, n=n)
    if idx: return i
    else: return np.nan if i==-1 else dfa.iloc[i,0]

n=1

df.assign(sh=pd.Series(df.index).apply(lambda x: df.iloc[x:,1].cumsum().searchsorted(n)),
          sh2=pd.Series(df.index).apply(lambda x: lkup(df.iloc[x:,], idx=True, n=n)),
          val=pd.Series(df.index).apply(lambda x: lkup(df.iloc[x:,], n=n)),

         )

def faststuff(df, timeit=False):
    d = {"Col3f":[np.argmin(np.cumsum(df.Col2.values[s:])<n) for s in range(len(df))],
        "Col3fr":[np.argmin(np.cumsum(df.Col2.values[s:])<n)+s for s in range(len(df))],
        "Col4f":[df.iloc[np.argmin(np.cumsum(df.Col2.values[s:])<n)+s,0] for s in range(len(df))]}
    if timeit:
        d.pop("Col3f",None)
        d.pop("Col3fr",None)
    return df.assign(**d)

faststuff(df)

Output on the sample data

	Col1	Col2	Col3	Col4	Col3f	Col3fr	Col4f
0	0.001	0.046667	13	NaN	13	13	0.012
1	0.002	0.051667	12	NaN	12	13	0.012
2	0.002	0.056667	11	NaN	11	13	0.012
3	0.003	0.061667	11	NaN	11	14	0.013
4	0.004	0.066667	10	NaN	10	14	0.013
5	0.005	0.073333	10	NaN	10	15	0.012
6	0.006	0.078333	10	NaN	10	16	0.011
7	0.007	0.083333	9	NaN	9	16	0.011
8	0.008	0.086667	9	NaN	9	17	0.01
9	0.009	0.091667	8	NaN	8	17	0.01
10	0.009	0.096667	8	NaN	8	18	0.009
11	0.009	0.1	8	NaN	8	19	0.008
12	0.011	0.105	7	NaN	7	19	0.008
13	0.012	0.11	7	0.002	0	13	0.012
14	0.013	0.116667	6	0.004	0	14	0.013
15	0.012	0.121667	5	0.005	0	15	0.012
16	0.011	0.128333	4	0.007	0	16	0.011
17	0.01	0.136667	3	0.009	0	17	0.01
18	0.009	0.143333	2	0.009	0	18	0.009
19	0.008	0.15	1	0.011	0	19	0.008
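A note on rows 13 to 19 in the table: Col3f is 0 there because np.argmin on an all-True boolean array returns 0, so "threshold never reached" looks the same as "reached at offset 0", and Col4f just echoes Col1. The sketch below shows one way to tell the two cases apart. The helper name first_reach is hypothetical and not part of the answer's code; it assumes the same "first cumulative sum >= n" rule as the argmin expression.

```python
import numpy as np

def first_reach(values, n=1.0):
    """Offset of the first cumulative sum >= n, or -1 if n is never reached.

    Hypothetical helper for illustration only.
    """
    reached = np.cumsum(values) >= n
    # argmax on a boolean array gives the first True; guard the all-False case
    return int(np.argmax(reached)) if reached.any() else -1

tail = [0.143333, 0.15]  # last two Col2 values of the sample; their sum is below 1

print(np.argmin(np.cumsum(tail) < 1))  # 0, although the threshold is never reached
print(first_reach(tail))               # -1
print(first_reach([0.5, 0.6]))         # 1
```

With a sentinel such as -1, the lookup can return NaN for those rows instead of the row's own Col1 value.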

Timing code

dfp=pd.concat([df for i in range(1000)]).reset_index(drop=True)
print(f"rows: {len(dfp)}")

%timeit pd.Series(dfp.index).apply(lambda x: lkup(df.iloc[x:,], n=n, jit=True))
%timeit pd.Series(dfp.index).apply(lambda x: lkup(df.iloc[x:,], n=n, jit=False))
%timeit faststuff(dfp,timeit=True)

Timing results

rows: 20000
6.45 s ± 414 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.69 s ± 549 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.66 s ± 387 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 2

Update: by converting the columns to np.arrays and getting rid of df.loc[], I was able to achieve a considerable performance improvement.

My code is now:

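import numpy as np

col1 = df['Col1'].values
col2 = df['Col2'].values
k = len(df)
n = 1
# Build the result in a separate array and assign it to the frame at the end
col4 = np.full(k, np.nan)
for x in range(k):
    # offset of the first row (counting from x) where the running sum reaches n
    ind = col2[x:].cumsum().searchsorted(n)
    # x + ind == k means the threshold is never reached; stop there
    # (k rather than k-1, so that the last row can still be filled)
    if x + ind >= k: break
    col4[x + ind] = col1[x]
df['Col4'] = col4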
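
A possible further step, offered as a sketch rather than a tested replacement: if Col2 contains no negative values and no NaN, the cumulative sum over the whole column is non-decreasing, so a single np.searchsorted call can find every target row at once instead of recomputing a cumsum per row. The function name shift_by_cumsum is hypothetical. Because it subtracts prefix sums instead of summing each slice, floating-point rounding could differ from the loop when a running sum lands almost exactly on n.

```python
import numpy as np

def shift_by_cumsum(col1, col2, n=1.0):
    """Vectorised sketch; assumes col2 is non-negative and free of NaN."""
    col1 = np.asarray(col1, dtype=float)
    col2 = np.asarray(col2, dtype=float)
    k = len(col2)
    cs = np.cumsum(col2)                       # running total over the whole column
    prefix = np.concatenate(([0.0], cs[:-1]))  # total of col2 before each row
    # first row j with cs[j] - prefix[x] >= n, for every start row x at once
    target = np.searchsorted(cs, prefix + n, side="left")
    out = np.full(k, np.nan)
    valid = target < k                         # target == k: threshold never reached
    src = np.flatnonzero(valid)
    tgt = target[valid]
    # several start rows can share a target; keep the last one, as the loop does
    # (targets are non-decreasing, so "last" is the end of each run)
    last = np.ones(len(tgt), dtype=bool)
    last[:-1] = tgt[1:] != tgt[:-1]
    out[tgt[last]] = col1[src[last]]
    return out

col1 = [0.001, 0.002, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009,
        0.009, 0.009, 0.011, 0.012, 0.013, 0.012, 0.011, 0.01, 0.009, 0.008]
col2 = [0.046667, 0.051667, 0.056667, 0.061667, 0.066667, 0.073333, 0.078333,
        0.083333, 0.086667, 0.091667, 0.096667, 0.1, 0.105, 0.11, 0.116667,
        0.121667, 0.128333, 0.136667, 0.143333, 0.15]
result = shift_by_cumsum(col1, col2, n=1)
print(result)
```

On a dataframe this would be used as df['Col4'] = shift_by_cumsum(df['Col1'].values, df['Col2'].values, n). The work is one cumsum plus one binary search per row, so the cost grows roughly as k log k rather than k squared.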
