DataFrame: speeding up a for loop that sets column values

Question

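I have a pandas DataFrame (import pandas as pd).

I want to increment a counter in "C3" by 1 at each rising edge (a rising edge is a row where C1 = 1 and C2 = 0). I tried using iterrows():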

count = 1
df['C3'] = 0
for index, row in df.iterrows():
    if (row.C1 == 1) and (row.C2 == 0):
        count += 1
        df.at[index, 'C3'] = count
    else:
        df.at[index, 'C3'] = count

print(df)
         C1  C2  C3
    0    0   0   1
    1    0   0   1
    2    1   0   2
    3    1   1   2
    4    1   1   2
    5    0   1   2
    6    0   0   2
    7    0   0   2
    8    0   0   2
    9    1   0   3
    10   1   1   3
    11   1   1   3
    12   0   1   3
    13   0   0   3

For a DataFrame with 300,000 rows this is rather slow. Is there a simple way to make it faster?

Thank you very much for your help!

Answer 1

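You can:

Create a Series counts that is a boolean mask of the condition you want;
Add C3 to the original df with the value 1 + counts.cumsum().

Note: pandas aligns a Series with a DataFrame by index label, not by row order. Any intermediate operation that changes the index of df or of the counts Series will produce unexpected results.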
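To make the alignment note concrete, here is a minimal sketch. The four-row frame and the reversal step are made up for illustration and are not part of the original answer. It shows that assigning a Series goes by index label, while assigning a plain array goes by position.

```python
import pandas as pd

# Small illustrative frame (not the data from the question).
df = pd.DataFrame({"C1": [0, 1, 1, 0], "C2": [0, 0, 1, 1]})
counts = (df.C1 == 1) & (df.C2 == 0)

# Hypothetical intermediate step: reverse the rows of counts
# while keeping the original index labels attached.
reordered = counts.iloc[::-1]

# Assigning a Series aligns on index labels, so each value
# still lands on the row it was computed from.
df["by_label"] = reordered

# Assigning a bare NumPy array ignores labels and goes by position,
# so the values end up on the wrong rows.
df["by_position"] = reordered.to_numpy()

print(df)
```

If an intermediate step also changes the labels (for example by resetting the index), label-based assignment no longer matches the intended rows, which is the situation the note warns about.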

Code:

counts = (df.C1 == 1) & (df.C2 == 0)
df["C3"] = 1 + counts.cumsum()

Result:

    C1  C2  C3
0    0   0   1
1    0   0   1
2    1   0   2
3    1   1   2
4    1   1   2
5    0   1   2
6    0   0   2
7    0   0   2
8    0   0   2
9    1   0   3
10   1   1   3
11   1   1   3
12   0   1   3
13   0   0   3
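For readers who want to reproduce the result, the following self-contained sketch rebuilds the 14-row sample from the question and checks the cumsum approach against a plain Python loop. The variable name expected and the use of zip over the two columns are choices made for this sketch only.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "C1": [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    "C2": [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
})

# Reference result: the counting logic of the original loop,
# written over plain Python values.
expected = []
count = 1
for c1, c2 in zip(df["C1"], df["C2"]):
    if c1 == 1 and c2 == 0:
        count += 1
    expected.append(count)

# Vectorized version: boolean mask, then running total.
counts = (df.C1 == 1) & (df.C2 == 0)
df["C3"] = 1 + counts.cumsum()

print(df)
print(df["C3"].tolist() == expected)
```

The cumulative sum treats True as 1 and False as 0, so each row's value is 1 plus the number of rising edges seen up to and including that row.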

Performance

Let's compare the performance of three options: iterrows, df.apply, and the vectorized solution above:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(C1=np.random.choice(2,size=100000), C2=np.random.choice(2,size=100000)))

df1 = df.copy(deep=True)
df2 = df.copy(deep=True)
df3 = df.copy(deep=True)

def use_iterrows():
    count = 1
    df1['C3'] = 0
    for index, row in df1.iterrows():
        if (row.C1 == 1) and (row.C2 == 0):
            count += 1
            df1.at[index, 'C3'] = count
        else:
            df1.at[index, 'C3'] = count

def use_apply():
    df2['C3'] = df2.apply(lambda x: x['C1']==1 and x['C2']==0, axis=1).cumsum()+1

def use_vectorized():
    counts = (df3.C1 == 1) & (df3.C2 == 0)
    df3["C3"] = 1 + counts.cumsum()

%timeit use_iterrows()
# 8.23 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    

%timeit use_apply()
# 1.54 s ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit use_vectorized()
# 1.28 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

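Summary: the vectorized approach is by far the fastest. For a 100k-row df it is roughly 1,000 times faster than df.apply and several thousand times faster than iterrows, going by the timings above. I like to stay in the habit of using vectorized solutions whenever possible. The advantage of df.apply is that it is very flexible and works in cases where a vectorized operation is hard to write. I don't think I have ever needed iterrows.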
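%timeit is an IPython magic, so the benchmark above only runs in IPython or Jupyter. As a rough sketch, the same comparison can be made in a plain Python script with the standard-library timeit module. The row count N_ROWS, the fixed seed, and the choice to pass the frame as an argument are assumptions of this sketch, and the absolute timings will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: fewer rows than the 100k used above, so the script finishes quickly.
N_ROWS = 10_000

rng = np.random.default_rng(0)
base = pd.DataFrame({
    "C1": rng.integers(0, 2, size=N_ROWS),
    "C2": rng.integers(0, 2, size=N_ROWS),
})

def use_apply(df):
    # Row-wise: the lambda is called once per row.
    return df.apply(lambda x: x["C1"] == 1 and x["C2"] == 0, axis=1).cumsum() + 1

def use_vectorized(df):
    # Column-wise: one boolean mask over the whole frame, then a running total.
    counts = (df.C1 == 1) & (df.C2 == 0)
    return 1 + counts.cumsum()

REPEATS = 3
t_apply = timeit.timeit(lambda: use_apply(base), number=REPEATS) / REPEATS
t_vec = timeit.timeit(lambda: use_vectorized(base), number=REPEATS) / REPEATS

print(f"apply:      {t_apply:.6f} s per call")
print(f"vectorized: {t_vec:.6f} s per call")
```

Both functions return the new column instead of writing into a global frame, which keeps repeated timing runs independent of each other.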

Answer 2

Short answer:

df['C3'] = df.apply(lambda x: x['C1']==1 and x['C2']==0, axis=1).cumsum()+1

Desired result:

   C1   C2  C3
0   0   0   1
1   0   0   1
2   1   0   2
3   1   1   2
4   1   1   2
5   0   1   2
6   0   0   2
7   0   0   2
8   0   0   2
9   1   0   3
10  1   1   3
11  1   1   3
12  0   1   3
13  0   0   3

What you need to remember:

Don't use iterrows on long data. It makes the code significantly slower.

apply is a better alternative to iterrows and is much more efficient.

Answer 1
2 accepted 2020-10-12 22:24:51

Answer 2
0 2020-10-12 22:37:50