Apply a function to DataFrame rows in pandas based on the value in a specific column

Question

Suppose I have a pandas DataFrame whose first column is a threshold:

threshold,value1,value2,value3,...,valueN
5,12,3,4,...,20
4,1,7,8,...,3
7,5,2,8,...,10

For each row, I want to set the elements in columns value1..valueN to zero if they are less than threshold:

threshold,value1,value2,value3,...,valueN
5,12,0,0,...,20
4,0,7,8,...,0
7,0,0,8,...,10

How can I do this without an explicit for loop?

Answer 1

You can try it this way:

df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x >= df.threshold, x, 0), axis=0)
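
A minimal, self-contained sketch of this column-wise approach, run on the sample rows from the question. The elided "..." columns are left out, and the names values, result and out are only illustrative. The comparison here is >=, on the assumption that a value equal to its threshold should be kept, since only values less than threshold are meant to be zeroed.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; the elided "..." columns are omitted.
df = pd.DataFrame({
    "threshold": [5, 4, 7],
    "value1": [12, 1, 5],
    "value2": [3, 7, 2],
    "value3": [4, 8, 8],
    "valueN": [20, 3, 10],
})

# Every column except the first one.
values = df.iloc[:, 1:]

# apply(..., axis=0) passes one column at a time as x; each column is
# compared element-wise with the threshold column, row by row.
result = values.apply(lambda x: np.where(x >= df["threshold"], x, 0), axis=0)

# Put the untouched threshold column back in front.
out = pd.concat([df[["threshold"]], result], axis=1)
print(out)
```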

Answer 2

Compare with DataFrame.lt and replace the matches with mask:

df = df.mask(df.lt(df['threshold'], axis=0), 0)

Or use set_index and reset_index:

df = df.set_index('threshold')
df = df.mask(df.lt(df.index, axis=0), 0).reset_index()
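
To make the lt plus mask combination easier to follow, here is a small sketch on the sample rows from the question (elided columns omitted; the names below and out are illustrative). lt(..., axis=0) lines the threshold Series up with the rows, so each row is compared against its own threshold, and mask replaces every cell where the condition is True. The threshold column is compared with itself, is never strictly less, and so is left unchanged.

```python
import pandas as pd

# Sample rows from the question; the elided "..." columns are omitted.
df = pd.DataFrame({
    "threshold": [5, 4, 7],
    "value1": [12, 1, 5],
    "value2": [3, 7, 2],
    "value3": [4, 8, 8],
    "valueN": [20, 3, 10],
})

# Boolean DataFrame: True where a cell is below the threshold of its row.
below = df.lt(df["threshold"], axis=0)

# mask() replaces the cells where the condition is True with 0.
out = df.mask(below, 0)
print(below)
print(out)
```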

For better performance, a numpy solution:

arr = df.values
df = pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)

print (df)
   threshold  value1  value2  value3  valueN
0          5      12       0       0      20
1          4       0       7       8       0
2          7       0       0       8      10
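
The numpy version relies on broadcasting: arr[:, 0][:, None] turns the first column into an (N, 1) array, which is then stretched across all columns during the comparison. A small sketch on the sample rows from the question (elided columns omitted; the names thresholds, below and out are illustrative):

```python
import numpy as np

# Sample rows from the question as a plain array; column 0 is the threshold.
arr = np.array([
    [5, 12, 3, 4, 20],
    [4, 1, 7, 8, 3],
    [7, 5, 2, 8, 10],
])

# First column reshaped from (3,) to (3, 1) so it broadcasts across columns.
thresholds = arr[:, 0][:, None]

# (3, 5) < (3, 1) broadcasts to a (3, 5) boolean array.
below = arr < thresholds

# Zero where below the row's threshold, original value otherwise.
out = np.where(below, 0, arr)
print(thresholds.shape, below.shape)
print(out)
```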

Timings:

In [294]: %timeit set_reset_sol(df)
1 loop, best of 3: 376 ms per loop

In [295]: %timeit numpy_sol(df)
10 loops, best of 3: 59.9 ms per loop

In [296]: %timeit df.mask(df.lt(df['threshold'], axis=0), 0)
1 loop, best of 3: 380 ms per loop

In [297]: %timeit df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x > df.threshold, x, 0), axis=0)
1 loop, best of 3: 449 ms per loop


np.random.seed(234)
N = 100000

#[100000 rows x 100 columns] 
df = pd.DataFrame(np.random.randint(100, size=(N, 100)))
df.columns = ['threshold'] + df.columns[1:].tolist()
print (df)

def set_reset_sol(df):
    df = df.set_index('threshold')
    return df.mask(df.lt(df.index, axis=0), 0).reset_index()

def numpy_sol(df):
    arr = df.values
    return pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)
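
Before comparing speed, it may be worth checking that the timed solutions actually return the same values. The sketch below repeats the two functions above and runs them on a much smaller random frame; the smaller shape, the seed reuse and the names a, b, c and same_values are assumptions made for illustration and are not part of the original benchmark.

```python
import numpy as np
import pandas as pd

def set_reset_sol(df):
    df = df.set_index('threshold')
    return df.mask(df.lt(df.index, axis=0), 0).reset_index()

def numpy_sol(df):
    arr = df.values
    return pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)

# Same construction as the benchmark data, but only 1000 rows x 10 columns.
np.random.seed(234)
df = pd.DataFrame(np.random.randint(100, size=(1000, 10)))
df.columns = ['threshold'] + df.columns[1:].tolist()

a = numpy_sol(df)
b = set_reset_sol(df)
c = df.mask(df.lt(df['threshold'], axis=0), 0)

# Compare the raw values and the column labels of all three results.
same_values = (
    np.array_equal(a.to_numpy(), b.to_numpy())
    and np.array_equal(a.to_numpy(), c.to_numpy())
)
same_columns = list(a.columns) == list(b.columns) == list(c.columns)
print(same_values, same_columns)
```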
