Apply function to dataframe row in pandas based on value in specific column
Suppose I have a pandas DataFrame whose first column is a threshold:
threshold,value1,value2,value3,...,valueN
5,12,3,4,...,20
4,1,7,8,...,3
7,5,2,8,...,10
For each row, I want to set the elements in columns value1..valueN to zero if they are less than threshold:
threshold,value1,value2,value3,...,valueN
5,12,0,0,...,20
4,0,7,8,...,0
7,0,0,8,...,10
How can I do this without an explicit for loop?
You can try it this way:
# >= keeps values equal to the threshold; only values strictly below it become 0
df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x >= df.threshold, x, 0), axis=0)
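As a quick check against the sample rows in the question, the same row-wise comparison can also be sketched with DataFrame.ge and DataFrame.where. This is only an illustration: the single valueN column stands in for the elided columns, and the names values, keep, and result are introduced here, not taken from the answer.

```python
import pandas as pd

# Sample frame from the question; "valueN" stands in for the elided columns.
df = pd.DataFrame({
    "threshold": [5, 4, 7],
    "value1": [12, 1, 5],
    "value2": [3, 7, 2],
    "value3": [4, 8, 8],
    "valueN": [20, 3, 10],
})

values = df.iloc[:, 1:]
# axis=0 aligns the threshold Series with the row index, so every row is
# compared against its own threshold.
keep = values.ge(df["threshold"], axis=0)

result = df.copy()
# Keep entries where the condition holds, replace the rest with 0.
result[values.columns] = values.where(keep, 0)
print(result)
```

Using ge rather than gt means a value exactly equal to its row's threshold is kept, which matches "set to zero if less than threshold".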
Use DataFrame.lt with mask for the comparison:
df = df.mask(df.lt(df['threshold'], axis=0), 0)
Alternatively, move threshold into the index first:
df = df.set_index('threshold')
df = df.mask(df.lt(df.index, axis=0), 0).reset_index()
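To see what the mask looks like before it is applied, here is a minimal sketch on the question's sample rows. The valueN column again stands in for the elided columns, and below and masked are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "threshold": [5, 4, 7],
    "value1": [12, 1, 5],
    "value2": [3, 7, 2],
    "value3": [4, 8, 8],
    "valueN": [20, 3, 10],
})

# Boolean frame: True where a cell is strictly below its row's threshold.
# The threshold column is compared with itself, so it is False everywhere
# and is left untouched by mask.
below = df.lt(df["threshold"], axis=0)

# mask replaces the cells where the condition is True.
masked = df.mask(below, 0)
print(below)
print(masked)
```

Without axis=0, the Series would be aligned against the column labels instead of the row index, which is not the comparison wanted here.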
For better performance, a NumPy solution:
arr = df.values
df = pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)
print(df)
   threshold  value1  value2  value3  valueN
0          5      12       0       0      20
1          4       0       7       8       0
2          7       0       0       8      10
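The NumPy version relies on broadcasting: arr[:, 0][:, None] turns the threshold column into an (n, 1) array, which is then stretched across all columns in the comparison. A small sketch of the shapes involved, using the sample rows from the question (the names thresholds, mask, and out are chosen here for illustration):

```python
import numpy as np

arr = np.array([
    [5, 12, 3, 4, 20],
    [4, 1, 7, 8, 3],
    [7, 5, 2, 8, 10],
])

# Column 0 as a column vector: shape (3,) becomes (3, 1).
thresholds = arr[:, 0][:, None]

# (3, 5) < (3, 1) broadcasts the threshold across each row.
mask = arr < thresholds

# Replace cells below the row threshold with 0, keep the rest.
out = np.where(mask, 0, arr)
print(thresholds.shape, mask.shape)
print(out)
```

Because the threshold column is never less than itself, it passes through unchanged, so there is no need to slice it off first.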
Timings:
In [294]: %timeit set_reset_sol(df)
1 loop, best of 3: 376 ms per loop
In [295]: %timeit numpy_sol(df)
10 loops, best of 3: 59.9 ms per loop
In [296]: %timeit df.mask(df.lt(df['threshold'], axis=0), 0)
1 loop, best of 3: 380 ms per loop
In [297]: %timeit df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x > df.threshold, x, 0), axis=0)
1 loop, best of 3: 449 ms per loop
Setup used for the timings above:
import numpy as np
import pandas as pd

np.random.seed(234)
N = 100000
# [100000 rows x 100 columns]
df = pd.DataFrame(np.random.randint(100, size=(N, 100)))
# Rename the first column to 'threshold'; the others keep their integer labels
df.columns = ['threshold'] + df.columns[1:].tolist()
print(df)

def set_reset_sol(df):
    df = df.set_index('threshold')
    return df.mask(df.lt(df.index, axis=0), 0).reset_index()

def numpy_sol(df):
    arr = df.values
    return pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)
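Before comparing speed, it may be worth confirming that the approaches return the same values. The following is a sketch on a smaller random frame, not part of the original benchmark: the frame size, the column names, and the helper name mask_sol are assumptions made for this example, while numpy_sol repeats the logic shown above.

```python
import numpy as np
import pandas as pd

# Smaller frame than the benchmark so the check runs quickly.
rng = np.random.default_rng(234)
small = pd.DataFrame(
    rng.integers(0, 100, size=(1000, 10)),
    columns=["threshold"] + [f"value{i}" for i in range(1, 10)],
)

def mask_sol(df):
    # pandas version: compare each row with its own threshold, then mask.
    return df.mask(df.lt(df["threshold"], axis=0), 0)

def numpy_sol(df):
    # NumPy version: broadcast the first column across the row.
    arr = df.values
    return pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)

a = mask_sol(small)
b = numpy_sol(small)

# Both approaches should produce identical values.
same = np.array_equal(a.to_numpy(), b.to_numpy())

# Every value cell is now either 0 or at least its row's threshold.
vals = b.iloc[:, 1:].to_numpy()
thr = b["threshold"].to_numpy()[:, None]
valid = bool(((vals == 0) | (vals >= thr)).all())
print(same, valid)
```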