How to delete a column in pandas dataframe based on a condition?
I have a pandas DataFrame with many NaN values.
How can I drop the columns where number_of_na_values > 2000?
I tried this:
toRemove = set()
naNumbersPerColumn = df.isnull().sum()
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
for i in toRemove:
    df.drop(i, axis=1, inplace=True)
Is there a more elegant way to do this?
Here is another option: keep only the columns whose NaN count is less than or equal to a specified number:
max_number_of_nas = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nas)]
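For reuse, the same boolean-mask selection can be wrapped in a small helper. This is only a sketch: the function name `drop_sparse_columns` and the three-column example frame are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_nans: int) -> pd.DataFrame:
    """Return a new DataFrame keeping only columns with at most max_nans NaN values."""
    nan_counts = df.isnull().sum(axis=0)  # one NaN count per column
    return df.loc[:, nan_counts <= max_nans]


# Small illustrative frame: column "a" has 0 NaN, "b" has 1, "c" has 3.
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, np.nan],
})
kept = drop_sparse_columns(example, max_nans=1)
print(list(kept.columns))  # ['a', 'b']
```

The original frame is left unchanged, which avoids the repeated `inplace=True` drops from the question.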
In the cases I tested, this was slightly faster than the drop-columns method suggested by Jianxun Li (shown below). However, I should note that performance becomes much more similar if you avoid the apply method altogether, e.g. df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1). Just a reminder: when it comes to performance in pandas, vectorization almost always beats apply.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1.1 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 1.3 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 2.11 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance often varies with data size, so don't forget to check the case closest to your own data.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 755 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 777 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1.71 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
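The `%timeit` lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The wrapper names `with_loc`, `with_drop`, and `with_apply` are made up for this example, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000, 5), columns=list("ABCDE"))
df[df < 0] = np.nan
max_number_of_nans = 5010


def with_loc():
    # Keep columns whose NaN count is within the limit.
    return df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]


def with_drop():
    # Drop columns whose NaN count exceeds the limit (vectorized count).
    return df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)


def with_apply():
    # Same as with_drop, but the count is computed per column through apply.
    mask = df.apply(lambda col: col.isnull().sum() > max_number_of_nans)
    return df.drop(df.columns[mask], axis=1)


# All three variants should select the same columns.
assert with_loc().equals(with_drop())
assert with_loc().equals(with_apply())

timings = {}
for func in (with_loc, with_drop, with_apply):
    # Best of 3 repeats, 50 calls each, reported as seconds per call.
    best = min(timeit.repeat(func, number=50, repeat=3)) / 50
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```

Checking that the variants return equal frames before timing them guards against comparing the speed of code that does different things.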
The same logic, but with everything put on one line.
import pandas as pd
import numpy as np
# artificial data
# ====================================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDE'))
df[df < 0] = np.nan
        A       B       C       D       E
0  1.7641  0.4002  0.9787  2.2409  1.8676
1     NaN  0.9501     NaN     NaN  0.4106
2  0.1440  1.4543  0.7610  0.1217  0.4439
3  0.3337  1.4941     NaN  0.3131     NaN
4     NaN  0.6536  0.8644     NaN  2.2698
5     NaN  0.0458     NaN  1.5328  1.4694
6  0.1549  0.3782     NaN     NaN     NaN
7  0.1563  1.2303  1.2024     NaN     NaN
8     NaN     NaN     NaN  1.9508     NaN
9     NaN     NaN  0.7775     NaN     NaN
# processing: drop columns with no. of NaN > 3
# ====================================
df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > 3)], axis=1)
Out[183]:
        B
0  0.4002
1  0.9501
2  1.4543
3  1.4941
4  0.6536
5  0.0458
6  0.3782
7  1.2303
8     NaN
9     NaN
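As a side note, neither answer mentions it, but pandas also offers `DataFrame.dropna` with a `thresh` argument, which may express the same filter more directly. The sketch below is an alternative technique, not the method from the answers: `thresh` is the minimum number of non-NaN values a column needs to be kept, so the NaN limit has to be converted first. The variable name `max_nans` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list("ABCDE"))
df[df < 0] = np.nan

max_nans = 3
# thresh counts required non-NaN values, so convert the NaN limit accordingly.
result = df.dropna(axis=1, thresh=len(df) - max_nans)

# Cross-check against the boolean-mask approach from the other answer.
expected = df.loc[:, df.isnull().sum(axis=0) <= max_nans]
print(result.equals(expected))
print(list(result.columns))
```

On this data only column B survives, matching the output shown above.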