I am preprocessing data and want to remove features/columns that have more than, say, 10% missing values.
I wrote the code below:
df_missing=df.isna()
result=df_missing.sum()/len(df)
result
Default 0.010066
Income 0.142857
Age 0.109090
Name 0.047000
Gender 0.000000
Type of job 0.200000
Amt of credit 0.850090
Years employed 0.009003
dtype: float64
I want df to keep only the columns in which no more than 10% of the values are missing.
Expected output:
df
Default Name Gender Years employed
(columns where there were missing values greater than 10% are removed.)
I have tried
result.iloc[:,0]
IndexingError: Too many indexers
Please help
Because dividing the sum by the length gives the mean, you can use df_missing.mean() instead of df_missing.sum()/len(df):
result = df.isna().mean()
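To check that the two expressions agree, here is a minimal sketch on a small made-up frame. The column values are illustrative assumptions, not the asker's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the asker's DataFrame
df = pd.DataFrame({
    "Default": [0, 1, 0, 1],                  # no missing values
    "Income": [50.0, np.nan, np.nan, 40.0],   # 2 of 4 missing
    "Gender": ["F", "M", None, "F"],          # 1 of 4 missing
})

# Share of missing values per column, computed both ways
by_sum = df.isna().sum() / len(df)
by_mean = df.isna().mean()

print(by_mean)
```

Both by_sum and by_mean are Series indexed by column name, so either one can be used as the mask source in the next step.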
Then filter with DataFrame.loc, using : to select all rows and a boolean mask to select the columns:
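df = df.loc[:, result < .1]
Note that the comparison must be result < .1, not result > .1, because the asker wants to keep only the columns in which fewer than 10% of the rows are missing.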
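Putting the steps together, a minimal end-to-end sketch might look like this. The frame below is hypothetical toy data that reuses a few of the asker's column names, and the 10% threshold is taken from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 rows with different shares of missing values
df = pd.DataFrame({
    "Default": range(10),                     # 0% missing
    "Income": [np.nan] * 2 + [1.0] * 8,       # 20% missing
    "Name": ["a"] * 10,                       # 0% missing
    "Amt of credit": [np.nan] * 9 + [5.0],    # 90% missing
})

# Series mapping each column name to its share of missing values
result = df.isna().mean()

# ":" keeps all rows; the boolean mask keeps columns with < 10% missing
kept = df.loc[:, result < .1]

print(list(kept.columns))  # ['Default', 'Name']
```

Since result is a one-dimensional Series, it takes a single indexer (for example result.iloc[0]), which is why result.iloc[:,0] raised IndexingError: Too many indexers.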