Remove columns with missing values above a threshold in pandas

I am doing data preprocessing and want to remove features/columns that have more than, say, 10% missing values.

I have written the code below:

df_missing=df.isna()
result=df_missing.sum()/len(df)
result

Default           0.010066
Income            0.142857
Age               0.109090
Name              0.047000
Gender            0.000000
Type of job       0.200000
Amt of credit     0.850090
Years employed    0.009003
dtype: float64

I want df to keep only the columns in which no more than 10% of the values are missing.

Expected output:

df

Default   Name   Gender   Years employed

(columns where there were missing values greater than 10% are removed.)

I have tried

result.iloc[:,0] 
IndexingError: Too many indexers

Please help

Because dividing the sum by the length gives the mean, you can use df_missing.mean() instead of df_missing.sum()/len(df):

result = df.isna().mean()
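As a quick check of that equivalence, here is a minimal sketch on a made-up two-column frame (the data is hypothetical, not the asker's). The mean of a boolean column is the fraction of True values, so both expressions should give the same Series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
df = pd.DataFrame({
    "Income": [50000.0, np.nan, 42000.0, np.nan],  # 2 of 4 values missing
    "Gender": ["F", "M", "M", "F"],                # nothing missing
})

by_sum = df.isna().sum() / len(df)  # the approach from the question
by_mean = df.isna().mean()          # mean of booleans = fraction of True

# Both are Series indexed by column name and hold the same fractions
assert np.allclose(by_sum, by_mean)
print(by_mean)
```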

Then filter with DataFrame.loc, using : to select all rows and a boolean mask to select the columns:

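df = df.loc[:, result <= .1]

Comment: the answer originally used result > .1, which keeps exactly the columns that should be dropped. The condition should be result < .1 (or <= .1, to also keep a column at exactly 10%), because the user only wants to keep columns with less than 10% of their rows missing.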
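Putting the steps together, the following is a minimal sketch rather than a definitive solution. The frame, its values, and the names threshold, missing_frac, and kept are made up for illustration; only the column names are borrowed from the question. It uses <= so that a column at exactly the threshold would be kept, which matches "remove columns with more than 10% missing":

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 10 rows, column names borrowed from the question
df = pd.DataFrame({
    "Default": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],                            # 0% missing
    "Income": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 20% missing
    "Gender": ["F", "M", "M", "F", "M", "F", "M", "F", "M", "F"],         # 0% missing
    "Amt of credit": [np.nan] * 9 + [100.0],                              # 90% missing
})

threshold = 0.1                  # assumed cutoff: drop columns above 10% missing
missing_frac = df.isna().mean()  # fraction of missing values per column

# missing_frac is a Series (one value per column), so it accepts a single
# indexer. result.iloc[:, 0] in the question passed two, hence
# "IndexingError: Too many indexers".
kept = df.loc[:, missing_frac <= threshold]
print(list(kept.columns))
```

On this toy frame, Income and Amt of credit are dropped and the other two columns are kept. Assigning the result to a new name such as kept leaves the original df available for comparison.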
