I'm building a linear regression model to examine the relationship between variables in this dataset. It contained some 'XXXXXXX' placeholder values, so I first replaced them with NaNs:
df = df.replace(['XXXXXXX'], ['NaN'])
Then I examined the number of values in each column:
df.count(axis=0)
It appeared that the number of values varied from column to column:
season 200
river size 200
fluid velocity 200
chemical_1 199
chemical_2 198
chemical_3 190
chemical_4 198
chemical_5 198
chemical_6 198
chemical_7 198
chemical_8 188
algae_1 183
algae_2 183
algae_3 183
algae_4 183
algae_5 200
algae_6 200
algae_7 183
If I fill the NaNs with median values like this: df = df.fillna(df.median(axis=0), axis=0)
each column ends up with 200 values and I'm able to perform further analysis.
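For reference, the median-fill approach can be sketched on a toy frame (hypothetical column names and values, not the real river data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset, with a few missing entries
df = pd.DataFrame({"chemical_1": [1.0, np.nan, 3.0],
                   "chemical_2": [2.0, 4.0, np.nan]})

# Fill each column's NaNs with that column's own median
filled = df.fillna(df.median())
print(filled.count())  # every column now has 3 values
```

After the fill, df.count() returns the same number for every column, which is what makes the subsequent regression possible.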
However, I want to try another approach and drop the NaNs instead, so that each column has the same number of values. When I try df.dropna()
, the count of values in each column stays different, and I'm not able to run the regression analysis.
What should be the right approach in order to drop NaNs and keep the number of values in each column equal?
Instead of ['NaN']
, use numpy.nan
. The string 'NaN' is just text, not a missing value, so dropna() ignores it:
import numpy as np
df = df.replace(['XXXXXXX'], np.nan)
Then df.dropna()
should work just fine.
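A minimal sketch of the difference, on toy data (hypothetical values standing in for the river dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the same 'XXXXXXX' placeholder as the question
df = pd.DataFrame({
    "chemical_1": ["1.2", "XXXXXXX", "3.4"],
    "algae_1":    ["0.5", "0.7", "XXXXXXX"],
})

# Replacing with the string 'NaN' leaves ordinary text behind,
# so dropna() has nothing to drop:
bad = df.replace(["XXXXXXX"], ["NaN"])
print(bad.dropna().shape)   # (3, 2) -- no rows removed

# Replacing with np.nan inserts real missing values,
# so dropna() removes every row that contains one:
good = df.replace(["XXXXXXX"], np.nan)
print(good.dropna().shape)  # (1, 2) -- rows with NaN dropped
```

Note that dropna() removes whole rows by default, which is why every column then ends up with the same count.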