I'm building a linear regression model to examine the relationship between variables in this dataset. It contained some 'XXXXXXX' placeholder values, so I first replaced them with NaNs:
df = df.replace(['XXXXXXX'], ['NaN'])
Then I examined the number of values in each column:
df.count(axis=0)
It appeared that the number of values varied from column to column:
season 200
river size 200
fluid velocity 200
chemical_1 199
chemical_2 198
chemical_3 190
chemical_4 198
chemical_5 198
chemical_6 198
chemical_7 198
chemical_8 188
algae_1 183
algae_2 183
algae_3 183
algae_4 183
algae_5 200
algae_6 200
algae_7 183
If I fill the NaNs with median values like this: df = df.fillna(df.median(axis=0), axis=0)
each column ends up with 200 values and I'm able to perform further analysis.
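For reference, the median-fill approach can be sketched on a toy frame (hypothetical column names and values, not the real river data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset, with a few missing entries
df = pd.DataFrame({"chemical_1": [1.0, np.nan, 3.0],
                   "chemical_2": [2.0, 4.0, np.nan]})

# Fill each column's NaNs with that column's own median
filled = df.fillna(df.median())
print(filled.count())  # every column now has 3 values
```

After the fill, df.count() returns the same number for every column, which is what makes the subsequent regression possible.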
However, I want to try another approach and drop the NaNs instead, so that each column has the same number of values. When I try df.dropna()
, the count of values in each column stays different, and I'm not able to run the regression analysis.
What should be the right approach in order to drop NaNs and keep the number of values in each column equal?
Instead of ['NaN']
, use numpy.nan
. The string 'NaN' is just text, not a missing value, so dropna() ignores it:
import numpy as np
df = df.replace(['XXXXXXX'], np.nan)
Then df.dropna()
should work just fine.
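A minimal sketch of the difference, on toy data (hypothetical values standing in for the river dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the same 'XXXXXXX' placeholder as the question
df = pd.DataFrame({
    "chemical_1": ["1.2", "XXXXXXX", "3.4"],
    "algae_1":    ["0.5", "0.7", "XXXXXXX"],
})

# Replacing with the string 'NaN' leaves ordinary text behind,
# so dropna() has nothing to drop:
bad = df.replace(["XXXXXXX"], ["NaN"])
print(bad.dropna().shape)   # (3, 2) -- no rows removed

# Replacing with np.nan inserts real missing values,
# so dropna() removes every row that contains one:
good = df.replace(["XXXXXXX"], np.nan)
print(good.dropna().shape)  # (1, 2) -- rows with NaN dropped
```

Note that dropna() removes whole rows by default, which is why every column then ends up with the same count.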