How to drop NaNs and get the same number of values in each column in Python?
I'm building a linear regression model to examine the relationship between variables from this dataset. It contained some 'XXXXXXX' values, so first I replaced them with NaNs:
df = df.replace(['XXXXXXX'], ['NaN'])
Then I examined the number of values in each column:
df.count(axis=0)
It appeared that the number of values varied from column to column:
season 200
river size 200
fluid velocity 200
chemical_1 199
chemical_2 198
chemical_3 190
chemical_4 198
chemical_5 198
chemical_6 198
chemical_7 198
chemical_8 188
algae_1 183
algae_2 183
algae_3 183
algae_4 183
algae_5 200
algae_6 200
algae_7 183
If I fill the NaNs with median values like this:
df = df.fillna(df.median(axis=0), axis=0)
each column gets 200 values and I'm able to perform further analysis.
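For reference, the median-fill step can be sketched on a small frame. The data below is a hypothetical stand-in for the real dataset, and passing numeric_only=True is an assumption added here so that a text column such as season does not break the median computation in recent pandas versions:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has 200 rows and more columns.
df = pd.DataFrame({
    "season": ["winter", "spring", "summer", "autumn"],
    "chemical_1": [1.0, np.nan, 3.0, 5.0],
    "algae_1": [0.5, 0.2, np.nan, 0.4],
})

# Median of each numeric column, skipping missing values.
medians = df.median(numeric_only=True)

# fillna with a Series fills each column with its own median.
filled = df.fillna(medians)

# Every column now has the same number of non-missing values.
filled_counts = filled.count()
```

Filling keeps all rows, whereas dropping removes every row that has at least one missing value.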
However, I want to use another approach and drop the NaNs so that each column has the same number of values. When I try df.dropna(), the count of values in each column stays different, and I'm not able to run the regression analysis.
What is the right approach to drop NaNs and keep the number of values in each column equal?
Instead of ['NaN'], use numpy.nan. The string 'NaN' is ordinary text, which pandas does not treat as a missing value:
import numpy as np
df = df.replace(['XXXXXXX'], np.nan)
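Then df.dropna() should work just fine. Note that dropna() returns a new DataFrame rather than modifying df, so assign the result back, for example df = df.dropna().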
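A minimal sketch of the difference, using a hypothetical two-column frame in place of the real dataset (the column values and the conversion with pd.to_numeric are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'XXXXXXX' marks the unknown values.
df = pd.DataFrame({
    "chemical_1": ["1.5", "XXXXXXX", "2.5", "3.0"],
    "algae_1": ["0.1", "0.2", "XXXXXXX", "0.4"],
})

# Replacing with the string 'NaN' leaves plain text in the cells,
# so dropna() finds nothing to drop.
as_text = df.replace("XXXXXXX", "NaN")
rows_after_text = len(as_text.dropna())

# Replacing with np.nan marks the cells as genuinely missing.
as_missing = df.replace("XXXXXXX", np.nan)

# dropna() returns a new frame without any row that contains a NaN.
cleaned = as_missing.dropna()

# Convert the remaining text values to numbers for the regression.
cleaned = cleaned.apply(pd.to_numeric)

# All columns now share the same count.
counts = cleaned.count()
```

Because dropna() removes whole rows by default, every column ends up with the same number of values, at the cost of discarding rows that were only partly missing.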