
How to drop NaNs and get the same number of values in each column in Python?

I'm building a linear regression model to examine the relationship between variables in this dataset. It contained some 'XXXXXXX' values, so I first replaced them with NaNs:

df = df.replace(['XXXXXXX'], ['NaN'])

Then I examined the number of values in each column:

df.count(axis=0)

The number of values varied from column to column:

season            200
river size        200
fluid velocity    200
chemical_1        199
chemical_2        198
chemical_3        190
chemical_4        198
chemical_5        198
chemical_6        198
chemical_7        198
chemical_8        188
algae_1           183
algae_2           183
algae_3           183
algae_4           183
algae_5           200
algae_6           200
algae_7           183

If I fill the NaNs with median values, like this: df = df.fillna(df.median(axis=0), axis=0), each column gets 200 values and I'm able to perform further analysis.

However, I want to use another approach and drop the NaNs so that each column has the same number of values. When I try df.dropna(), the count of values in each column stays different, and I'm not able to run the regression analysis.

What is the right approach to drop NaNs and keep the number of values in each column equal?

Instead of ['NaN'], use numpy.nan. The string 'NaN' is ordinary text rather than a missing value, so pandas does not treat it as missing:

import numpy as np
df = df.replace(['XXXXXXX'], np.nan)

Then df.dropna() should work just fine.
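To make the difference concrete, here is a minimal sketch on a small made-up DataFrame. The values are hypothetical stand-ins for the real dataset, and only two of the column names from the question are reused. It compares replacing the placeholder with the string 'NaN' against replacing it with np.nan, then drops the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has 200 rows and more columns
df = pd.DataFrame({
    "chemical_1": ["1.0", "XXXXXXX", "3.0", "4.0"],
    "algae_1": ["10.0", "20.0", "XXXXXXX", "40.0"],
})

# The string 'NaN' is just text: count() still counts it and dropna() keeps it
as_string = df.replace(["XXXXXXX"], ["NaN"])
string_counts = as_string.count()
rows_after_string_dropna = len(as_string.dropna())

# np.nan is a real missing value: count() skips it
as_nan = df.replace("XXXXXXX", np.nan)
nan_counts = as_nan.count()

# dropna() removes every row that has a missing value in any column,
# so all remaining columns end up with the same number of values
cleaned = as_nan.dropna()
cleaned_counts = cleaned.count()

print(string_counts, nan_counts, cleaned_counts, sep="\n\n")
```

Note that dropna() removes whole rows, so the result may have fewer rows than the smallest per-column count. If the columns were read in as text, they will likely still need converting to numbers (for example with pd.to_numeric) before fitting the regression.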
