
How to replace every NaN in a column with different random values using pandas?

I have been playing with pandas lately, and I am now trying to replace the NaN values in a DataFrame with different random values drawn from a normal distribution.

Assume I have this CSV file without a header:

      0
0    343
1    483
2    101
3    NaN
4    NaN
5    NaN

My expected result should be something like this:

       0
0     343
1     483
2     101
3     randomnumber1
4     randomnumber2
5     randomnumber3

But instead I got the following:

       0
0     343
1     483
2     101
3     randomnumber1
4     randomnumber1
5     randomnumber1    # all NaN filled with same number

My code so far:

import numpy as np
import pandas as pd

df = pd.read_csv("testfile.csv", header=None)
mu, sigma = df.mean(), df.std()
norm_dist = np.random.normal(mu, sigma, 1)
for i in norm_dist:
    print(df.fillna(i))

I am thinking of counting the NaN rows in the DataFrame and replacing the 1 in np.random.normal(mu, sigma, 1) with that count, so that each NaN can get a different value.

But I want to ask whether there is a simpler way to do this.

Thank you for your help and suggestions.

Here's one way, working with the underlying array data -

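def fillNaN_with_unifrand(df):
    # Despite the name, the fill values are drawn from a normal distribution.
    # Relies on df.values being a writable view of the data, which is not
    # guaranteed in every pandas version; written for a single numeric column.
    a = df.values
    m = np.isnan(a)  # mask of NaNs
    mu, sigma = df.mean(), df.std()  # NaNs are skipped by default
    a[m] = np.random.normal(mu, sigma, size=m.sum())  # one draw per NaN
    return df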

In essence, we generate all the random numbers in one go by passing the count of NaNs as the size parameter of np.random.normal, and then assign them in one go using the same NaN mask.

Sample run -

In [435]: df
Out[435]: 
       0
0  343.0
1  483.0
2  101.0
3    NaN
4    NaN
5    NaN

In [436]: fillNaN_with_unifrand(df)
Out[436]: 
            0
0  343.000000
1  483.000000
2  101.000000
3  138.586483
4  223.454469
5  204.464514
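The same mask-and-assign idea can also be written without modifying df.values in place, which may help where the underlying array is not writable. The following is only a sketch and not part of the answer above: the helper name fill_nan_with_normal and its seed parameter are made up for illustration, and it assumes every column is numeric with at least two non-missing values.

```python
import numpy as np
import pandas as pd


def fill_nan_with_normal(df, seed=None):
    """Return a copy of df with each NaN replaced by its own normal draw.

    Hypothetical helper: each column uses its own mean and standard deviation.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()           # True where the value is missing
        n_missing = int(mask.sum())
        if n_missing == 0:
            continue
        # mean() and std() skip NaNs by default
        mu, sigma = out[col].mean(), out[col].std()
        # one independent draw per missing value
        out.loc[mask, col] = rng.normal(mu, sigma, size=n_missing)
    return out


df = pd.DataFrame({0: [343.0, 483.0, 101.0, np.nan, np.nan, np.nan]})
filled = fill_nan_with_normal(df, seed=0)
print(filled)
```

Because the function works on a copy, the original df keeps its NaNs. Passing a seed only makes the draws repeatable.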

I think you need:

mu, sigma = df[0].mean(), df[0].std()
# get mask of NaNs
a = df[0].isnull()
# draw one random value per NaN; summing the mask counts the Trues, since each True is treated as 1
norm_dist = np.random.normal(mu, sigma, a.sum())
print(norm_dist)
[ 184.90581318  364.89367364  181.46335348]
# assign values by mask
df.loc[a, 0] = norm_dist
print(df)

            0
0  343.000000
1  483.000000
2  101.000000
3  184.905813
4  364.893674
5  181.463353

It is simple to impute random values in place of missing values in a pandas DataFrame column.

mean = df['column'].mean()
std = df['column'].std()

def fill_missing_from_Gaussian(column_val):
    # Replace a missing value with a single draw from N(mean, std);
    # return any other value unchanged.
    if np.isnan(column_val):
        return np.random.normal(mean, std)
    return column_val

Now just apply the above function to a column with missing values.

df['column'] = df['column'].apply(fill_missing_from_Gaussian) 
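Since apply calls the function once per element, a vectorized variant may be preferable for larger columns. The sketch below is an alternative, not the method shown above: it draws one value per row and lets fillna pick, by index, only the draws that land on NaN positions. The sample data and the seeded generator rng are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded only to make the example repeatable

# Hypothetical sample data; 'column' mirrors the name used above.
df = pd.DataFrame({'column': [343.0, 483.0, 101.0, np.nan, np.nan, np.nan]})

mean = df['column'].mean()
std = df['column'].std()

# One independent draw per row, aligned on the DataFrame's index.
draws = pd.Series(rng.normal(mean, std, size=len(df)), index=df.index)

# fillna with a Series fills each NaN from the value at the same index label;
# non-missing values are left untouched.
df['column'] = df['column'].fillna(draws)
print(df)
```

This draws more random numbers than strictly needed (one per row rather than one per NaN), trading a little wasted work for a single vectorized call.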
