How to replace each NaN in a column with a different random value using pandas?

Question

I have been experimenting with pandas lately, and I am trying to replace the NaN values in a DataFrame with different random values drawn from a normal distribution.

Assuming I have this CSV file without a header

My expected result should be something like this:

       0
0     343
1     483
2     101
3     randomnumber1
4     randomnumber2
5     randomnumber3

But instead I got the following:

       0
0     343
1     483
2     101
3     randomnumber1
4     randomnumber1
5     randomnumber1    # all NaNs filled with the same number

My code so far:

import numpy as np
import pandas as pd

df = pd.read_csv("testfile.csv", header=None)
mu, sigma = df.mean(), df.std()
norm_dist = np.random.normal(mu, sigma, 1)
for i in norm_dist:
    print(df.fillna(i))

I am thinking of counting the NaN rows in the DataFrame and replacing the 1 in np.random.normal(mu, sigma, 1) with that count, so that each NaN can get a different value.

But I want to ask: is there another simple method to do this?

Thank you for your help and suggestions.

Answer 1

Here's one way, working with the underlying array data -

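def fillNaN_with_unifrand(df):
    # Note: despite the name, the draws are normal, not uniform
    a = df.values  # underlying array; assumed to be a writable view of df
    m = np.isnan(a) # mask of NaNs
    mu, sigma = df.mean(), df.std()
    a[m] = np.random.normal(mu, sigma, size=m.sum())  # one draw per NaN
    return df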

In essence, we generate all the random numbers in one go, passing the count of NaNs as the size param of np.random.normal, and then assign them in one go using the mask of NaNs.

Sample run -

In [435]: df
Out[435]: 
       0
0  343.0
1  483.0
2  101.0
3    NaN
4    NaN
5    NaN

In [436]: fillNaN_with_unifrand(df)
Out[436]: 
            0
0  343.000000
1  483.000000
2  101.000000
3  138.586483
4  223.454469
5  204.464514
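
A minimal sketch of the same mask-and-assign idea that does not rely on df.values being a writable view: it works on a copy and returns a new DataFrame. The function name fill_nan_with_normal, the seed parameter and the use of np.random.default_rng are assumptions for illustration, not part of the answer above. It uses a single mean and standard deviation for the whole frame, which matches the single-column case in the question.

```python
import numpy as np
import pandas as pd


def fill_nan_with_normal(df, seed=None):
    """Return a copy of df with each NaN replaced by its own normal draw."""
    rng = np.random.default_rng(seed)            # hypothetical seeding choice
    a = df.to_numpy(dtype=float, copy=True)      # writable copy of the data
    m = np.isnan(a)                              # boolean mask of NaNs
    mu = np.nanmean(a)                           # mean of the non-NaN values
    sigma = np.nanstd(a, ddof=1)                 # sample std (ddof=1, as pandas uses)
    a[m] = rng.normal(mu, sigma, size=m.sum())   # one independent draw per NaN
    return pd.DataFrame(a, index=df.index, columns=df.columns)


df = pd.DataFrame({0: [343.0, 483.0, 101.0, np.nan, np.nan, np.nan]})
filled = fill_nan_with_normal(df, seed=0)
print(filled)
```

Because the function copies the data first, the original df keeps its NaNs, which can be handy when comparing several imputation runs.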

Answer 2

I think you need:

mu, sigma = df[0].mean(), df[0].std()
# get mask of NaNs
a = df[0].isnull()
# draw one random value per NaN; a.sum() counts the Trues, since each True counts as 1
norm_dist = np.random.normal(mu, sigma, a.sum())
print(norm_dist)
[ 184.90581318  364.89367364  181.46335348]
# assign values by mask
df.loc[a, 0] = norm_dist
print(df)

            0
0  343.000000
1  483.000000
2  101.000000
3  184.905813
4  364.893674
5  181.463353
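
If the frame has several numeric columns, the same mask-based assignment can be repeated per column so that each column uses its own mean and standard deviation. This is only a sketch under that assumption; the helper name fill_columns_with_normal, the column names a and b, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_columns_with_normal(df, rng):
    """Fill NaNs column by column, using each column's own mean and std."""
    out = df.copy()                               # leave the input untouched
    for col in out.columns:
        mask = out[col].isna()                    # NaN mask for this column
        n_missing = int(mask.sum())
        if n_missing == 0:
            continue                              # nothing to fill here
        mu, sigma = out[col].mean(), out[col].std()
        out.loc[mask, col] = rng.normal(mu, sigma, n_missing)
    return out


rng = np.random.default_rng(42)                   # hypothetical seed
df2 = pd.DataFrame({
    "a": [343.0, 483.0, 101.0, np.nan, np.nan],
    "b": [1.0, np.nan, 3.0, 5.0, np.nan],
})
filled2 = fill_columns_with_normal(df2, rng)
print(filled2)
```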

Answer 3

Imputing random values in place of missing values in a pandas DataFrame column is simple.

mean = df['column'].mean()
std = df['column'].std()

def fill_missing_from_Gaussian(column_val):
    # Draw a fresh normal sample for a missing value; keep other values unchanged
    if np.isnan(column_val):
        return np.random.normal(mean, std)  # scalar draw, not a length-1 array
    return column_val

Now just apply the above function to a column with missing values.

df['column'] = df['column'].apply(fill_missing_from_Gaussian)
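
For long columns, calling a Python function once per element can be slow. One possible vectorized alternative, sketched here with a hypothetical Series s and seed, is to draw one candidate value per row and let fillna pick only the ones it needs. It relies on Series.fillna accepting another Series and aligning on the index.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)                    # hypothetical seed
s = pd.Series([343.0, 483.0, 101.0, np.nan, np.nan, np.nan], name="column")

# One candidate draw per row, indexed the same way as s.
draws = pd.Series(rng.normal(s.mean(), s.std(), size=len(s)), index=s.index)

# fillna aligns on the index and uses a draw only where s is NaN.
filled_s = s.fillna(draws)
print(filled_s)
```

This draws more values than strictly necessary (one per row rather than one per NaN), trading a little wasted work for a single vectorized call.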


Answer 1
5 accepted 2017-10-03 11:11:24

Answer 2
1 2017-10-03 11:09:45

Answer 3
1 2018-03-05 17:50:20
