
Python Pandas Dataframe fill NaN values

I am trying to fill NaN values in a dataframe with values drawn from a standard normal distribution. This is currently my code:

 sqlStatement = "select * from sn.clustering_normalized_dataset"
 df = psql.frame_query(sqlStatement, cnx)
 data=df.pivot("user","phrase","tfw")
 dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
 data[np.isnan(data)] = dfrand[np.isnan(data)]

After pivoting, the dataframe 'data' looks like this:

phrase      aaron  abbas  abdul       abe  able  abroad       abu     abuse  \
user                                                                          
14233664      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
52602716      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
123456789     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
500158258     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
517187571     0.4    NaN    NaN  0.142857     1     0.4  0.181818       NaN  

However, I need each NaN value to be replaced with a new random value. So I created a new df consisting only of random values (dfrand) and then tried to replace the missing numbers (NaN) with the values from dfrand at the corresponding indices of the NaNs. Unfortunately it doesn't work. Although the expression

 np.isnan(data)

returns a dataframe consisting of True and False values, the expression

  dfrand[np.isnan(data)]

returns only NaN values, so the overall trick doesn't work. Any ideas what the issue is?

Three thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not).

If you know the size of your dataframe:

import pandas as pd
import numpy as np

# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(rows,cols))

# import "real" dataframe
data = pd.read_csv(...)  # placeholder: or however you choose to read it in

# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]

If you do not know the size of your dataframe, just shuffle things around:

import pandas as pd
import numpy as np



# import "real" dataframe
data = pd.read_csv(...)  # placeholder: or however you choose to read it in

# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))

# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
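
A caveat on the snippet above: pandas aligns boolean-mask indexing on row and column labels, so the replacement only lines up when dfrand and data carry the same labels. A pivoted frame is labeled by user and phrase, whereas a frame built from a bare array gets default integer labels. The following is a minimal sketch under that reading, not a definitive fix; the small frame, its user ids and phrases, and the seed are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pivoted frame: rows labeled by user, columns by phrase.
data = pd.DataFrame(
    [[np.nan, np.nan, 0.4], [0.5, np.nan, np.nan]],
    index=pd.Index([14233664, 517187571], name="user"),
    columns=pd.Index(["aaron", "abbas", "abe"], name="phrase"),
)
mask = data.isna()

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# A random frame with default integer labels shares no labels with `data`,
# so selecting from it with the mask yields NaN in every cell.
unlabeled = pd.DataFrame(rng.standard_normal(data.shape))
misaligned = unlabeled[mask]

# Reusing the labels of `data` makes the mask line up cell by cell.
dfrand = pd.DataFrame(
    rng.standard_normal(data.shape), index=data.index, columns=data.columns
)
data[mask] = dfrand[mask]
```

With matching labels, the existing values are left untouched and every NaN receives its own draw from dfrand.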

EDIT: Per the user's last comment: "dfrand[np.isnan(data)] returns NaN only."

Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN location within "data" and insert it in "data" where "data" is NaN. An example will help:

a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a[0][5] = np.nan

In [32]: a
Out[33]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5 NaN  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))

In [39]: b
Out[39]: 
    0   1   2
0  92  21  55
1  65  53  89
2  54  98  97
3  48  87  79
4  98  38  62
5  46  16  30
6  95  39  70
7  90  59   9
8  14  85  37
9  48  29  46


a[np.isnan(a)] = b[np.isnan(a)]

In [38]: a
Out[38]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5  46  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

As you can see, all NaN's in a have been replaced with the randomly-generated values in b, based on a's NaN-value indices.
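
The same replacement can also be done purely by position, which sidesteps label matching and draws only as many random numbers as there are gaps. This is a minimal sketch, assuming an all-numeric frame; the helper name fill_nan_with_normal and the seed are illustrative and not part of the answer above:

```python
import numpy as np
import pandas as pd


def fill_nan_with_normal(frame, rng):
    """Return a copy of `frame` with each NaN replaced by its own N(0, 1) draw."""
    values = frame.to_numpy(dtype=float, copy=True)  # work on a float copy
    mask = np.isnan(values)                          # positions of the gaps
    values[mask] = rng.standard_normal(mask.sum())   # one independent draw per NaN
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)


rng = np.random.default_rng(42)  # seed chosen only for reproducibility

# Float frame so that a cell can hold NaN; .loc avoids chained assignment.
a = pd.DataFrame(rng.integers(0, 100, size=(10, 3)).astype(float))
a.loc[5, 0] = np.nan

filled = fill_nan_with_normal(a, rng)
```

Because the result is rebuilt with the original index and columns, it can be used in place of the input frame.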

You could try something like this, assuming you are dealing with one series:

ser = data['column_with_nulls_to_replace']
index = ser[ser.isnull()].index
# build a Series (rather than a DataFrame) of random draws, indexed by the null positions
fill = pd.Series(np.random.randn(len(index)), index=index)
ser.update(fill)
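
Whether updating an extracted Series also changes the parent frame depends on the pandas version and its copy semantics, so a variant that assigns through .loc on the frame itself may be safer. This is a minimal sketch, assuming a small made-up frame; the "other" column and the seed are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column name mirrors the placeholder used above.
col = "column_with_nulls_to_replace"
data = pd.DataFrame({col: [1.0, np.nan, 3.0, np.nan], "other": [10, 20, 30, 40]})

rng = np.random.default_rng(1)  # seed chosen only for reproducibility
null_rows = data[col].isna()    # boolean mask of the rows to fill

# Assign through .loc on the parent frame, one independent draw per null row.
data.loc[null_rows, col] = rng.standard_normal(null_rows.sum())
```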
