pandas DataFrame: replace nan values with average of columns
I've got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace the nans with the averages of the columns they are in?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nans directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
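As a small illustration of the Series versus dict point, here is a minimal sketch. The two-column frame and the names by_series and by_dict are made up for this example; it fills the same frame once with the Series returned by df.mean() and once with the equivalent dict, and the two results should match.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen so the column means are easy to check by hand
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 8.0]})

means = df.mean()                      # Series indexed by column name: A=2.0, B=6.0
by_series = df.fillna(means)           # each column is filled from the matching Series entry
by_dict = df.fillna(means.to_dict())   # same fill, passing a plain dict instead

print(by_series)
```

Note that fillna returns a new frame here; df itself still contains the NaNs unless the result is assigned back.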
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Fill each column's NaNs with the mean of that column:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
If you want to impute missing values with the mean and go column by column, this imputes only with the mean of that column. It might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values

# Use the SimpleImputer class to replace NaN with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
Directly use df.fillna(df.mean()) to fill all the null values with the mean.
If you want to fill the null values of a single column with the mean of that column, you can use the following. Suppose x = df['Item_Weight'], where Item_Weight is the column name. Here we fill the null values of x with the mean of x and assign the result back to x:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
If you want to fill null values with some string, use the following, where Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
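When a frame mixes numeric and text columns, the two fills can be combined. The following is a minimal sketch under stated assumptions: the data is made up, the column names Item_Weight and Outlet_Size are reused from above, and numeric_only=True is passed to mean() so that the text column is skipped when the means are computed.

```python
import numpy as np
import pandas as pd

# Made-up data reusing the column names from the answer above
df = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 12.0, np.nan],
    "Outlet_Size": ["Small", None, "High", "Medium"],
})

# Means of the numeric columns only; the text column is left out of the result
numeric_means = df.mean(numeric_only=True)

# Fills Item_Weight with its mean (10.5) and leaves Outlet_Size untouched
filled = df.fillna(numeric_means)

# Text column: replace missing entries with a placeholder string
filled["Outlet_Size"] = filled["Outlet_Size"].fillna("Missing")

print(filled)
```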
Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() over the whole DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value:
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())  #---assigned back; chained inplace=True is deprecated in recent pandas
#---df.isnull().any(axis=0) gives a True/False flag (Boolean Series),
#---which, when used to index df.columns, identifies the variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
[Figure: timings for a DataFrame with ~100k records]
[Figure: timings for a DataFrame with ~200k records]
[Figure: timings for a DataFrame with ~1.6 million records]
[Figure: timings for a DataFrame with ~13 million records]
Apologies for the long answer! Hope this helps!
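To check the comparison on your own machine, a sketch along these lines could be used. Everything in it is an assumption for illustration: the synthetic frame (20 columns, NaNs in only 4 of them, 100k rows) only mimics the setup described above, and fill_all and fill_selective are hypothetical wrappers around Code 1 and Code 2. No particular timing result is implied; the numbers depend on the pandas version and the data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the frame described above: 20 columns, NaNs in only 4 of them
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 20)),
                  columns=[f"v{i}" for i in range(20)])
df.iloc[::10, :4] = np.nan  # every 10th row of the first 4 columns is missing


def fill_all(frame):
    """Code 1: fill every column with its mean."""
    return frame.fillna(frame.mean())


def fill_selective(frame):
    """Code 2: touch only the columns that actually contain NaN."""
    out = frame.copy()
    for col in out.columns[out.isnull().any(axis=0)]:
        out[col] = out[col].fillna(out[col].mean())
    return out


# Both variants should agree up to floating-point rounding; only the running time may differ
same = np.allclose(fill_all(df).to_numpy(), fill_selective(df).to_numpy())

t_all = timeit.timeit(lambda: fill_all(df), number=5)
t_sel = timeit.timeit(lambda: fill_selective(df), number=5)
print(f"same result: {same}, all columns: {t_all:.3f}s, selective: {t_sel:.3f}s")
```

The selective variant copies the frame first so that the input stays untouched, which makes repeated timing runs comparable.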
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls using some other column function.
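The one-liner above relies on the axis argument of groupby, which recent pandas versions deprecate. As an alternative sketch of the same idea (not the answer's own code), a plain apply with a swappable statistic can be used. The helper name fill_with and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_with(frame, stat):
    """Fill NaNs in each column with stat(column); stat is any Series -> scalar function."""
    return frame.apply(lambda col: col.fillna(stat(col)), axis=0)


# Made-up data with one missing value per column
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 6.0], "b": [np.nan, 4.0, 6.0, 8.0]})

by_mean = fill_with(df, lambda col: col.mean())      # a -> 3.0, b -> 6.0
by_median = fill_with(df, lambda col: col.median())  # a -> 2.0, b -> 6.0
by_min = fill_with(df, lambda col: col.min())        # a -> 1.0, b -> 4.0
print(by_median)
```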
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column, use the .fillna() method:
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean. If you have outliers, it is more advisable to use the median.
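To see why outliers matter, here is a minimal sketch with made-up numbers. The column name nr_item_median is hypothetical, chosen by analogy with nr_item_ave.

```python
import numpy as np
import pandas as pd

# Made-up column with one extreme value (1000) and one missing entry
df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, np.nan, 1000.0]})

mean_value = df["nr_items"].mean()      # 251.5, pulled up by the outlier
median_value = df["nr_items"].median()  # 2.5, barely affected by it

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_median"] = df["nr_items"].fillna(median_value)
print(df)
```

With the mean, the missing entry becomes 251.5, far from the typical values of 1 to 3, while the median gives 2.5.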
Using the sklearn library's SimpleImputer class:
import numpy as np
from sklearn.impute import SimpleImputer
# x is assumed to be a NumPy array; SimpleImputer works column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values = np.nan, strategy = 'mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: In recent versions, the value of the missing_values parameter changed from NaN to np.nan.
I use this method to fill missing values with the average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)