
pandas DataFrame: replace nan values with average of columns

I've got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace the nans with the averages of the columns they are in?

This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.

You can simply use DataFrame.fillna to fill the nans directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
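A minimal sketch of the two equivalent calls, using a small hypothetical frame (the values are illustrative, not taken from the question). It assumes every column is numeric; with mixed dtypes, df.mean(numeric_only=True) restricts the means to the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the question
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

means = df.mean()                         # Series indexed by column name: A=2.0, B=6.0
filled_series = df.fillna(means)          # each column is filled with its own mean
filled_dict = df.fillna(means.to_dict())  # same result, passing a plain dict

print(filled_series)
```

Both calls match the fill values to columns by label, so the order of the columns does not matter.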

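Try:

# Fill through the DataFrame: inplace=True on a column selection may not update sub2 in recent pandas
sub2.fillna({'income': sub2['income'].mean()}, inplace=True)

In [16]: df = DataFrame(np.random.randn(10,3))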

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply fillna per column, using the mean of that column:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

If you want to impute missing values with the mean and go column by column, this will impute only with the mean of that column. This might be a little more readable.

sub2['income'] = sub2['income'].fillna(sub2['income'].mean())

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')

# Take every column except the last as a NumPy array
X = Dataset.iloc[:, :-1].values

# Use the SimpleImputer class to replace NaN with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

Use df.fillna(df.mean()) directly to fill all null values with the mean of their column.

If you want to fill the null values of a single column with the mean of that column, you can use the following.

Suppose x = df['Item_Weight'], where Item_Weight is the column name.

Here we fill the null values of x with the mean of x and assign the result back to x:

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

If you want to fill null values with a string instead, use the following (here Outlet_Size is the column name):

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
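As a small end-to-end sketch of both fills, assuming a hypothetical frame that reuses only the column names from this answer (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the answer above
df = pd.DataFrame({
    "Item_Weight": [9.0, np.nan, 15.0],
    "Outlet_Size": ["Medium", None, "Small"],
})

# Numeric column: replace NaN with the column mean (12.0 here)
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Text column: replace missing entries with a placeholder string
df["Outlet_Size"] = df["Outlet_Size"].fillna("Missing")

print(df)
```

Assigning the result back to the column keeps the change visible on df itself, whichever pandas version is in use.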

Although the code below does the job, its performance takes a big hit when you deal with a DataFrame of 100k records or more:

df.fillna(df.mean())

In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() all over the DataFrame.

I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value.

#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----

df.fillna(df.mean())

#------------------------------------------------
#----(Code 2) Selective Treatment----------------

for i in df.columns[df.isnull().any(axis=0)]:     #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())            #---assign back instead of inplace=True on a column selection

#---df.isnull().any(axis=0) gives a True/False flag (Boolean Series),
#---which, when used to index df.columns, identifies the variables with NaN values

Below is the performance I observed as I kept increasing the number of records in the DataFrame.

DataFrame with ~100k records

  • Code 1: 22.06 seconds
  • Code 2: 0.03 seconds

DataFrame with ~200k records

  • Code 1: 180.06 seconds
  • Code 2: 0.06 seconds

DataFrame with ~1.6 million records

  • Code 1: code kept running endlessly
  • Code 2: 0.40 seconds

DataFrame with ~13 million records

  • Code 1: did not even try, after seeing the performance on 1.6 million records
  • Code 2: 3.20 seconds
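To check how the two approaches compare on your own setup, a rough timing harness along these lines could be used. The frame shape, the NaN pattern, and the function names fill_all and fill_selective are assumptions made for illustration; timings depend heavily on the pandas version and hardware, so the figures above may not reproduce.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame: 20 numeric columns, NaNs in only the first 4
df = pd.DataFrame(rng.standard_normal((100_000, 20)),
                  columns=[f"v{i}" for i in range(20)])
df.iloc[::20, :4] = np.nan  # every 20th row is missing in columns v0..v3


def fill_all(frame):
    # Code 1: fill every column at once
    return frame.fillna(frame.mean())


def fill_selective(frame):
    # Code 2: only touch the columns that actually contain NaN
    out = frame.copy()  # the copy is included in the timing
    for col in out.columns[out.isnull().any(axis=0)]:
        out[col] = out[col].fillna(out[col].mean())
    return out


for fn in (fill_all, fill_selective):
    start = time.perf_counter()
    result = fn(df)
    print(fn.__name__, f"{time.perf_counter() - start:.4f} s")
```

Both functions return a new frame and leave df untouched, so they can be run repeatedly on the same input.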

Apologies for the long answer! Hope this helps!

Another option besides those above is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls using some other column function.

Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column

Say your DataFrame is df and you have one column called nr_items, that is, df['nr_items'].

If you want to replace the NaN values of your column df['nr_items'] with the mean of the column, use the .fillna() method:

mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)

I have created a new df column called nr_item_ave to store the column with the NaN values replaced by the mean value of the column.

You should be careful when using the mean. If you have outliers, it is more advisable to use the median.
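A small illustration of why the median can be the safer choice, using made-up values with one extreme entry. The column nr_item_median is a hypothetical name added for this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one extreme value (100) in nr_items
df = pd.DataFrame({"nr_items": [1.0, 2.0, 3.0, 100.0, np.nan]})

mean_value = df["nr_items"].mean()      # 26.5, pulled up by the outlier
median_value = df["nr_items"].median()  # 2.5, close to the typical values

df["nr_item_ave"] = df["nr_items"].fillna(mean_value)
df["nr_item_median"] = df["nr_items"].fillna(median_value)

print(df)
```

Here the mean-filled entry (26.5) is far from every typical observation, while the median-filled entry (2.5) sits among them.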

Using the SimpleImputer class from the sklearn library:

import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer imputes column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:, 1:3])
x[:, 1:3] = missingvalues.transform(x[:, 1:3])

Note: in recent versions, the value of the missing_values parameter changed from NaN to np.nan.

I use this method to fill missing values with the average of a column.

fill_mean = lambda col : col.fillna(col.mean())

df = df.apply(fill_mean, axis = 0)

You can also use value_counts to get the most frequent values. This works on different datatypes.

df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))

Here is the value_counts API reference.
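A short sketch of the most-frequent-value fill on a hypothetical frame with one text column and one numeric column (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data
df = pd.DataFrame({
    "color": ["red", "blue", "red", None],
    "count": [2.0, 2.0, 5.0, np.nan],
})

# value_counts() sorts by frequency and skips NaN by default,
# so index[0] is the most frequent value in each column
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

print(df)
```

When several values are tied for most frequent, it is probably best not to rely on which one comes first; Series.mode() returns all of the tied values if you need to handle that case explicitly.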

