numpy array: replace nan values with average of columns

Question

I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace each nan with the average of the column it is in?

Answer 1

No loops required:

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

#Obtain mean of columns as you need, nanmean is convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

#Find indices that you need to replace
inds = np.where(np.isnan(a))

#Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
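The steps above can be wrapped in a small helper. This is only a sketch: the function name `fill_nan_with_col_mean` and the sample array are made up for illustration, and the helper works on a copy so the input is left untouched (the snippet above modifies `a` in place).

```python
import numpy as np

def fill_nan_with_col_mean(a):
    """Return a copy of `a` with each NaN replaced by its column's NaN-ignoring mean."""
    out = np.array(a, dtype=float)          # np.array copies by default
    col_mean = np.nanmean(out, axis=0)      # one mean per column, NaNs ignored
    rows, cols = np.where(np.isnan(out))    # positions of the NaNs
    out[rows, cols] = col_mean[cols]        # pick the mean of each NaN's column
    return out

# Hypothetical sample data
a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 9.0]])
filled = fill_nan_with_col_mean(a)
print(filled)
```

Indexing `col_mean[cols]` does the same job as `np.take(col_mean, inds[1])` in the answer.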

Answer 2

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.

Also note how the all-nan column is nicely handled. The mean is zero since you're taking the mean of zero elements. The method using nanmean doesn't handle all-nan columns:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])

Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.

Row-wise mean

Replacing nan values with the row-wise mean instead of the column-wise mean requires a tiny change for broadcasting to take effect nicely:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
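Both the column-wise and row-wise variants can be folded into one helper. The following is a sketch under some assumptions: the name `fill_nan_with_mean` and its `empty_value` parameter are invented here, and the fill value for all-nan rows or columns is set explicitly with `ma.filled` rather than relying on whatever data sits under the mask.

```python
import numpy as np
import numpy.ma as ma

def fill_nan_with_mean(a, axis=0, empty_value=0.0):
    """Replace NaNs with the mean along `axis`; all-NaN slices get `empty_value`."""
    a = np.asarray(a, dtype=float)
    nan_mask = np.isnan(a)
    # Masked mean ignores the NaN entries; all-NaN slices come back masked.
    means = ma.array(a, mask=nan_mask).mean(axis=axis)
    # Turn masked means into an explicit, chosen value.
    means = ma.filled(means, empty_value)
    # Restore the reduced axis so the means broadcast against `a`.
    means = np.expand_dims(means, axis)
    return np.where(nan_mask, means, a)

a = np.array([[0.0, 1.0, 2.0, 3.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, np.nan, 12.0, np.nan, 14.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])
row_filled = fill_nan_with_mean(a, axis=1)
print(row_filled)
```

`np.expand_dims(means, axis)` plays the role of `[:, np.newaxis]` above, but works for either axis.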

Answer 3

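If partial is your original data and replace is an array of the same shape filled with the averaged values, then this code uses the value from partial wherever one exists and the value from replace otherwise.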

Complete = np.where(np.isnan(partial), replace, partial)
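The answer does not say how replace is built. One possible way, shown here only as a sketch with made-up sample data, is to broadcast the column means to the shape of partial:

```python
import numpy as np

# Hypothetical sample data
partial = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

col_mean = np.nanmean(partial, axis=0)               # NaN-ignoring mean per column
replace = np.broadcast_to(col_mean, partial.shape)   # read-only view, same shape as partial
complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```

Since np.where broadcasts its arguments anyway, passing `col_mean` directly would give the same result; the explicit `np.broadcast_to` just matches the "same shape" wording of the answer.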

Answer 4

Alternative: Replacing NaNs with interpolation of columns.

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X

Example use:

X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan    ]])

X_complete = interpolate_nans(X_incomplete)

print(X_complete)
[[10,     20,     30    ],
 [20,     30,     40    ],
 [30,     40,     50    ],
 [40,     50,     50    ]]

I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
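Note that interpolate_nans overwrites its argument, and np.interp raises a ValueError when a column has no valid values at all. A variant that works on a copy and skips such columns might look like the sketch below; the name `interpolate_nans_copy` is made up, and leaving all-NaN columns untouched is an assumption about the desired behavior.

```python
import numpy as np

def interpolate_nans_copy(X):
    """Return a copy of X with NaNs replaced by per-column linear interpolation."""
    out = np.array(X, dtype=float)           # work on a copy
    rows = np.arange(out.shape[0])
    for j in range(out.shape[1]):
        nan_j = np.isnan(out[:, j])
        # Nothing to fill, or nothing to interpolate from: leave the column as is.
        if not nan_j.any() or nan_j.all():
            continue
        # np.interp clamps to the nearest valid value outside the known range.
        out[nan_j, j] = np.interp(rows[nan_j], rows[~nan_j], out[~nan_j, j])
    return out

X_incomplete = np.array([[10,     20,     30],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50],
                         [40,     50,     np.nan]])
X_filled = interpolate_nans_copy(X_incomplete)
print(X_filled)
```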

Answer 5

This isn't very clean, but I can't think of a way to do it other than iterating:

#example
a = np.arange(16, dtype = float).reshape(4,4)
a[2,2] = np.nan
a[3,3] = np.nan

indices = np.where(np.isnan(a)) #returns a tuple of row and column index arrays
for row, col in zip(*indices):
    a[row,col] = np.mean(a[~np.isnan(a[:,col]), col])

Answer 6

To extend Donald's answer, I provide a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

In [231]: a
Out[231]: 
array([[0, 3, 6],
       [2, 0, 0]])


In [232]: col_mean = np.nanmean(a, axis=0)
Out[232]: array([ 1. ,  1.5,  3. ])

In [228]: np.where(np.equal(a, 0), col_mean, a)
Out[228]: 
array([[ 1. ,  3. ,  6. ],
       [ 2. ,  1.5,  3. ]])

Answer 7

Using simple functions with loops:

a=[[0.93230948, np.nan, 0.47773439, 0.76998063],
  [0.94460779, 0.87882456, 0.79615838, 0.56282885],
  [0.94272934, 0.48615268, 0.06196785, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan],
  [0.64940216, 0.74414127, np.nan, np.nan]]

print("------- original array -----")
for aa in a:
    print(aa)

# GET COLUMN MEANS: 
ta = np.array(a).T.tolist()                         # transpose the array; 
col_means = list(map(lambda x: np.nanmean(x), ta))  # get means; 
print("column means:", col_means)

# REPLACE NAN ENTRIES WITH COLUMN MEANS: 
nrows = len(a); ncols = len(a[0]) # get number of rows & columns; 
for r in range(nrows):
    for c in range(ncols):
        if np.isnan(a[r][c]):
            a[r][c] = col_means[c]

print("------- means added -----")
for aa in a:
    print(aa)

Output:

------- original array -----
[0.93230948, nan, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, nan]
[0.64940216, 0.74414127, nan, nan]
[0.64940216, 0.74414127, nan, nan]

column means: [0.82369018599999999, 0.71331494500000003, 0.44528687333333333, 0.66640474000000005]

------- means added -----
[0.93230948, 0.71331494500000003, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]

The for loops can also be written as a list comprehension:

new_a = [[col_means[c] if np.isnan(a[r][c]) else a[r][c] 
            for c in range(ncols) ]
        for r in range(nrows) ]

Answer 8

You might want to try this built-in function:

x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
-1.28000000e+002,   1.28000000e+002])

8 answers

Answer 1: score 86, accepted, 2013-09-08 22:51:06
Answer 2: score 14, 2016-10-24 00:23:51
Answer 3: score 5, 2016-08-29 15:18:59
Answer 4: score 4, 2016-03-18 08:52:13
Answer 5: score 2, 2013-09-08 22:42:37
Answer 6: score 2, 2016-10-23 20:25:01
Answer 7: score 0, 2018-01-09 12:12:13
Answer 8: score -3, 2015-03-09 15:35:48
