Replace zeros with mean of non-zeros along an axis of array - Python / NumPy

How can I replace the zeros in the first row with values computed from the remaining rows?

import numpy as np
from sklearn.impute import SimpleImputer


data = np.array([[0,0,0,0,3,2,4,4,0], 
                  [4,6,8,9,3,1,1,4,0],
                  [4,6,8,9,3,1,1,4,0]]) 
print (data.shape)

imputer = SimpleImputer(missing_values=0, strategy='mean')
res = imputer.fit_transform(data) 
print (res)

[[4. 6. 8. 9. 3. 2. 4. 4.]
 [4. 6. 8. 9. 3. 1. 1. 4.]
 [4. 6. 8. 9. 3. 1. 1. 4.]]

However, no column should be dropped.

The expected result is:

[[4. 6. 8. 9. 3. 2. 4. 4. 0]
 [4. 6. 8. 9. 3. 1. 1. 4. 0]
 [4. 6. 8. 9. 3. 1. 1. 4. 0]]

Any ideas?

Plain indexing is enough for what you need:

m = data[0] == 0
data[0, m] = data[1:,m].mean(0)

print(data)

[[4 6 8 9 3 2 4 4 0]
 [4 6 8 9 3 1 1 4 0]
 [4 6 8 9 3 1 1 4 0]]
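
One caveat with the in-place assignment above: `data` has an integer dtype, so a fractional mean is truncated when it is written back, and any zeros in the lower rows still count toward the mean. A minimal sketch that works on a float copy instead (the array `demo` and the name `filled` are made up for illustration):

```python
import numpy as np

# Hypothetical input: the lower rows differ, so the column means are fractional.
demo = np.array([[0, 0, 5],
                 [1, 4, 5],
                 [2, 7, 5]])

filled = demo.astype(float)                # float copy, so means are not truncated
m = filled[0] == 0                         # columns where the first row is zero
filled[0, m] = filled[1:, m].mean(axis=0)  # mean of the remaining rows, per column

print(filled[0])
```

Working on a copy also leaves the original array untouched, which may or may not be what you want.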

To fill every zero with the mean of the other rows in its column, while excluding zeros from that mean, we can use a masked array:

m = data == 0
means = np.ma.array(data, mask = m).mean(0)
data + m * means.data

array([[4., 6., 8., 9., 3., 2., 4., 4., 0.],
       [4., 6., 8., 9., 3., 1., 1., 4., 0.],
       [4., 6., 8., 9., 3., 1., 1., 4., 0.]])
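
For reference, here is the same idea as a self-contained sketch. It uses `filled(0)` rather than `.data`, to make explicit what an all-zero column (which has no non-zero mean) gets filled with, and `np.where` rather than the addition. Both are substitutions made for this sketch and are not part of the answer above.

```python
import numpy as np

data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0]])

m = data == 0                                        # True where a value counts as missing
col_means = np.ma.array(data, mask=m).mean(axis=0)   # zeros are excluded from each mean
col_means = col_means.filled(0)                      # all-zero columns have no mean; use 0
result = np.where(m, col_means, data)                # column means broadcast over the rows

print(result)
```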

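Update

To fill with the mean of the other columns instead (that is, along each row), you can do the same thing along the other axis: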

m = data == 0
means = np.ma.array(data, mask = m).mean(1)
data + m * means.data[:,None]

array([[3.25, 3.25, 3.25, 3.25, 3.  , 2.  , 4.  , 4.  , 3.25],
       [4.  , 6.  , 8.  , 9.  , 3.  , 1.  , 1.  , 4.  , 4.5 ],
       [4.  , 6.  , 8.  , 9.  , 3.  , 1.  , 1.  , 4.  , 4.5 ]])
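
The `[:,None]` in the last line is what makes the row means line up: `mean(1)` returns one value per row, with shape `(3,)`, and it needs shape `(3, 1)` to broadcast across the columns of the `(3, 9)` mask. A small sketch of the shapes involved, using the question's array (the names `row_means` and `col_vector` are illustrative only):

```python
import numpy as np

data = np.array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0],
                 [4, 6, 8, 9, 3, 1, 1, 4, 0]])

m = data == 0
row_means = np.ma.array(data, mask=m).mean(axis=1).data  # one mean per row, shape (3,)
col_vector = row_means[:, None]                          # reshaped to (3, 1)

# A (3, 9) mask times a (3, 1) column spreads each row mean across its own row.
result = data + m * col_vector
print(row_means.shape, col_vector.shape, result.shape)
```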

Here's one way to do it along a generic axis, using an axis parameter -

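def fill0s(data, axis):
    m = data != 0                       # True where a value is non-zero
    s = data.sum(axis, keepdims=True)   # zeros add nothing, so this is the sum of non-zeros
    c = m.sum(axis, keepdims=True)      # number of non-zeros along the axis
    c[c == 0] = 1                       # avoid division by zero for all-zero slices
    return np.where(m, data, s / c)     # keep non-zeros, fill zeros with the mean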

Sample run -

In [143]: data
Out[143]: 
array([[0, 0, 0, 0, 3, 2, 4, 4, 0],
       [4, 6, 8, 9, 3, 1, 1, 4, 0],
       [6, 6, 8, 9, 3, 1, 1, 4, 0],
       [0, 6, 8, 9, 3, 1, 1, 4, 0]])

In [144]: fill0s(data,axis=0)
Out[144]: 
array([[5., 6., 8., 9., 3., 2., 4., 4., 0.],
       [4., 6., 8., 9., 3., 1., 1., 4., 0.],
       [6., 6., 8., 9., 3., 1., 1., 4., 0.],
       [5., 6., 8., 9., 3., 1., 1., 4., 0.]])

In [147]: fill0s(data,axis=1)
Out[147]: 
array([[3.25, 3.25, 3.25, 3.25, 3.  , 2.  , 4.  , 4.  , 3.25],
       [4.  , 6.  , 8.  , 9.  , 3.  , 1.  , 1.  , 4.  , 4.5 ],
       [6.  , 6.  , 8.  , 9.  , 3.  , 1.  , 1.  , 4.  , 4.75],
       [4.57, 6.  , 8.  , 9.  , 3.  , 1.  , 1.  , 4.  , 4.57]])

Timings on a larger dataset -

In [150]: np.random.seed(0)

In [151]: data = np.random.randint(0,10,(5000,5000))

In [152]: %timeit fill0s(data,axis=0)
161 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [153]: %timeit fill0s(data,axis=1)
155 ms ± 6.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#@yatu's solution
In [155]: %%timeit
     ...: m = data == 0
     ...: means = np.ma.array(data, mask = m).mean(0)
     ...: data + m * means.data
302 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [156]: %%timeit
     ...: m = data == 0
     ...: means = np.ma.array(data, mask = m).mean(1)
     ...: data + m * means.data[:,None]
291 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
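
To repeat the comparison outside IPython and confirm that both approaches return the same values, something like the following could be used. The smaller array size, the use of `np.random.default_rng`, and the wrapper name `fill0s_masked` are choices made for this sketch and do not come from the answers above. Absolute timings will differ by machine.

```python
import timeit
import numpy as np

def fill0s(data, axis):
    m = data != 0
    s = data.sum(axis, keepdims=True)
    c = m.sum(axis, keepdims=True)
    c[c == 0] = 1                      # all-zero slices: avoid dividing by zero
    return np.where(m, data, s / c)

def fill0s_masked(data, axis):
    # Hypothetical wrapper around the masked-array approach, for comparison only.
    m = data == 0
    means = np.ma.array(data, mask=m).mean(axis).filled(0)
    return np.where(m, np.expand_dims(means, axis), data)

rng = np.random.default_rng(0)
data = rng.integers(0, 10, (500, 500))

for axis in (0, 1):
    same = np.allclose(fill0s(data, axis), fill0s_masked(data, axis))
    t_sum = timeit.timeit(lambda: fill0s(data, axis), number=5)
    t_masked = timeit.timeit(lambda: fill0s_masked(data, axis), number=5)
    print(f"axis={axis} same={same} fill0s={t_sum:.4f}s masked={t_masked:.4f}s")
```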
