Scikit-Learn custom imputer with a random value around the mean
I want to create a custom imputer that replaces each NaN value in my data with a random value in the range from mean - std to mean + std of the column that contains the NaN.
This is the code for the imputer I have so far:
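import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class GroupImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Allow NaN in the input, then compute per-column statistics ignoring NaN
        X = check_array(X, force_all_finite=False)
        self.means = np.nanmean(X, axis=0)
        self.stds = np.nanstd(X, axis=0)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self, 'means')
        check_is_fitted(self, 'stds')
        X = check_array(X, force_all_finite=False)
        # How do I apply this to each row of the data?
        return 0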
self.means contains the mean of each column, and self.stds contains the standard deviation of each column.
How do I fill each NaN in a row of the data with a random value between mean - std and mean + std?
Do I have to iterate through the data (for row in X:) and pick the right mean and std according to the column index? Or is there a method that will do this?
No, you don't have to iterate through the data. Assume the number of rows and the number of columns of your data are 5 and 4, respectively:
import numpy as np

num_rows, num_cols = 5, 4
# Just fake two arrays of column means and stds
column_means = np.random.uniform(1, 8, num_cols)
column_stds = np.random.rand(num_cols)
# low/high have shape (num_cols,) and broadcast across the rows of the requested size
disp = np.random.uniform(column_means - column_stds, column_means + column_stds, size=(num_rows, num_cols))
The array disp is something like
array([[6.29377845, 6.56185572, 5.32590954, 2.14719305],
       [6.36648777, 6.97781432, 4.89773801, 2.21909144],
       [5.38109603, 6.70649396, 5.50100582, 2.26518757],
       [5.59764259, 6.90297057, 5.65199988, 2.25340505],
       [5.80928963, 6.4976407 , 5.23792109, 1.99580784]])
in which each column is uniformly sampled from the range (column mean - column std, column mean + column std). Therefore, the NaN entries of the original array can be replaced with the entries of disp at the same positions.
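As a minimal sketch of that replacement step: the array X below is made-up data, and the seeded generator rng is only there to make the sketch repeatable; neither comes from the question. The NaN positions are filled from disp with a boolean mask, and all other entries are left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded only so the sketch is repeatable

# Hypothetical data with missing entries (not from the question)
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, 8.0, 9.0],
              [2.0, 6.0, 7.0]])

# Per-column statistics that ignore NaN
column_means = np.nanmean(X, axis=0)
column_stds = np.nanstd(X, axis=0)

# One candidate value per cell, drawn per column from [mean - std, mean + std)
disp = rng.uniform(column_means - column_stds,
                   column_means + column_stds,
                   size=X.shape)

# Copy the candidates into the NaN positions only
nan_mask = np.isnan(X)
X_filled = X.copy()
X_filled[nan_mask] = disp[nan_mask]
```

Only the cells selected by nan_mask change, so the observed values in X_filled are identical to those in X.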
No, there is a better option than iterating through the data. You can create a uniformly random array (between the desired bounds) with the same shape as X, and replace every NaN value at index i with the random value at the same index.
higher_bound = self.means + self.stds
lower_bound = self.means - self.stds
# Uniformly random array with the same shape as X; the per-column bounds broadcast over the rows
random_values = np.random.uniform(low=lower_bound, high=higher_bound, size=X.shape)
nan_mask = np.isnan(X)  # True where X is NaN
# Take from random_values where nan_mask is True, otherwise keep the original value
X = np.where(nan_mask, random_values, X)
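Putting the pieces together, the transformer might look roughly like the sketch below. Several things here are assumptions rather than part of the question or the answers: the class name RandomAroundMeanImputer, the random_state parameter, the trailing-underscore attribute names means_ and stds_, and the use of np.asarray in place of check_array (chosen only to keep the sketch independent of the scikit-learn version). The mixin is listed before BaseEstimator, which is the order scikit-learn's documentation recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class RandomAroundMeanImputer(TransformerMixin, BaseEstimator):
    """Hypothetical imputer: fills NaN with uniform draws from mean +/- std."""

    def __init__(self, random_state=None):
        # Assumption: a seed parameter, added so results can be reproduced
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column statistics computed while ignoring NaN
        self.means_ = np.nanmean(X, axis=0)
        self.stds_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, ["means_", "stds_"])
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # Candidate values for every cell; per-column bounds broadcast over rows
        random_values = rng.uniform(self.means_ - self.stds_,
                                    self.means_ + self.stds_,
                                    size=X.shape)
        # Keep observed values, use the random candidate only where X is NaN
        return np.where(np.isnan(X), random_values, X)


# Usage on made-up data
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0],
              [5.0, 6.0]])
imputer = RandomAroundMeanImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
```

Because the statistics are stored in fit, transform can also be applied to new data with the same number of columns, and the fill values are then drawn around the training means.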