Scikit-Learn Custom Imputer with random value around mean value

I want to create a custom imputer that replaces each NaN value in my data with a random value in the range mean - std to mean + std, computed for the column the NaN is in.

This is the code for the imputer I have so far:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class GroupImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = check_array(X, force_all_finite=False)
        # per-column statistics, ignoring NaN entries
        self.means = np.nanmean(X, axis=0)
        self.stds = np.nanstd(X, axis=0)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self, 'means')
        check_is_fitted(self, 'stds')
        X = check_array(X, force_all_finite=False)
        # how do I apply to each row of the data?
        return 0

self.means contains the mean of each column.

self.stds contains the standard deviation of each column.

How do I replace each NaN in a row of the data with a random value between mean - std and mean + std?

Do I have to iterate through the data (for row in X:) and pick the right mean and std according to the column index? Or is there a method that does this?

No, you don't have to iterate through the data. Assume the number of rows and the number of columns of your data are 5 and 4, respectively:

import numpy as np

num_rows, num_cols = 5, 4

# just fake two arrays of column means and stds
column_means = np.random.uniform(1, 8, num_cols)
column_stds = np.random.rand(num_cols)

# the per-column bounds broadcast across the rows of the output
disp = np.random.uniform(column_means - column_stds, column_means + column_stds, size=(num_rows, num_cols))

The array disp is something like

array([[6.29377845, 6.56185572, 5.32590954, 2.14719305],
       [6.36648777, 6.97781432, 4.89773801, 2.21909144],
       [5.38109603, 6.70649396, 5.50100582, 2.26518757],
       [5.59764259, 6.90297057, 5.65199988, 2.25340505],
       [5.80928963, 6.4976407 , 5.23792109, 1.99580784]])

in which each column is uniformly sampled from the range (column mean - column std, column mean + column std). Therefore, the NaN entries of the original array can be replaced with the corresponding entries of disp.
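The replacement step itself can be sketched as follows. This is only an illustration under assumptions: the 5x4 array X, the positions of its NaN entries, and the seeded generator rng are made up here, and the column statistics are computed from X rather than faked.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical 5x4 data set with a few missing entries
X = rng.uniform(1, 8, size=(5, 4))
X[0, 1] = np.nan
X[3, 2] = np.nan
X[4, 0] = np.nan

# Column statistics that ignore the NaN entries
column_means = np.nanmean(X, axis=0)
column_stds = np.nanstd(X, axis=0)

# One candidate replacement per cell; the bounds broadcast across rows
disp = rng.uniform(column_means - column_stds,
                   column_means + column_stds,
                   size=X.shape)

# Keep observed values, take the entry of disp only where X is NaN
nan_mask = np.isnan(X)
X_filled = np.where(nan_mask, disp, X)
```

Only the cells flagged by nan_mask change; every observed value is carried over untouched.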

No, there is a better option than iterating through the data. You can create a uniformly random array (between the desired bounds) with the same shape as the data and replace every NaN value at index i with the random value at the same index.

higher_bound = self.means + self.stds
lower_bound = self.means - self.stds
# uniformly random array with the same shape as X
random_values = np.random.uniform(low=lower_bound, high=higher_bound, size=X.shape)
nan_mask = np.isnan(X)  # True where X is NaN
# take from random_values where nan_mask is True, else from the original array
X = np.where(nan_mask, random_values, X)
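Put together, the whole transformer might look like the sketch below. It is not the asker's final code: the class name RandomAroundMeanImputer and the seed parameter are hypothetical, the fitted attributes are renamed to means_ and stds_ to follow the scikit-learn trailing-underscore convention, and np.asarray stands in for check_array to keep the example short.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class RandomAroundMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: fill NaN with uniform draws around the column mean."""

    def __init__(self, seed=None):
        # seed is an assumed parameter, passed to NumPy's default_rng
        self.seed = seed

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # per-column statistics that ignore NaN entries
        self.means_ = np.nanmean(X, axis=0)
        self.stds_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self, ["means_", "stds_"])
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.seed)
        # one candidate value per cell, drawn within each column's bounds
        random_values = rng.uniform(self.means_ - self.stds_,
                                    self.means_ + self.stds_,
                                    size=X.shape)
        # replace only the NaN cells, keep observed values as they are
        return np.where(np.isnan(X), random_values, X)


# Usage on a small made-up array
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [np.nan, 30.0],
              [4.0, 40.0]])
imputer = RandomAroundMeanImputer(seed=0)
X_filled = imputer.fit_transform(X)
```

Note that with a fixed seed, each call to transform draws the same random values for inputs of the same shape; leaving seed as None gives fresh draws on every call.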
