Impute categorical missing values in scikit-learn

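I have pandas data with some columns of text type. These text columns also contain some NaN values. What I want to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run: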

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python raises an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome.

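To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats. I guess it might make more sense to use the median for integer columns.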

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
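
The answer suggests that integer columns might be better served by the median. A minimal sketch of that variant follows. Note that a pandas integer column turns into float as soon as it holds NaN, so this sketch treats a numeric column as "integer" when all of its non-missing values are whole numbers. The helper name column_fill_value and the example column names are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def column_fill_value(s):
    """Pick a fill value for one column (hypothetical helper)."""
    if not is_numeric_dtype(s):
        # Non-numeric column: most frequent value
        return s.value_counts().index[0]
    values = s.dropna()
    if (values % 1 == 0).all():
        # Only whole numbers: treat as an integer column and use the median
        return values.median()
    # Otherwise treat as a float column and use the mean
    return values.mean()


X = pd.DataFrame({
    'label': ['a', 'b', 'b', np.nan],
    'count': [1, 1, 2, np.nan],
    'score': [0.5, 1.5, 2.5, np.nan],
})

# One fill value per column, passed to fillna as a dict
fill = {c: column_fill_value(X[c]) for c in X}
filled = X.fillna(fill)
print(filled)
```

Here 'count' is filled with its median and 'score' with its mean, while 'label' gets its most frequent value.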

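You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have sub-pipelines for numerical and string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):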

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
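
To make the pieces above concrete, here is a minimal end-to-end sketch of two sub-pipelines joined by FeatureUnion. It is an illustration under assumptions, not the answer's exact setup: sklearn.impute.SimpleImputer is swapped in for both Imputer and CategoricalImputer so the example runs without sklearn_pandas, and the column names "size" and "color" are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select columns by name and pass them on as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical column names, chosen only for this example.
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(["size"])),
    ("imputer", SimpleImputer(strategy="mean")),           # stand-in for Imputer
])
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(["color"])),
    ("imputer", SimpleImputer(strategy="most_frequent")),  # stand-in for CategoricalImputer
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

df = pd.DataFrame({
    "size": [1.0, 2.0, 3.0, np.nan],
    "color": ["a", "b", "b", np.nan],
})

# Numeric and categorical results are stacked side by side into one array
result = full_pipeline.fit_transform(df)
print(result)
```

The output is a plain array with the numeric columns first and the categorical columns after them, in the order the sub-pipelines are listed.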

There is a package, sklearn-pandas, that offers imputation for categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

Copying and modifying sveitser's answer, I made an imputer for pandas.Series objects:

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        if   X.dtype == numpy.dtype('O'): self.fill = X.value_counts().index[0]
        else                            : self.fill = X.mean()
        return self

    def transform(self, X, y=None):
       return X.fillna(self.fill)

To use it, you would do the following:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series
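  • With Imputer, strategy = 'most_frequent' can be used only with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn imputer we can either apply it to the whole data frame (if all features are quantitative) or use a for loop over a list of features/columns of similar type (see the example below). The custom imputer, however, can be used with any combination.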

     from sklearn.preprocessing import Imputer

     impute = Imputer(strategy='mean')
     for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
         xx[cols] = impute.fit_transform(xx[[cols]])
  • Custom imputer:

     from sklearn.preprocessing import Imputer
     from sklearn.base import TransformerMixin

     class CustomImputer(TransformerMixin):
         def __init__(self, cols=None, strategy='mean'):
             self.cols = cols
             self.strategy = strategy

         def transform(self, df):
             X = df.copy()
             impute = Imputer(strategy=self.strategy)
             if self.cols == None:
                 self.cols = list(X.columns)
             for col in self.cols:
                 if X[col].dtype == np.dtype('O'):
                     X[col].fillna(X[col].value_counts().index[0], inplace=True)
                 else:
                     X[col] = impute.fit_transform(X[[col]])
             return X

         def fit(self, *_):
             return self
  • Data frame:

     X = pd.DataFrame({'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
                       'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
                       'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'],
                       'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})

                 city boolean ordinal_column  quantitative_column
     0          tokyo     yes  somewhat like                  1.0
     1            NaN      no           like                 11.0
     2         london     NaN  somewhat like                 -0.5
     3        seattle      no           like                 10.0
     4  san francisco      no  somewhat like                  NaN
     5          tokyo     yes        dislike                 20.0
  • 1) Can be used with a list of features of similar type.

     cci = CustomImputer(cols=['city', 'boolean'])  # here default strategy = mean
     cci.fit_transform(X)
  • 2) Can be used with strategy = median.

     sd = CustomImputer(['quantitative_column'], strategy='median')
     sd.fit_transform(X)
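  • 3) Can be used on the whole data frame; it will use the default mean (or we can change it to median). For qualitative features it uses strategy = 'most_frequent', and for quantitative features mean/median.

     call = CustomImputer()
     call.fit_transform(X)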
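
sklearn.preprocessing.Imputer was removed in later scikit-learn releases, so the class above will not import there. The following is a rough sketch of the same idea built on sklearn.impute.SimpleImputer instead. It is an adaptation, not the answer's code: the class name CustomImputerSketch is hypothetical, and unlike the original it learns the fill values in fit rather than in transform, so the values learned on one data frame can be reused on another.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class CustomImputerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical SimpleImputer-based variant of the CustomImputer idea."""

    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def fit(self, df, y=None):
        cols = list(df.columns) if self.cols is None else self.cols
        self.fill_ = {}
        for col in cols:
            if is_numeric_dtype(df[col]):
                # Quantitative column: mean/median via SimpleImputer
                imp = SimpleImputer(strategy=self.strategy).fit(df[[col]])
                self.fill_[col] = imp.statistics_[0]
            else:
                # Qualitative column: most frequent value
                self.fill_[col] = df[col].value_counts().index[0]
        return self

    def transform(self, df):
        # fillna with a dict only touches the listed columns
        return df.fillna(self.fill_)


X = pd.DataFrame({
    'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
    'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
    'quantitative_column': [1, 11, -.5, 10, np.nan, 20],
})

out_mean = CustomImputerSketch().fit_transform(X)
out_median = CustomImputerSketch(['quantitative_column'], strategy='median').fit_transform(X)
print(out_mean)
```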

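Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.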

import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            # pd.DataFrame(X) lets one check cover both DataFrame and Series input
            if not all(is_numeric_dtype(dtype) for dtype in pd.DataFrame(X).dtypes):
                raise ValueError('dtypes mismatch: a numeric dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            # A list of fillers is matched to the columns by position
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Usage

>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd


>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 
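
For readers who only need the 'mode' and 'fill' behaviour shown above, the two pandas calls underneath can be sketched directly, without the class. The data frame below is rebuilt from the usage output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"MasVnrArea": [196.0, 196.0, 380.0, 350.0, np.nan],
     "FireplaceQu": [np.nan, np.nan, "Gd", "TA", "Gd"]},
    index=pd.Index([1, 974, 21, 5, 651], name="Id"),
)

# strategy='mode': the first row of DataFrame.mode() holds one mode per column
mode_filled = df.fillna(df.mode().iloc[0])

# strategy='fill' with a list: pair each column name with its filler by position
fillers = dict(zip(df.columns, [0, "NA"]))
const_filled = df.fillna(fillers)

print(mode_filled)
print(const_filled)
```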

This code fills a Series with its most frequent category:

import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Output:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object
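
A shorter route to the same result, offered as an alternative sketch rather than as the answer's method, is to take the most frequent value from Series.mode() and pass it to fillna:

```python
import numpy as np
import pandas as pd

# Same fake data as above
m = pd.Series(list('abca'))
m.iloc[1] = np.nan

# mode() ignores NaN by default; iloc[0] takes the first mode if there are ties
most_common = m.mode().iloc[0]
new_m = m.fillna(most_common)
print(new_m)
```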

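Using sklearn.impute.SimpleImputer instead of Imputer easily resolves this, since it can handle categorical variables.

As per the scikit-learn documentation: if "most_frequent", then replace missing values using the most frequent value along each column. Can be used with strings or numeric data.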

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

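from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
# fit_transform learns the most frequent value and fills in one step; ravel() flattens the (n, 1) result
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()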
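
As a self-contained illustration of the "strings or numeric data" point, the sketch below applies one SimpleImputer to a small frame holding a text column and a numeric column. The values and the Item_Weight column are made up for the example. The imputer returns an object array here, so the result is wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up example data: one text column, one numeric column
data = pd.DataFrame({
    'Outlet_Size': ['Small', 'Medium', np.nan, 'Medium'],
    'Item_Weight': [9.3, np.nan, 9.3, 5.9],
})

imputer = SimpleImputer(strategy='most_frequent')
filled = pd.DataFrame(
    imputer.fit_transform(data),   # each column is filled with its own most frequent value
    columns=data.columns,
    index=data.index,
)
print(filled)
```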

Similarly, you can modify Imputer for strategy='most_frequent':

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

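where pandas.DataFrame.mode() finds the most frequent value for each column and pandas.DataFrame.fillna() then fills the missing values with these. Other strategy values are still handled by Imputer in the same way.

You could try the following: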

# idxmax() returns the most frequent value itself (the index label of the largest count)
replace = df['<yourcolumn>'].value_counts().idxmax()

df['<yourcolumn>'] = df['<yourcolumn>'].fillna(replace)
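
One detail worth checking when using this approach: in current pandas, Series.argmax() returns an integer position rather than an index label, so idxmax() is the call that yields the category itself. A small sketch with a made-up column name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'blue', np.nan]})

counts = df['color'].value_counts()
position = counts.argmax()   # integer position of the largest count
replace = counts.idxmax()    # index label of the largest count, i.e. the category

df['color'] = df['color'].fillna(replace)
print(df)
```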

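MissForest can be used to impute missing values in a categorical variable together with other categorical features. It works iteratively, similar to IterativeImputer, with a random forest as the base model.

The following code label-encodes the features along with the target variable, fits the model to impute the NaN values, and then decodes the features back: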

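import sys

import pandas as pd
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder

# missingpy imports the old module path sklearn.neighbors.base, so alias it first
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest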

def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders

# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest 
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))
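
The encode, impute, decode round trip can be hard to follow inside the full script, so here is a minimal sketch of just that part on a single made-up Series. The imputation step is replaced by a plain fillna as a stand-in (an assumption for illustration only; it is not what MissForest does), so the example runs without missingpy.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['red', np.nan, 'blue', 'red'])

# Encode only the non-missing values; reindex puts NaN back where values were missing
mask = s.notnull()
encoder = LabelEncoder()
codes = pd.Series(encoder.fit_transform(s[mask]), index=s[mask].index).reindex(s.index)

# Stand-in for the imputer output: pretend the gap was filled with code 1
imputed = codes.fillna(1)

# Decode the integer codes back to the original labels
decoded = encoder.inverse_transform(imputed.astype(int))
print(list(decoded))
```

LabelEncoder sorts the classes, so 'blue' becomes 0 and 'red' becomes 1, and the imputed codes must be cast back to int before decoding.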
