When I use the imblearn pipeline instead of the sklearn pipeline, all textual features disappear. Any solution?
Here is my code below. I need to use SMOTENC to balance the dataset, which means I have to use the Pipeline from the imblearn library. However, it does not recognize the CountVectorizer features.
from imblearn.pipeline import Pipeline
# from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
vectorizer_params = dict(ngram_range=(1, 2), min_df=200, max_df=0.8)

categorical_features = ['F1', 'F2', 'F3', 'F4']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

textual_feature = ['F5']
text_transformer = Pipeline(
    steps=[
        ("squeez", FunctionTransformer(lambda x: x.squeeze())),
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("txt", text_transformer, textual_feature),
    ]
)

sgd_log_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ('smote', SMOTENC(random_state=11, categorical_features=[4, 5, 6, 7])),
        ("clf", SGDClassifier()),
    ]
)
Since you are using SMOTENC, there is no need to one-hot encode beforehand. If you look at the source code, you will see that SMOTENC internally one-hot encodes the categorical features you point it to.
One solution is to ordinal-encode your categorical features and let SMOTENC treat them as categorical.
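As a quick illustration of what ordinal encoding does (a minimal sketch on toy data, not part of the original answer):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Each category maps to an integer code, in sorted order: 'A'->0, 'B'->1, 'C'->2
enc = OrdinalEncoder()
codes = enc.fit_transform(np.array([['A'], ['C'], ['B'], ['A']]))
print(codes.ravel())  # [0. 2. 1. 0.]
```

SMOTENC can then be told, via its `categorical_features` argument, that these integer columns are categories, so it will not interpolate between the codes when generating synthetic samples.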
Using an example dataset:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, OrdinalEncoder
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTENC
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(
    subset='train', categories=['rec.autos', 'sci.space', 'comp.graphics']
)
n = len(newsgroups_train.data)
X = pd.DataFrame(np.random.choice(['A', 'B', 'C'], (n, 4)), columns=['F1', 'F2', 'F3', 'F4'])
X['F5'] = newsgroups_train.data
y = (newsgroups_train.target == 2).astype(int)
Your vectorizer part:
vectorizer_params = dict(ngram_range=(1, 2), min_df=200, max_df=0.8)
textual_feature = ['F5']
text_transformer = Pipeline(
    steps=[
        ("squeez", FunctionTransformer(lambda x: x.squeeze())),
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("toarray", FunctionTransformer(lambda x: x.toarray())),
    ]
)
Use an ordinal encoder instead of one-hot:
categorical_features = ['F1', 'F2', 'F3', 'F4']
categorical_transformer = OrdinalEncoder()
For the rest of the pipeline, we pass the first four columns to SMOTENC's `categorical_features=` argument, since your four categorical features come first in the ColumnTransformer output:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("txt", text_transformer, textual_feature),
    ]
)

sgd_log_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ('smote', SMOTENC(random_state=11, categorical_features=[0, 1, 2, 3])),
        ("clf", SGDClassifier()),
    ]
)
Now we test the preprocessing on the input. From your text vectorizer we get the expected output:
text_transformer.fit_transform(X[textual_feature]).shape
(1771, 178)
Together with our four ordinal-encoded features, the preprocessor output is what we expect:
preprocessor.fit_transform(X).shape
(1771, 182)
# for display purposes
preprocessor.fit_transform(X).round(3)
array([[0. , 0. , 0. , ..., 0.308, 0. , 0.361],
[1. , 2. , 2. , ..., 0.252, 0.12 , 0.099],
[2. , 0. , 1. , ..., 0.05 , 0. , 0. ],
...,
[1. , 1. , 2. , ..., 0.119, 0.226, 0. ],
[1. , 1. , 0. , ..., 0. , 0. , 0. ],
[1. , 0. , 1. , ..., 0. , 0. , 0. ]])
In this case, the first four columns are your four categorical features, ordinal-encoded. The final matrix will have 4 + (number of features from the text vectorizer) columns.
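To double-check the column-order assumption behind `categorical_features=[0, 1, 2, 3]`, here is a toy check (not from the original answer): ColumnTransformer concatenates transformer outputs in the order they are listed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({"cat": ["A", "B", "A"], "num": [1.0, 2.0, 3.0]})
ct = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["cat"]),   # listed first -> output column 0
    ("num", StandardScaler(), ["num"]),   # listed second -> output column 1
])
out = ct.fit_transform(df)
print(out[:, 0])  # ordinal codes in column 0: [0. 1. 0.]
```

Because the `"cat"` transformer is listed first, its codes always occupy the leading columns, which is why the SMOTENC indices above start at 0.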
Let's fit, and then we can check what actually reaches the classifier:
sgd_log_pipeline.fit(X, y)
sgd_log_pipeline.named_steps['clf'].n_features_in_
182