FeatureUnion: pipelining categorical features with tf-idf features throws an error
I am trying to concatenate tf-idf features with other categorical features and run classification on the resulting dataset. From various blogs I understand that FeatureUnion can be used to concatenate the features, and that the result can then be pipelined into an algorithm (in my case, Naive Bayes).
I followed the code from this link: http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
When I try to execute the code, it gives this error:
TypeError: no supported conversion for types: (dtype('O'),)
Below is the code I am trying to execute:
    class textdata():
        def transform(self, X, Y):
            return X[desc]
        def fit(self, X, Y):
            return self

    class one_hot_trans():
        def transform(self, X, Y):
            X = pd.get_dummies(X, columns=obj_cols)
            return X
        def fit(self, X, Y):
            return self

    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('ngram_tf_idf', Pipeline([
                ('text', textdata()),
                ('tf_idf', TfidfTransformer())
            ])),
            ('one_hot', one_hot_trans())
        ])),
        ('classifier', MultinomialNB())
    ])

    d_train, d_test, y_train, y_test = train_test_split(data, data[target], test_size=0.2, random_state=2018)
    pipeline.fit(d_train, y_train)
Can anyone help me resolve this error?
Note: data has 9 columns in total: 1 target variable (categorical), 1 text column (on which I want to perform tf-idf), and the rest are categorical (obj_cols in the code above).
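For context on the first error: TfidfTransformer expects a matrix of term counts, while TfidfVectorizer accepts raw strings and does the counting itself, so feeding a raw text column into TfidfTransformer is a plausible source of the dtype('O') complaint. The sketch below uses made-up documents (an assumption, not the asker's data) to compare the two routes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical ticket descriptions standing in for the real text column.
docs = [
    "printer not working",
    "cannot login to portal",
    "printer out of paper",
]

# Route 1: TfidfVectorizer works directly on raw strings.
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Route 2: count the terms first, then reweight the counts.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# With default settings both routes should give the same matrix.
print(tfidf_direct.shape, tfidf_two_step.shape)
print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))
```

Either route fits inside the 'ngram_tf_idf' sub-pipeline, as long as the step that receives raw text is a vectorizer.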
Edit: Thanks, Vivek. I did not notice that. By mistake I had used the Transformer instead of the Vectorizer. Even after replacing it, I am getting the error below.
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
579 **fit_params):
580 if hasattr(transformer, 'fit_transform'):
--> 581 res = transformer.fit_transform(X, y, **fit_params)
582 else:
583 res = transformer.fit(X, y, **fit_params).transform(X)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
745 self._update_transformer_list(transformers)
746 if any(sparse.issparse(f) for f in Xs):
--> 747 Xs = sparse.hstack(Xs).tocsr()
748 else:
749 Xs = np.hstack(Xs)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
598 if dtype is None:
599 all_dtypes = [blk.dtype for blk in blocks[block_mask]]
--> 600 dtype = upcast(*all_dtypes) if all_dtypes else None
601
602 row_offsets = np.append(0, np.cumsum(brow_lengths))
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\sputils.py in upcast(*args)
50 return t
51
---> 52 raise TypeError('no supported conversion for types: %r' % (args,))
53
54
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
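The pair (dtype('float64'), dtype('O')) in this traceback suggests that one block handed to sparse.hstack is numeric (the tf-idf matrix) and the other has object dtype. A minimal sketch, assuming a toy frame with one text column and one categorical column (the data is hypothetical; desc and obj_cols mirror the names in the question), shows how a text column left in the pd.get_dummies output turns the whole block into object dtype:

```python
import numpy as np
import pandas as pd

desc = "Description_ticket"
obj_cols = ["priority"]

# Hypothetical two-row frame shaped like the asker's data.
X = pd.DataFrame({
    desc: ["printer not working", "cannot login to portal"],
    "priority": ["high", "low"],
})

# Only obj_cols are encoded, so the raw text column is still present
# and the block as a whole can only be represented as object dtype.
with_text = pd.get_dummies(X, columns=obj_cols)
print(np.asarray(with_text).dtype)  # object

# Dropping the text column first leaves a purely numeric block.
without_text = pd.get_dummies(X.drop(columns=[desc]), columns=obj_cols, dtype=float)
print(np.asarray(without_text).dtype)  # float64
```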
Edit:
I checked the unique values in all the categorical variables except the description column, and found no values in the test data that are absent from the training data. Am I doing something wrong?
    for col in d_train.columns.drop(desc):
        ext = set(d_test[col].unique().tolist()) - set(d_train[col].unique().tolist())
        if ext:
            print("extra columns: \n\n", ext)
Edit2: Additional info: details of the d_train and d_test features are given below. Can anyone help? I am still getting a "dimension mismatch" error in the predict method.
obj cols:: ['priority', 'ticket_type', 'created_group', 'Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day']
d_train cols:: Index(['priority', 'ticket_type', 'created_group', 'Description_ticket', 'Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day'], dtype='object')
d_test cols:: Index(['priority', 'ticket_type', 'created_group', 'Description_ticket','Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day'], dtype='object')
d_train shape:: (95080, 8)
d_test shape:: (23770, 8)
desc:: Description_ticket
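One possible cause of the "dimension mismatch", offered as an assumption because the thread does not confirm it: pd.get_dummies builds columns only from the categories present in the frame it is given. The check in the question looks for values in test that are missing from train, but the reverse case (a category that occurs in train and never in test) also changes the number of columns. A toy illustration with made-up values:

```python
import pandas as pd

# Hypothetical data: "medium" occurs in train but never in test.
train = pd.DataFrame({"priority": ["high", "low", "medium"]})
test = pd.DataFrame({"priority": ["high", "low"]})

train_dummies = pd.get_dummies(train, columns=["priority"])
test_dummies = pd.get_dummies(test, columns=["priority"])

# The two encodings have different widths.
print(train_dummies.shape, test_dummies.shape)  # (3, 3) (2, 2)

# Columns the classifier was fitted on but which test does not produce.
missing_in_test = sorted(set(train_dummies.columns) - set(test_dummies.columns))
print(missing_in_test)  # ['priority_medium']

# One way to align the test encoding with the training layout.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.shape)  # (2, 3)
```

An encoder that is fitted once on the training data and then reused, such as scikit-learn's OneHotEncoder, avoids the manual alignment step.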
I think you are also passing the text column through the one_hot_trans function.
Can you try changing the output of one_hot_trans as follows?
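    class one_hot_trans():
        # Y is optional because FeatureUnion calls transform(X) without a target
        def transform(self, X, Y=None):
            # drop the text column so that only categorical columns are encoded
            X = pd.get_dummies(X.drop(desc, axis=1), columns=obj_cols)
            return X
        def fit(self, X, Y=None):
            return self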
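Putting the pieces together, here is a minimal sketch of one way the whole pipeline could look. It is not the asker's or the answerer's final code: ColumnSelector is a hypothetical helper, the toy frames are invented, and swapping pd.get_dummies for OneHotEncoder(handle_unknown="ignore") is a substitution made here so that the encoding learned on the training data is reused at predict time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pass through only the requested column(s)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


desc = "Description_ticket"
obj_cols = ["priority", "ticket_type"]  # shortened list for the toy data

pipeline = Pipeline([
    ("features", FeatureUnion([
        # Text branch: a single column (a Series of strings) into tf-idf.
        ("ngram_tf_idf", Pipeline([
            ("text", ColumnSelector(desc)),
            ("tf_idf", TfidfVectorizer()),
        ])),
        # Categorical branch: the text column never reaches the encoder.
        ("one_hot", Pipeline([
            ("cats", ColumnSelector(obj_cols)),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

# Invented toy data with the same kind of columns as in the question.
toy_train = pd.DataFrame({
    desc: ["printer not working", "cannot login to portal",
           "printer out of paper", "password reset for portal"],
    "priority": ["high", "low", "low", "high"],
    "ticket_type": ["hardware", "access", "hardware", "access"],
})
toy_target = ["HW", "ACC", "HW", "ACC"]
toy_test = pd.DataFrame({
    desc: ["printer jammed"],
    "priority": ["low"],
    "ticket_type": ["hardware"],
})

pipeline.fit(toy_train, toy_target)
preds = pipeline.predict(toy_test)
print(preds)
```

Because both branches are fitted on the training frame, the feature matrix built for toy_test has the same number of columns as the one used for fitting, even though the test frame contains fewer distinct categories.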