KeyError on FeatureUnion between TF-IDF and custom features
I am trying to build a model that applies TfidfVectorizer to a text column and also uses a couple of other columns that hold extra data about the text. The code below reproduces what I'm trying to do and the error I get.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB


class ParStats(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[0])
        return [{'feat_1': x['feat_1'],
                 'feat_2': x['feat_2']}
                for x in X]


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]


def feature_union_test():
    # create test data frame
    test_data = {
        'text': ['And the silken, sad, uncertain rustling of each purple curtain',
                 'Thrilled me filled me with fantastic terrors never felt before',
                 'So that now, to still the beating of my heart, I stood repeating',
                 'Tis some visitor entreating entrance at my chamber door',
                 'Some late visitor entreating entrance at my chamber door',
                 'This it is and nothing more'],
        'feat_1': [4, 7, 10, 7, 4, 6],
        'feat_2': [1, 5, 5, 1, 1, 10],
        'ignore': [1, 1, 1, 0, 0, 0]
    }
    test_df = pd.DataFrame(data=test_data)
    y_train = test_df['ignore'].values.astype('int')

    # Feature Union Pipeline
    pipeline = FeatureUnion([
        ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf', TfidfVectorizer(max_df=0.5)),
        ])),
        ('parstats', Pipeline([
            ('stats', ParStats()),
            ('vect', DictVectorizer()),
        ]))
    ])
    tfidf = pipeline.fit_transform(test_df)

    # fits Naive Bayes
    clf = BernoulliNB().fit(tfidf, y_train)


feature_union_test()

When I run this, I get the following error messages:
Traceback (most recent call last):
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\pandas\core\indexes\base.py", line 3064, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
I've tried several different variations of the pipeline and I always get some sort of error, so obviously I'm missing something. What am I doing wrong?
The error occurs in transform in your ParStats class.
First of all, indexing a pandas DataFrame with X[0] looks up a column labelled 0, not the first row. Your frame has no such column, so print(X[0]) is what throws the KeyError: 0 you saw.
Second, you can't iterate over a pandas DataFrame the way you are doing it: looping over the frame yields its column labels, not its rows.
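Both behaviours can be checked on a small frame. The snippet below is only an illustrative sketch: the two-row frame and the names df, raised, labels and first_row are made up for the demonstration and are not part of the question's code.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({'feat_1': [4, 7], 'feat_2': [1, 5]})

# df[0] is a column lookup by label, and there is no column labelled 0.
try:
    df[0]
    raised = False
except KeyError:
    raised = True

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df)

# Positional row access goes through .iloc instead.
first_row = df.iloc[0].to_dict()
```

Here raised ends up True, labels holds the two column names, and first_row holds the first row as a plain dict.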
Here is a possible working version of the method:
def transform(self, X):
    return [{'feat_1': x[0], 'feat_2': x[1]}
            for x in X[['feat_1', 'feat_2']].values]
Of course, there are a lot of other possible solutions, but you get the idea.
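As one example of such an alternative, pandas can build the list of dicts itself through DataFrame.to_dict('records'). This is a sketch rather than the only correct form; the sample frame is made up, and the mixin is listed before BaseEstimator, which is the order scikit-learn's developer guide suggests.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ParStats(TransformerMixin, BaseEstimator):
    """Turn the two numeric columns into one dict per row for DictVectorizer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One {'feat_1': ..., 'feat_2': ...} dict per row of the frame.
        return X[['feat_1', 'feat_2']].to_dict('records')


# Illustrative data only; the extra column is ignored by the transformer.
sample = pd.DataFrame({'text': ['a', 'b'],
                       'feat_1': [4, 7],
                       'feat_2': [1, 5]})
records = ParStats().transform(sample)
```

The resulting list can be passed straight to DictVectorizer, as in the original parstats branch.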
OK, so after the discussion in the comments, this is your problem statement.
You want to pass the columns feat_1 and feat_2, along with the tfidf of the text column, to your ML model.
So the only thing you need to do is this:
# Feature Union Pipeline
pipeline = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    ('non_text', ItemSelector(key=['feat_1', 'feat_2']))
])

tfidf = pipeline.fit_transform(test_df)
The default ItemSelector can also select several columns at once when it is given a list of keys. FeatureUnion then appends those columns after the tfidf columns returned by the text part of the union.
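To check that the numeric columns really end up after the tfidf columns, the pipeline can be run on the question's sample data and the last two columns compared with feat_1 and feat_2. This is a self-contained sketch: ItemSelector is repeated here (with the mixin listed first), and the names combined and numeric_tail are introduced only for the check.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one column (string key) or several columns (list of keys)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data):
        return data[self.key]


test_df = pd.DataFrame({
    'text': ['And the silken, sad, uncertain rustling of each purple curtain',
             'Thrilled me filled me with fantastic terrors never felt before',
             'So that now, to still the beating of my heart, I stood repeating',
             'Tis some visitor entreating entrance at my chamber door',
             'Some late visitor entreating entrance at my chamber door',
             'This it is and nothing more'],
    'feat_1': [4, 7, 10, 7, 4, 6],
    'feat_2': [1, 5, 5, 1, 1, 10],
})

pipeline = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    ('non_text', ItemSelector(key=['feat_1', 'feat_2'])),
])

# The text branch is sparse, so the union comes back as a sparse matrix.
combined = pipeline.fit_transform(test_df)

# The last two columns should be feat_1 and feat_2, unchanged.
numeric_tail = combined[:, -2:].toarray()
```

The combined matrix has one row per document, and it can be passed to BernoulliNB().fit together with the labels exactly as in the question.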