KeyError on FeatureUnion between TfIdf and custom features

I am trying to build a model that applies TfidfVectorizer to a text column and also uses a couple of other columns holding extra data about the text. The code below reproduces what I'm trying to do and the error I get.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

class ParStats(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[0])
        return [{'feat_1': x['feat_1'],
                 'feat_2': x['feat_2']}
                for x in X]

class ItemSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

def feature_union_test():

    # create test data frame
    test_data = {
        'text': ['And the silken, sad, uncertain rustling of each purple curtain',
                 'Thrilled me filled me with fantastic terrors never felt before',
                 'So that now, to still the beating of my heart, I stood repeating',
                 'Tis some visitor entreating entrance at my chamber door',
                 'Some late visitor entreating entrance at my chamber door',
                 'This it is and nothing more'],
        'feat_1': [4, 7, 10, 7, 4, 6],
        'feat_2': [1, 5, 5, 1, 1, 10],
        'ignore': [1, 1, 1, 0, 0, 0]
    }
    test_df = pd.DataFrame(data=test_data)
    y_train = test_df['ignore'].values.astype('int')

    # Feature Union Pipeline
    pipeline = FeatureUnion([

                ('text', Pipeline([
                    ('selector', ItemSelector(key='text')),
                    ('tfidf', TfidfVectorizer(max_df=0.5)),
                ])),

                ('parstats', Pipeline([
                    ('stats', ParStats()),
                    ('vect', DictVectorizer()),
                ]))

            ])

    tfidf = pipeline.fit_transform(test_df)

    # fits Naive Bayes
    clf = BernoulliNB().fit(tfidf, y_train)

feature_union_test()

When I run this, I get the following error message:

Traceback (most recent call last):
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\pandas\core\indexes\base.py", line 3064, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

I've tried several different variations of the pipeline and I always get some sort of error, so obviously I'm missing something. What am I doing wrong?

The error occurs in the transform method of your ParStats class.

First of all, indexing a pandas DataFrame with X[0] looks up a column labelled 0, not the first row. Your DataFrame has no such column, so print(X[0]) raises the KeyError: 0 you saw.

You also can't iterate over a pandas DataFrame the way you are doing it: iterating yields the column labels rather than the rows, so x['feat_1'] would fail as well.
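A minimal sketch of both behaviours, using a small made-up DataFrame (the two-row data here is an assumption for illustration, not the question's data):

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the question.
df = pd.DataFrame({'feat_1': [4, 7], 'feat_2': [1, 5]})

# df[0] looks for a column labelled 0, not for the first row.
try:
    df[0]
    raised = False
except KeyError:
    raised = True

# Iterating over a DataFrame yields its column labels, not its rows.
labels = [x for x in df]

# Positional row access goes through .iloc instead.
first_row = df.iloc[0].to_dict()
```

Here `raised` ends up True and `labels` holds the column names, which is why the list comprehension in the original transform never sees a row.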

Here is a possible working version of the method:

def transform(self, X):
    return [{'feat_1': x[0], 'feat_2': x[1]} 
            for x in X[['feat_1', 'feat_2']].values]

Of course, there are many other possible solutions, but you get the idea.
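One such alternative, sketched here under the assumption that X is always a DataFrame containing feat_1 and feat_2, is to let pandas build the list of dicts with to_dict(orient='records'). The three-row frame below is made up for illustration:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class ParStats(BaseEstimator, TransformerMixin):
    """Turn the two numeric columns into one dict per row."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One {'feat_1': ..., 'feat_2': ...} dict per row of the frame.
        return X[['feat_1', 'feat_2']].to_dict(orient='records')


# Hypothetical sample data, not the question's full frame.
df = pd.DataFrame({'feat_1': [4, 7, 10], 'feat_2': [1, 5, 5]})

stats = Pipeline([
    ('stats', ParStats()),
    ('vect', DictVectorizer()),
])
matrix = stats.fit_transform(df)  # sparse matrix, one column per feature
```

DictVectorizer sorts feature names by default, so the columns of `matrix` come out in the order feat_1, feat_2.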

OK. After the discussion in the comments, this is your problem statement:

You want to pass the columns feat_1 and feat_2, along with the tf-idf of the text column, to your ML model.

So the only thing you need to do is this:

# Feature Union Pipeline
pipeline = FeatureUnion([('text', Pipeline([('selector', ItemSelector(key='text')),
                                            ('tfidf', TfidfVectorizer(max_df=0.5)),
                                           ])),
                         ('non_text', ItemSelector(key=['feat_1', 'feat_2']))
                        ])

tfidf = pipeline.fit_transform(test_df)

The ItemSelector defined in the question can select several columns at once when given a list of keys. Those columns are appended after the tf-idf columns returned by the text branch of the FeatureUnion.
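For reference, a self-contained sketch of this approach run end to end on the question's data. One detail is an assumption added here and not part of the answer: this ItemSelector converts a multi-column selection to a NumPy array before the union stacks it next to the sparse tf-idf matrix.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list of keys)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selected = X[self.key]
        # Assumption: hand multi-column selections over as a plain array.
        if isinstance(selected, pd.DataFrame):
            return selected.to_numpy()
        return selected


df = pd.DataFrame({
    'text': ['And the silken, sad, uncertain rustling of each purple curtain',
             'Thrilled me filled me with fantastic terrors never felt before',
             'So that now, to still the beating of my heart, I stood repeating',
             'Tis some visitor entreating entrance at my chamber door',
             'Some late visitor entreating entrance at my chamber door',
             'This it is and nothing more'],
    'feat_1': [4, 7, 10, 7, 4, 6],
    'feat_2': [1, 5, 5, 1, 1, 10],
    'ignore': [1, 1, 1, 0, 0, 0],
})
y_train = df['ignore'].values.astype('int')

union = FeatureUnion([
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    ('non_text', ItemSelector(key=['feat_1', 'feat_2'])),
])

features = union.fit_transform(df)         # tf-idf columns, then feat_1, feat_2
clf = BernoulliNB().fit(features, y_train)
predictions = clf.predict(features)
```

The last two columns of `features` hold feat_1 and feat_2 unchanged. Note that BernoulliNB binarizes its inputs at 0 by default, so the numeric columns may need scaling or a different classifier to be informative.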
