
Sentiment analysis Pipeline, problem getting the correct feature names when feature selection is used

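In the following example I use a Twitter dataset to perform sentiment analysis. I use an sklearn pipeline to perform a sequence of transformations, add features and add a classifier. The final step is to visualize the words that have the highest predictive power. It works fine when I don't use feature selection. When I do use it, however, the results I get make no sense. I suspect that the order of the text features changes when feature selection is applied. Is there a way to work around that?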

The code below has been updated to include the correct answer.

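import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC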

#df is the DataFrame holding the tweets (loading not shown)
features = [c for c in df.columns.values if c not in ['target']]
target = 'target'

#train test split, stratified on the target column
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, stratify=df[target], random_state=0)

#Transformer classes that select specific columns from the DataFrame

class NumberSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

class TextSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

class ColumnExtractor(TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xcols = X[self.cols]

        return Xcols

class DummyTransformer(TransformerMixin):

    def __init__(self):
        self.dv = None

    def fit(self, X, y=None):
        # assumes all columns of X are strings
        Xdict = X.to_dict('records')
        self.dv = DictVectorizer(sparse=False)
        self.dv.fit(Xdict)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdict = X.to_dict('records')
        Xt = self.dv.transform(Xdict)
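        # get_feature_names() was removed in scikit-learn 1.2; on versions < 1.0 use get_feature_names()
        cols = self.dv.get_feature_names_out()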
        Xdum = pd.DataFrame(Xt, index=X.index, columns=cols)

        # drop column indicating NaNs

        nan_cols = [c for c in cols if '=' not in c]
        Xdum = Xdum.drop(nan_cols, axis=1)
        Xdum.drop(list(Xdum.filter(regex = 'unknown')), axis = 1, inplace = True)

        return Xdum

def pipelinize(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            return [function(i) for i in list_or_series]
        else: # if it's not active, just pass it right back
            return list_or_series
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active':active})

#function to plot the coefficients of the words in the text with the highest predictive power
def plot_coefficients(classifier, feature_names, top_features=50):

    if classifier.__class__.__name__ == 'SVC':
        coef = classifier.coef_
        coef2 = coef.toarray().ravel()
        coef1 = coef2[:len(feature_names)]

    else:
        coef1 = classifier.coef_.ravel()

    top_positive_coefficients = np.argsort(coef1)[-top_features:]
    top_negative_coefficients = np.argsort(coef1)[:top_features]
    top_coefficients = np.hstack([top_negative_coefficients, top_positive_coefficients])
    # create plot
    plt.figure(figsize=(15, 5))
    colors = ['red' if c < 0 else 'blue' for c in coef1[top_coefficients]]
    plt.bar(np.arange(2 * top_features), coef1[top_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * top_features), feature_names[top_coefficients], rotation=90, ha='right')
    plt.show()

#create a custom stopwords list
stop_list = stopwords(remove_stop_word ,add_stop_word )

#vectorizer
tfidf=TfidfVectorizer(sublinear_tf=True, stop_words = set(stop_list),ngram_range = (1,2))

#categorical features
CAT_FEATS = ['location','account']

#dimensionality reduction
pca = TruncatedSVD(n_components=200)

#scaler for numerical features
scaler = StandardScaler()

#classifier
model = SVC(kernel = 'linear', probability=True, C=1, class_weight = 'balanced')

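#feature selection: keep the 8000 highest-scoring text features
select = SelectKBest(f_classif, k=8000)

text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)),
                 ('vectorizer',tfidf),
                 ('important_features',select)])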
followers =  Pipeline([('selector', NumberSelector(key='followers')),('scaler', scaler)])
location = Pipeline([('selector',ColumnExtractor(CAT_FEATS)),('scaler',DummyTransformer())])
feats = FeatureUnion([('text', text), ('length', followers), ('location',location)])
pipeline = Pipeline([('features',feats),('classifier', model)])
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)

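#names of all text features, in vectorizer column order
#(on scikit-learn < 1.0 use get_feature_names() instead)
feature_names = text.named_steps['vectorizer'].get_feature_names_out()
#keep only the names of the features retained by SelectKBest
feature_names = np.array(feature_names)[text.named_steps['important_features'].get_support(True)]

classifier = pipeline.named_steps['classifier']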

plot_coefficients(classifier, feature_names)

Before feature selection: [image]

After feature selection: [image]

To use feature selection, replace the following lines of code:

text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)),
                 ('vectorizer',tfidf)])

with:

select = SelectKBest(f_classif, k=8000)
text = Pipeline([('selector', TextSelector(key='content')),
                 ('text_preprocess', pipelinize(text_preprocessing)), 
                 ('vectorizer',tfidf), 
                 ('important_features',select)])

Why this happens

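This happens because feature selection keeps only the most important features and discards the rest, so the column indices of the selected matrix no longer match the indices of the vectorizer's feature names.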
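A minimal sketch of that mismatch, assuming a made-up four-sentence corpus (the texts, the labels and k=3 are illustrative and not taken from the question). The vectorizer knows every term, but the selector only passes k columns on, so column i of the selected matrix is generally not term i of the vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data, purely illustrative
docs = ["good movie", "great good film", "bad movie", "awful bad film"]
labels = np.array([1, 1, 0, 0])

tfidf = TfidfVectorizer()
X_all = tfidf.fit_transform(docs)            # one column per vocabulary term

select = SelectKBest(f_classif, k=3)
X_sel = select.fit_transform(X_all, labels)  # only the k best columns survive

# Positions of the surviving columns in the ORIGINAL vocabulary order
kept = select.get_support(indices=True)

print(X_all.shape, X_sel.shape)  # (4, 6) (4, 3)
print(kept)
```

Any name list taken straight from the vectorizer has one entry per original column, so it has to be filtered with `kept` before it is paired with anything computed on `X_sel`.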

Suppose you have the following example:

X = np.array(["This is the first document","This is the second document",
"This is the first again"])
y = np.array([0,1,0])

Clearly, the two main words that drive the classification are "first" and "second". Using a pipeline similar to yours, you can do:

tfidf = TfidfVectorizer()
sel = SelectKBest(k = 2)
pipe = Pipeline([('vectorizer',tfidf), ('select',sel)])
pipe.fit(X,y)

#on scikit-learn < 1.0 use get_feature_names() instead
feature_names = np.array(pipe['vectorizer'].get_feature_names_out())
feature_names[pipe['select'].get_support(True)]

>>> array(['first', 'second'], dtype=object)

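So what you need to do is not only take the feature names from the tfidf vectorizer, but also index them with the indices kept by the feature selection, which you get from pipe['select'].get_support(True).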
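To go one step further, here is a rough sketch of pairing each selected name with its coefficient. It reuses the toy corpus from this answer; the `LogisticRegression` classifier is my own substitution for illustration (the question uses a linear `SVC`), and the names are read from `vocabulary_` only so that the snippet does not depend on the scikit-learn version:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array(["This is the first document",
              "This is the second document",
              "This is the first again"])
y = np.array([0, 1, 0])

# LogisticRegression is an assumption made for this sketch, not the question's model
pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("select", SelectKBest(f_classif, k=2)),
                 ("classifier", LogisticRegression())])
pipe.fit(X, y)

# Terms ordered by their column index in the tf-idf matrix
vocabulary = pipe["vectorizer"].vocabulary_
all_names = np.array(sorted(vocabulary, key=vocabulary.get))

# Keep only the names of the columns the selector let through
names = all_names[pipe["select"].get_support(indices=True)]

# The classifier only ever saw the selected columns, so coef_ lines up with names
coefs = pipe["classifier"].coef_.ravel()
for name, coef in zip(names, coefs):
    print(name, coef)
```

The point is that `coef_` has one entry per selected column, so it only lines up with a name list that has been filtered the same way.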

What you need to change in your code

So what you should change in your code is to add the following line:

#on scikit-learn < 1.0 use get_feature_names() instead
feature_names = np.array(text.named_steps['vectorizer'].get_feature_names_out())
## Add this line
feature_names = feature_names[text['important_features'].get_support(True)]
##
classifier = pipeline.named_steps['classifier']
plot_coefficients(classifier, feature_names)
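One more thing worth checking in a setup like yours: the classifier sits behind a FeatureUnion, so its coefficient vector covers the text block plus the other blocks, concatenated in the order the transformers are listed. The sketch below is a simplified stand-in under my own assumptions: a hypothetical `text_length` feature replaces the followers and location branches, and `LogisticRegression` replaces the `SVC`. It only illustrates that the first `len(text_names)` coefficients belong to the selected text features:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

docs = np.array(["This is the first document",
                 "This is the second document",
                 "This is the first again"])
y = np.array([0, 1, 0])

def text_length(texts):
    # Hypothetical extra feature: number of characters, as a single column
    return np.array([[len(t)] for t in texts], dtype=float)

text = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("important_features", SelectKBest(f_classif, k=2))])
feats = FeatureUnion([("text", text),
                      ("length", FunctionTransformer(text_length))])
pipeline = Pipeline([("features", feats),
                     ("classifier", LogisticRegression())])
pipeline.fit(docs, y)

# Fetch the fitted text branch back out of the union
fitted_text = dict(pipeline.named_steps["features"].transformer_list)["text"]
vocabulary = fitted_text.named_steps["vectorizer"].vocabulary_
all_names = np.array(sorted(vocabulary, key=vocabulary.get))
text_names = all_names[fitted_text.named_steps["important_features"].get_support(indices=True)]

# Blocks are concatenated in transformer_list order: text first, then length
union_names = list(text_names) + ["length"]
coefs = pipeline.named_steps["classifier"].coef_.ravel()
text_coefs = coefs[:len(text_names)]

print(dict(zip(union_names, coefs)))
```

This is the same idea as the `coef2[:len(feature_names)]` slice in `plot_coefficients`: the slice is only right if the text branch comes first in the union and `feature_names` has already been filtered by the selector.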
