Document Classification with scikit-learn: most efficient way to get the words (token) that impacted more on the classification
I built a binary document classifier using a tf-idf representation of the training document set, and applied logistic regression to it:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I saved the model in pickle format and use it to classify new documents; the result is the probability that the document belongs to class A and the probability that it belongs to class B.
import pickle
text_model = pickle.load(open('text_model.pkl', 'rb'))
# the pipeline expects an iterable of raw documents, hence the list
results = text_model.predict_proba([new_document])
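A note on reading the output: the columns of predict_proba follow the order of the fitted model's classes_ attribute (a Pipeline forwards classes_ to its final estimator), so you can map each probability to its class label explicitly. A minimal sketch, assuming the snippet above:

# Columns of predict_proba are ordered according to text_model.classes_,
# so zipping the two maps each probability to its class label.
for label, prob in zip(text_model.classes_, results[0]):
    print("P(class={}) = {:.4f}".format(label, prob))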
What would be the most efficient way to also get the words (or, more generally, the tokens) that had the most impact on the classification? I would expect to get:

I am using sklearn v0.19.
There is a solution on GitHub that prints the most informative features obtained from the classifier in a pipeline:
https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1
You want to use the show_most_informative_features function in that script. I used it, and it works fine.
Here is a copy-paste of the code from the GitHub author:
from operator import itemgetter

def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    # (the step names must match those used when the Pipeline was built)
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the tf-idf vector for the text; the original gist called
        # model.transform([text]), which breaks when the final Pipeline step
        # has no transform method
        tvec = vectorizer.transform([text]).toarray()
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )
    # Pair the top n (most positive) with the bottom n (most negative)
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15} {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)
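Note that the gist expects the Pipeline steps to be named 'vectorizer' and 'classifier', while the pipeline in the question uses 'vect' and 'clf', so you either need to rename the steps or adjust the named_steps lookups. A minimal usage sketch under the gist's naming (X_train, y_train, and new_document are the question's data, used as placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical pipeline using the step names the gist expects
model = Pipeline([('vectorizer', TfidfVectorizer()),
                  ('classifier', LogisticRegression(random_state=0))])
model.fit(X_train, y_train)

# Top/bottom n features of the model overall...
print(show_most_informative_features(model, n=10))
# ...and the most informative tokens for one specific document
print(show_most_informative_features(model, text=new_document, n=10))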
Here is a modified version of the show_most_informative_features function that can be used with any classifier pipeline (it still requires a linear model exposing coef_, but lets you pass the vectorizer in yourself):
import operator

def show_most_informative_features(model, vectorizer=None, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    if vectorizer is None:
        vectorizer = model.named_steps['vectorizer']
    else:
        # NOTE: fitting on the single text builds a vocabulary from that text
        # alone; the feature names will only line up with the classifier's
        # coef_ if this vocabulary matches the one used during training
        vectorizer.fit_transform([text])

    classifier = model.named_steps['classifier']
    feat_names = vectorizer.get_feature_names()

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

    # Otherwise simply use the coefficients
    tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], feat_names),
        key=operator.itemgetter(0), reverse=True
    )

    # Get the top n and bottom n coef, name pairs
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15} {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)
You can then call the function like this:
vectorizer = TfidfVectorizer()
show_most_informative_features(model, vectorizer, "your text")
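A further note, not part of either snippet above: if what you want is specifically the tokens that drove the classification of one particular document, a common variant is to weight each coefficient by the document's tf-idf value, so that only tokens actually present in the document contribute. A sketch, assuming the question's lr_tfidf pipeline with steps named 'vect' and 'clf':

import numpy as np

vect = lr_tfidf.named_steps['vect']
clf = lr_tfidf.named_steps['clf']

# Per-token contribution to this document's decision: coef * tf-idf weight
doc_vec = vect.transform([new_document]).toarray()[0]
contributions = clf.coef_[0] * doc_vec
names = vect.get_feature_names()

# Show the tokens with the largest absolute contribution
for i in np.argsort(np.abs(contributions))[::-1][:20]:
    if doc_vec[i] > 0:  # keep only tokens that occur in the document
        print("{:>15} {:+.4f}".format(names[i], contributions[i]))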
As I understand it, you just need to look at the model's coefficients and sort them by value. The fitted classifier's coef_ attribute gives you the coefficients (note that .get_params() returns hyperparameters, not coefficients). You can argsort them and select the top N and bottom N.
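A minimal sketch of that idea, assuming the fitted lr_tfidf pipeline from the question (for binary logistic regression, positive coefficients push towards classes_[1], negative ones towards classes_[0]):

import numpy as np

coefs = lr_tfidf.named_steps['clf'].coef_[0]
names = lr_tfidf.named_steps['vect'].get_feature_names()

# argsort orders indices from the most negative to the most positive coefficient
order = np.argsort(coefs)
top_n = [(names[i], coefs[i]) for i in order[-20:][::-1]]  # strongest for classes_[1]
bot_n = [(names[i], coefs[i]) for i in order[:20]]         # strongest for classes_[0]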