
Document Classification with scikit-learn: most efficient way to get the words (tokens) that had the most impact on the classification

I have built a binary (binomial) document classifier using a tf-idf representation of a training set of documents and applying Logistic Regression to it:

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)
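
For reference, a minimal, self-contained version of this setup looks like the following (assuming tfidf is a plain TfidfVectorizer with default parameters, which is not shown above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer()  # assumption: default parameters
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)  # X_train: iterable of raw documents, y_train: class labels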

I have saved the model in pickle format and used it to classify new documents, getting as a result the probability that the document is in class A and the probability that it is in class B.

text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
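
The columns of the result follow the order of the classifier's classes_ attribute (which the Pipeline exposes), so the probabilities can be mapped back to the class labels roughly like this:

# results has shape (n_documents, n_classes); columns follow text_model.classes_
for label, prob in zip(text_model.classes_, results[0]):
    print(label, prob)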

What is the best way to also get the words (or, more generally, the tokens) that had the most impact on the classification? I would expect to get the following (a rough sketch of what I mean appears after the list):

  • The N tokens contained in the document that have the highest coefficients as features in the Logistic Regression model
  • The N tokens contained in the document that have the lowest coefficients as features in the Logistic Regression model
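
In other words, something along the lines of this minimal sketch (top_and_bottom_tokens is just a hypothetical helper name; it assumes the step names 'vect' and 'clf' from the pipeline above):

import numpy as np

def top_and_bottom_tokens(pipeline, document, n=10):
    # Hypothetical helper: among the tokens that actually occur in `document`,
    # return the n with the highest and the n with the lowest LR coefficients.
    vect = pipeline.named_steps['vect']
    clf = pipeline.named_steps['clf']
    feature_names = np.array(vect.get_feature_names())  # get_feature_names_out() in newer sklearn
    present = vect.transform([document]).nonzero()[1]    # indices of tokens present in the document
    order = present[np.argsort(clf.coef_[0][present])]   # those tokens, sorted by coefficient
    lowest = list(zip(feature_names[order[:n]], clf.coef_[0][order[:n]]))
    highest = list(zip(feature_names[order[-n:][::-1]], clf.coef_[0][order[-n:][::-1]]))
    return highest, lowest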

I am using sklearn v 0.19.

There is a solution on GitHub to print the most important features obtained from a classifier within a pipeline:

https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1

You want to use the show_most_informative_features function in their script. I used it and it works fine.

Here is a copy-paste of the GitHub poster's code:

from operator import itemgetter  # needed for the sort on coefficient values below


def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    # (the step names must match your pipeline; the question uses 'vect' and 'clf')
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the coefficients for the text
        tvec = model.transform([text]).toarray()
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )

    topn = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)

Here is a modified version of the show_most_informative_features function that also accepts a separately supplied vectorizer (it still requires a linear classifier with a coef_ attribute):

import operator  # for operator.itemgetter used in the sort below


def show_most_informative_features(model, vectorizer=None, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    if vectorizer is None:
        vectorizer = model.named_steps['vectorizer']
    else:
        # NOTE: a separately supplied vectorizer is fitted on the single text,
        # so its vocabulary comes from that text alone
        vectorizer.fit_transform([text])

    classifier = model.named_steps['classifier']
    feat_names = vectorizer.get_feature_names()

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

    # Otherwise simply use the coefficients
    tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], feat_names),
        key=operator.itemgetter(0), reverse=True
    )

    # Get the top n and bottom n coef, name pairs
    topn = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)

Then you can call the function like this: 然后,您可以像下面这样调用函数:

vectorizer = TfidfVectorizer()
show_most_informative_features(model, vectorizer, "your text")  # model is the fitted pipeline

From my understanding, you just want to look at the model's parameters and sort them according to the coefficient value. The fitted LogisticRegression exposes its coefficients through the coef_ attribute (get_params() only returns the hyperparameters). You can argsort the coefficients and select the top N and bottom N.
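
A minimal sketch of that approach, reusing the step names 'vect' and 'clf' from the question's pipeline:

import numpy as np

feature_names = np.array(lr_tfidf.named_steps['vect'].get_feature_names())
coefs = lr_tfidf.named_steps['clf'].coef_[0]
order = np.argsort(coefs)  # feature indices sorted by coefficient, ascending
print("bottom N:", list(zip(feature_names[order[:10]], coefs[order[:10]])))
print("top N:", list(zip(feature_names[order[-10:][::-1]], coefs[order[-10:][::-1]])))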
