Document Classification with scikit-learn: most efficient way to get the words (token) that impacted more on the classification
I built a binary document classifier using a tf-idf representation of the training document set, and applied logistic regression to it:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I saved the model in pickle format and use it to classify new documents; the result is the probability that the document belongs to class A and the probability that it belongs to class B.
import pickle
text_model = pickle.load(open('text_model.pkl', 'rb'))
# the pipeline expects an iterable of raw documents, hence the list
results = text_model.predict_proba([new_document])
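A note on reading the output: the columns of predict_proba follow the order of the fitted model's classes_ attribute (a Pipeline forwards classes_ to its final estimator), so you can map each probability to its class label explicitly. A minimal sketch, assuming the snippet above:

# Columns of predict_proba are ordered according to text_model.classes_,
# so zipping the two maps each probability to its class label.
for label, prob in zip(text_model.classes_, results[0]):
    print("P(class={}) = {:.4f}".format(label, prob))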
What would be the most efficient way to also get the words (or, more generally, the tokens) that had the most impact on the classification? I would expect to get:

I am using sklearn v0.19.
There is a solution on GitHub that prints the most informative features obtained from the classifier in a pipeline:
https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1
You want to use the show_most_informative_features function in that script. I used it, and it works fine.
Here is a copy-paste of the code from the GitHub author:
from operator import itemgetter

def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    # (the step names must match those used when the Pipeline was built)
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the tf-idf vector for the text; the original gist called
        # model.transform([text]), which breaks when the final Pipeline step
        # has no transform method
        tvec = vectorizer.transform([text]).toarray()
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )
    # Pair the top n (most positive) with the bottom n (most negative)
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15} {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)
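Note that the gist expects the Pipeline steps to be named 'vectorizer' and 'classifier', while the pipeline in the question uses 'vect' and 'clf', so you either need to rename the steps or adjust the named_steps lookups. A minimal usage sketch under the gist's naming (X_train, y_train, and new_document are the question's data, used as placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical pipeline using the step names the gist expects
model = Pipeline([('vectorizer', TfidfVectorizer()),
                  ('classifier', LogisticRegression(random_state=0))])
model.fit(X_train, y_train)

# Top/bottom n features of the model overall...
print(show_most_informative_features(model, n=10))
# ...and the most informative tokens for one specific document
print(show_most_informative_features(model, text=new_document, n=10))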
Here is a modified version of the show_most_informative_features function that can be used with any classifier pipeline (it still requires a linear model exposing coef_, but lets you pass the vectorizer in yourself):
import operator

def show_most_informative_features(model, vectorizer=None, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    if vectorizer is None:
        vectorizer = model.named_steps['vectorizer']
    else:
        # NOTE: fitting on the single text builds a vocabulary from that text
        # alone; the feature names will only line up with the classifier's
        # coef_ if this vocabulary matches the one used during training
        vectorizer.fit_transform([text])

    classifier = model.named_steps['classifier']
    feat_names = vectorizer.get_feature_names()

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

    # Otherwise simply use the coefficients
    tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], feat_names),
        key=operator.itemgetter(0), reverse=True
    )

    # Get the top n and bottom n coef, name pairs
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15} {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)
You can then call the function like this:
vectorizer = TfidfVectorizer()
show_most_informative_features(model, vectorizer, "your text")
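A further note, not part of either snippet above: if what you want is specifically the tokens that drove the classification of one particular document, a common variant is to weight each coefficient by the document's tf-idf value, so that only tokens actually present in the document contribute. A sketch, assuming the question's lr_tfidf pipeline with steps named 'vect' and 'clf':

import numpy as np

vect = lr_tfidf.named_steps['vect']
clf = lr_tfidf.named_steps['clf']

# Per-token contribution to this document's decision: coef * tf-idf weight
doc_vec = vect.transform([new_document]).toarray()[0]
contributions = clf.coef_[0] * doc_vec
names = vect.get_feature_names()

# Show the tokens with the largest absolute contribution
for i in np.argsort(np.abs(contributions))[::-1][:20]:
    if doc_vec[i] > 0:  # keep only tokens that occur in the document
        print("{:>15} {:+.4f}".format(names[i], contributions[i]))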
As I understand it, you just need to look at the model's coefficients and sort them by value. The fitted classifier's coef_ attribute gives you the coefficients (note that .get_params() returns hyperparameters, not coefficients). You can argsort them and select the top N and bottom N.
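A minimal sketch of that idea, assuming the fitted lr_tfidf pipeline from the question (for binary logistic regression, positive coefficients push towards classes_[1], negative ones towards classes_[0]):

import numpy as np

coefs = lr_tfidf.named_steps['clf'].coef_[0]
names = lr_tfidf.named_steps['vect'].get_feature_names()

# argsort orders indices from the most negative to the most positive coefficient
order = np.argsort(coefs)
top_n = [(names[i], coefs[i]) for i in order[-20:][::-1]]  # strongest for classes_[1]
bot_n = [(names[i], coefs[i]) for i in order[:20]]         # strongest for classes_[0]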