
Finding Tf-Idf scores of only selected words from a set of documents using scikit-learn

I have a set of documents (stored as .txt files). I also have a Python dictionary of some selected words. I want to assign tf-idf scores only to these words, not to all words, from the set of documents. How can this be done using scikit-learn or any other library?

I have referred to this blog post, but it gives scores for the full vocabulary.

You can do it with CountVectorizer, which scans the documents as text and converts them into a term-document matrix, and then applying TfidfTransformer to that matrix.

These two steps can also be combined and done together with TfidfVectorizer.
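Since you only want scores for your selected words, note that both CountVectorizer and TfidfVectorizer accept a `vocabulary` argument that restricts the feature set to exactly those terms. A minimal sketch (the documents and word list below are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents; in practice, read these from your .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Only these words get tf-idf columns; all other words are ignored.
selected_words = ["cat", "dog", "mat"]

vectorizer = TfidfVectorizer(vocabulary=selected_words)
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_selected) matrix

print(tfidf.toarray())
```

The column order follows the order of `selected_words`, so column 0 is "cat", column 1 is "dog", and so on.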

These are in the sklearn.feature_extraction.text module [ link ].

Both approaches return the same sparse matrix representation, on which I presume you will probably apply an SVD transform via TruncatedSVD to get a smaller dense matrix.
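If you do want that lower-dimensional dense representation, a sketch of the TruncatedSVD step (the corpus here is a placeholder; any list of strings works):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "the mat was red",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix

# Reduce to 2 latent dimensions; n_components must be < n_features.
svd = TruncatedSVD(n_components=2, random_state=0)
dense = svd.fit_transform(tfidf)  # dense (n_docs, n_components) array

print(dense.shape)
```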

You can of course also do it yourself, which requires keeping two maps: one per document for term counts, and one overall for document frequencies. That is how these classes operate under the hood.
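The do-it-yourself version can be sketched as below, using the classic count × log(N/df) weighting (scikit-learn's TfidfTransformer uses a smoothed, normalized variant, so the exact numbers will differ; the documents and selected words are placeholders):

```python
import math
from collections import Counter

# Placeholder tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs are pets".split(),
]
selected = {"cat", "dog"}

# One map per document: raw term counts.
tf = [Counter(doc) for doc in docs]

# One overall map: how many documents contain each term.
df = Counter()
for doc in docs:
    df.update(set(doc))

n_docs = len(docs)

def tfidf(term, doc_index):
    # Unseen terms get a score of zero.
    if df[term] == 0:
        return 0.0
    return tf[doc_index][term] * math.log(n_docs / df[term])

# Score only the selected words, per document.
scores = {t: [tfidf(t, i) for i in range(n_docs)] for t in selected}
print(scores)
```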

This page has some nice examples.
