How do I get word/term frequencies in a corpus using Scikit Learn?
I have a corpus of documents and I'd like to extract the word frequencies in each document. I could use CountVectorizer() to get term counts per document, and I could use TfidfVectorizer() to get term frequency-inverse document frequency, but neither seems to give me term frequencies alone. How do I get term frequencies?
This related question seems to ask my question, but the question and answers there concern term counts, not term frequencies. Maybe I'm the one misunderstanding these terms, but my understanding is that term counts are the integer number of times each term appears in the document, whereas term frequencies are the term counts divided by the document length.
There is the TfidfTransformer for this purpose. From the docs:
Transform a count matrix to a normalized tf or tf-idf representation
Since it only transforms a count matrix, you need to apply it to an already vectorized matrix, or run CountVectorizer first:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
X_count = CountVectorizer().fit_transform(X_train)  # needed only if X_train is raw text rather than a count matrix
X_tf = TfidfTransformer(use_idf=False).fit_transform(X_count)
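Note that by setting use_idf=False you get the term frequency only, with no idf weighting. The transformer still normalizes each row, and the default is norm='l2' (unit Euclidean length). If you want term counts divided by the document length, as defined in the question, pass norm='l1' as well.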
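To check what the transformer returns, here is a minimal sketch on a toy corpus. The two documents in docs are made up for illustration, and norm="l1" is my choice here (not part of the snippet above) so that each row is divided by the number of counted tokens in that document. The manual division is included only as a comparison and assumes no document is empty.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(docs)  # sparse matrix of raw term counts

# use_idf=False drops the idf weighting; norm="l1" divides each row by its sum,
# i.e. by the number of tokens CountVectorizer counted in that document.
X_tf = TfidfTransformer(use_idf=False, norm="l1").fit_transform(X_count)

# The same quantity computed by hand, for comparison.
# Assumes every document has at least one counted token (no division by zero).
counts = X_count.toarray()
manual_tf = counts / counts.sum(axis=1, keepdims=True)

# Frequency of "the" in the second document: 2 occurrences out of 6 tokens.
the_index = vectorizer.vocabulary_["the"]
tf_the_doc2 = X_tf.toarray()[1, the_index]
print(tf_the_doc2)
```

Keep in mind that the document length here is the number of tokens CountVectorizer actually keeps. With the default settings, single-character tokens are dropped, and options such as stop_words or max_features would shrink the count further.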