How do I get word/term frequencies in a corpus using Scikit Learn?

I have a corpus of documents and I'd like to extract the word frequencies in each document. I could use CountVectorizer() to get term counts per document, and I could use TfidfVectorizer() to get term frequency-inverse document frequency, but neither seems to give me term frequencies alone. How do I get term frequencies?

This related question seems to ask my question, but the question and answers there concern term counts, not term frequencies. Maybe I'm the one misunderstanding these terms, but my understanding is that term counts are the integer number of times each term appears in the document, whereas term frequencies are the term counts divided by the document length.

There is the TfidfTransformer for this purpose. From the docs:

Transform a count matrix to a normalized tf or tf-idf representation

Since it only transforms a count matrix, you need to either pass it an already vectorized matrix or run CountVectorizer first:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


X_count = CountVectorizer().fit_transform(X_train)  # skip this step if X_train is already a count matrix
X_tf = TfidfTransformer(use_idf=False).fit_transform(X_count)

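Note that setting use_idf=False gives you the term frequencies only, without the idf weighting. TfidfTransformer still normalizes each row, and its default is norm='l2', which scales each document's count vector to unit Euclidean length. If you want each count divided by the document's total term count, so that every row sums to 1, pass norm='l1' as well.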
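To make the effect of the norm parameter concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the variable names (docs, counts, tf_l1, tf_l2, manual_tf) are made up for illustration and are not part of the original answer; it assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog barked"]

# Sparse matrix of integer term counts, one row per document.
counts = CountVectorizer().fit_transform(docs)

# norm="l1" divides each row by the sum of its counts, so each row sums to 1.
tf_l1 = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts)

# The default norm="l2" instead scales each row to unit Euclidean length.
tf_l2 = TfidfTransformer(use_idf=False).fit_transform(counts)

# Manual check: counts divided by the number of counted tokens per document.
dense = counts.toarray()
manual_tf = dense / dense.sum(axis=1, keepdims=True)

print(np.allclose(tf_l1.toarray(), manual_tf))
print(np.allclose(np.linalg.norm(tf_l2.toarray(), axis=1), 1.0))
```

Keep in mind that "document length" here means the number of tokens CountVectorizer actually counted. Anything dropped by its tokenizer, stop-word list, or vocabulary limits is not part of the denominator.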
