How to get word/term frequencies in a corpus using Scikit Learn?

Question

I have a corpus of documents and I'd like to extract the word frequencies in each document. I could use CountVectorizer() to get term counts per document, and I could use TfidfVectorizer() to get term frequency-inverse document frequency, but neither seems to give me term frequencies alone. How do I get term frequencies?

This related question seems to ask my question, but the question and answers there concern term counts, not term frequencies. Maybe I'm the one misunderstanding these terms, but my understanding is that term counts are the integer number of times each term appears in the document, whereas term frequencies are the term counts divided by the document length.
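
To make the distinction concrete, here is a minimal sketch of the two quantities as described above. The one-document example and the whitespace tokenization are assumptions made for illustration only; they are not how Scikit Learn tokenizes text.

```python
from collections import Counter

# Hypothetical document; splitting on whitespace is a simplification.
doc = "the cat sat on the mat"
tokens = doc.split()

# Term counts: integer number of times each term appears.
term_counts = Counter(tokens)

# Term frequencies: each count divided by the document length in tokens.
doc_length = len(tokens)
term_freqs = {term: n / doc_length for term, n in term_counts.items()}
```

Here "the" has a term count of 2 and a term frequency of 2/6, and the frequencies of a document sum to 1.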

Answer 1

TfidfTransformer exists for this purpose. From the docs:

Transform a count matrix to a normalized tf or tf-idf representation

Since it only transforms a count matrix, you need to either apply it to a matrix that is already vectorized or run CountVectorizer first:

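from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# X_train is assumed to be an iterable of raw text documents.
# Skip this step if X_train is already a count matrix.
X_count = CountVectorizer().fit_transform(X_train)
# use_idf=False disables the inverse-document-frequency weighting.
X_tf = TfidfTransformer(use_idf=False).fit_transform(X_count)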

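Note that setting use_idf=False gives you the term frequencies only, without the idf weighting. By default TfidfTransformer still L2-normalizes each row (norm='l2'); if you want each count divided by the total number of counted terms in the document, pass norm='l1' as well.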
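
As a minimal sketch of the norm='l1' variant, the snippet below compares the transformer output with counts divided by row sums computed by hand. The two-document corpus and the variable names are made up for illustration. Keep in mind that the "document length" here is the number of tokens CountVectorizer actually counted, which with default settings excludes single-character tokens.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat on the mat", "the dog barked"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix of integer term counts

# norm="l1" divides each row by the sum of its counts, so every row sums to 1.
tf = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts)

# The same quantity computed by hand from the count matrix.
dense = counts.toarray()
manual = dense / dense.sum(axis=1, keepdims=True)

# Column index of the term "the" in the learned vocabulary.
the_col = vectorizer.vocabulary_["the"]
```

With this corpus, tf.toarray() matches manual, and the entry for "the" in the first document is 2/6. A document with no counted terms would produce a zero row sum in the manual version, so that case needs separate handling.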

Solution 1
1 Accepted 2021-06-08 08:04:42
