
Check the tf-idf scores of sklearn in python

I am following the example here to calculate the TF-IDF values using sklearn.

My code is as follows.

from sklearn.feature_extraction.text import TfidfVectorizer
myvocabulary = ['life', 'learning']
corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}
tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())

I want to calculate the tf-idf values of the two words life and learning for the 3 documents in corpus.

According to the article I am referring to (see the table below), I should get the following values for my example.
[Image: target TF-IDF scores]

However, the values I get from my code are completely different. Please help me find what is wrong in my code and how to fix it.

The main point is that you should not restrict the vocabulary to just two words ('life', 'learning') before constructing the term frequency matrix. If you do that, all other words are ignored, which affects the term frequency counts.
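To make the effect of the restricted vocabulary concrete, here is a minimal sketch that compares how many tokens CountVectorizer counts per document with and without the vocabulary argument. The variable names (docs, pattern, full_lengths, restricted_lengths) are illustrative and not part of the original code; the token pattern is chosen so that 1-character words are kept.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]

# Token pattern that also keeps 1-character words such as "a".
pattern = r"(?u)\b\w+\b"

full = CountVectorizer(token_pattern=pattern)
restricted = CountVectorizer(token_pattern=pattern, vocabulary=["life", "learning"])

# Row sums give the number of tokens that were counted in each document.
full_lengths = np.asarray(full.fit_transform(docs).sum(axis=1)).ravel().tolist()
restricted_lengths = np.asarray(restricted.fit_transform(docs).sum(axis=1)).ravel().tolist()

print(full_lengths)        # tokens counted with the full vocabulary
print(restricted_lengths)  # tokens counted when only 'life' and 'learning' are kept
```

With the restricted vocabulary, the true document lengths are no longer available, so any normalization that depends on them cannot be reproduced.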

Several other steps also need to be taken into account if one wants to get exactly the same numbers as in the example by using sklearn:

  1. The features in the example are unigrams (single words), so I have set ngram_range=(1,1).

  2. The example uses a different normalization than sklearn for the term frequency part (the term counts are normalized by document length in the example, whereas sklearn uses raw term counts by default). Because of this, I have counted and normalized the term frequencies separately before calculating the idf part.

  3. The normalization used for the idf part in the example is also not the default for sklearn. It can be adjusted to match the example by setting smooth_idf=False.

  4. By default, sklearn's vectorizers discard words with just one character, but such words are kept in the example. In the code below, I have modified token_pattern so that 1-character words are also allowed.
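Points 3 and 4 can be checked with a short sketch. It assumes the idf definitions given in the scikit-learn documentation: 1 + ln(n / df) when smooth_idf=False and 1 + ln((1 + n) / (1 + df)) when smooth_idf=True. The helper fitted_idf and the other names are illustrative only.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The game of life is a game of everlasting learning",
    "The unexamined life is not worth living",
    "Never stop learning",
]
pattern = r"(?u)\b\w+\b"  # also keeps 1-character words
n_docs = len(docs)

def fitted_idf(smooth):
    """Fit a vectorizer and return a {term: idf} mapping (illustrative helper)."""
    vec = TfidfVectorizer(token_pattern=pattern, smooth_idf=smooth)
    vec.fit(docs)
    # vocabulary_ maps each term to its column index in idf_.
    return {term: vec.idf_[i] for term, i in vec.vocabulary_.items()}

plain_idf = fitted_idf(False)
smoothed_idf = fitted_idf(True)

df_life = 2  # 'life' occurs in 2 of the 3 documents
print(plain_idf["life"], 1 + math.log(n_docs / df_life))
print(smoothed_idf["life"], 1 + math.log((1 + n_docs) / (1 + df_life)))

# With the default token_pattern, the 1-character word "a" is dropped.
default_vocab = TfidfVectorizer().fit(docs).vocabulary_
print("a" in default_vocab, "a" in plain_idf)
```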

The final tfidf matrix is obtained by multiplying the normalized counts by the idf vector.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd

corpus = {1: "The game of life is a game of everlasting learning", 2: "The unexamined life is not worth living", 3: "Never stop learning"}

# Term frequencies: unigram counts divided by document length (L1 norm per row).
# The token pattern also keeps 1-character words such as "a".
cvect = CountVectorizer(ngram_range=(1,1), token_pattern=r'(?u)\b\w+\b')
counts = cvect.fit_transform(corpus.values())
normalized_counts = normalize(counts, norm='l1', axis=1)

# TfidfVectorizer is fitted here only to obtain the unsmoothed idf vector (tfidf.idf_).
tfidf = TfidfVectorizer(ngram_range=(1,1), token_pattern=r'(?u)\b\w+\b', smooth_idf=False)
tfs = tfidf.fit_transform(corpus.values())
new_tfs = normalized_counts.multiply(tfidf.idf_)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later.
feature_names = tfidf.get_feature_names_out()
corpus_index = [n for n in corpus]
df = pd.DataFrame(new_tfs.T.toarray(), index=feature_names, columns=corpus_index)

print(df.loc[['life', 'learning']])
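As a cross-check that does not depend on sklearn, the same quantity can be computed by hand. This is a minimal sketch, assuming tf = count / document length and idf = 1 + ln(N / df), which is what the steps above amount to. Lowercasing and splitting on whitespace is a simplification that is adequate only because this corpus has no punctuation; the function name tf_idf is illustrative.

```python
import math

docs = {
    1: "The game of life is a game of everlasting learning",
    2: "The unexamined life is not worth living",
    3: "Never stop learning",
}

# Simplified tokenization: lowercase and split on whitespace.
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
n_docs = len(tokens)

def tf_idf(term, doc_id):
    """Length-normalized term frequency times unsmoothed idf.

    Assumes the term occurs in at least one document (df > 0).
    """
    words = tokens[doc_id]
    tf = words.count(term) / len(words)
    df = sum(term in doc_words for doc_words in tokens.values())
    idf = 1 + math.log(n_docs / df)
    return tf * idf

for term in ("life", "learning"):
    print(term, [round(tf_idf(term, doc_id), 6) for doc_id in docs])
```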

However, in practice such modifications are rarely needed. One usually obtains good results just by using TfidfVectorizer directly.
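For comparison, a minimal sketch of the direct approach with default settings might look like this. The resulting numbers will not match the example's table, because the defaults use smoothed idf, drop 1-character words, and L2-normalize each document row. The names vectorizer, weights, and scores_for are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = {
    1: "The game of life is a game of everlasting learning",
    2: "The unexamined life is not worth living",
    3: "Never stop learning",
}

# Default settings: full vocabulary, smoothed idf, L2-normalized rows.
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(corpus.values())

def scores_for(term):
    """Return the tf-idf weight of `term` in each document (illustrative helper)."""
    col = vectorizer.vocabulary_[term]  # column index of the term
    return weights[:, col].toarray().ravel()

life_scores = scores_for("life")
learning_scores = scores_for("learning")
print(dict(zip(corpus, life_scores.round(3))))
print(dict(zip(corpus, learning_scores.round(3))))
```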
