
What is the math behind TfidfVectorizer?

I am trying to understand the math behind TfidfVectorizer. I followed this tutorial, but my code is slightly modified:

The tutorial also notes at the end that the values differ slightly because sklearn uses a smoothed version of idf and various other little optimizations.
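To make that remark concrete, here is a small sketch (my own illustration, not from the tutorial) of how scikit-learn's idf formula differs from the textbook one. With `smooth_idf=True` (the default), sklearn adds 1 to both the document count and the document frequency, and also adds 1 to the result; with `smooth_idf=False` it still adds the trailing `+ 1`:

```python
import numpy as np

N = 2        # number of documents in the toy corpus below
df_the = 2   # the word 'the' appears in both documents

# Textbook idf: ln(N / df) -- zero for a word present in every document
textbook_idf = np.log(N / df_the)

# sklearn, smooth_idf=True:  ln((1 + N) / (1 + df)) + 1
sklearn_smooth_idf = np.log((1 + N) / (1 + df_the)) + 1

# sklearn, smooth_idf=False: ln(N / df) + 1
sklearn_plain_idf = np.log(N / df_the) + 1

print(textbook_idf, sklearn_smooth_idf, sklearn_plain_idf)
```

Note that even with smoothing disabled, the trailing `+ 1` means a word occurring in every document gets idf 1 rather than 0, so its tf-idf weight never vanishes.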

I want to be able to use TfidfVectorizer, but also reproduce the same simple example by hand.

Here is my whole code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

def main():
    documentA = 'the man went out for a walk'
    documentB = 'the children sat around the fire'
    corpus = [documentA, documentB]
    bagOfWordsA = documentA.split(' ')
    bagOfWordsB = documentB.split(' ')

    uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

    print('----------- compare word count -------------------')
    numOfWordsA = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsA:
        numOfWordsA[word] += 1
    numOfWordsB = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsB:
        numOfWordsB[word] += 1

    tfA = computeTF(numOfWordsA, bagOfWordsA)
    tfB = computeTF(numOfWordsB, bagOfWordsB)
    print(pd.DataFrame([tfA, tfB]))

    CV = CountVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')
    cv_ft = CV.fit_transform(corpus)

    tt = TfidfTransformer(use_idf=False, norm='l1')
    t = tt.fit_transform(cv_ft)
    print(pd.DataFrame(t.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- compare idf -------------------')
    idfs = computeIDF([numOfWordsA, numOfWordsB])
    print(pd.DataFrame([idfs]))

    tfidfA = computeTFIDF(tfA, idfs)
    tfidfB = computeTFIDF(tfB, idfs)
    print(pd.DataFrame([tfidfA, tfidfB]))

    ttf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
    f = ttf.fit_transform(cv_ft)
    print(pd.DataFrame(f.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- TfidfVectorizer -------------------')
    vectorizer = TfidfVectorizer(smooth_idf=False, use_idf=True, stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
    vectors = vectorizer.fit_transform([documentA, documentB])
    feature_names = vectorizer.get_feature_names()
    print(pd.DataFrame(vectors.todense().tolist(), columns=feature_names))


def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict


def computeIDF(documents):
    import math
    N = len(documents)

    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1

    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict


def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf


if __name__ == "__main__":
    main()

I can compare the calculation of term frequency; both results look the same. But when I calculate the IDF and then the TF-IDF, there are differences between the code from the tutorial and TfidfVectorizer (I also tried the combination of CountVectorizer and TfidfTransformer to make sure it returns the same results as TfidfVectorizer).
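As a sketch of where the gap comes from (my own illustration, using the word 'the' from the two documents above): with `smooth_idf=False` and `norm=None`, sklearn's TfidfTransformer multiplies the raw term count by `ln(N / df) + 1`, whereas the tutorial multiplies the normalized term frequency by `ln(N / df)` without the `+ 1`:

```python
import numpy as np

counts_the = np.array([1, 2])   # 'the' appears once in documentA, twice in documentB
doc_lengths = np.array([7, 6])  # word counts of documentA and documentB
N, df_the = 2, 2                # 2 documents, 'the' occurs in both

# sklearn style (smooth_idf=False, norm=None): raw count * (ln(N/df) + 1)
sklearn_style = counts_the * (np.log(N / df_the) + 1)

# tutorial style: (count / doc length) * ln(N/df)
tutorial_style = (counts_the / doc_lengths) * np.log(N / df_the)

print(sklearn_style, tutorial_style)
```

So for 'the', sklearn produces nonzero weights (1 and 2) while the tutorial produces exactly 0 in both documents, which matches the discrepancy in the tables below.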

Tf-Idf results from my code:

[image: Tf-Idf values computed by the code above]

TfidfVectorizer Tf-Idf results:

[image: Tf-Idf values computed by TfidfVectorizer]

Can anybody help me with code that returns the same results as TfidfVectorizer, or with settings for TfidfVectorizer that would reproduce the results of the code above?

Here is my adaptation of your code to reproduce the TfidfVectorizer output for your data.


import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from IPython.display import display

documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'
corpus = [documentA, documentB]
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')

uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

print('----------- compare word count -------------------')
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1

series_A = pd.Series(numOfWordsA)
series_B = pd.Series(numOfWordsB)
df = pd.concat([series_A, series_B], axis=1).T
df = df.reindex(sorted(df.columns), axis=1)
display(df)

tf_df = df.divide(df.sum(1), axis='index')

# Smoothed idf, as scikit-learn computes it by default:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_d = 1 + tf_df.shape[0]
df_d_t = 1 + (tf_df.values > 0).sum(0)
idf = np.log(n_d / df_d_t) + 1

# Note: the raw counts (df), not the normalized tf_df, are multiplied by idf
pd.DataFrame(df.values * idf,
             columns=df.columns)

[image: hand-computed Tf-Idf values]

tfidf = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
pd.DataFrame(tfidf.fit_transform(corpus).todense(),
                  columns=tfidf.get_feature_names() )

[image: TfidfVectorizer output]

For more details on the implementation, refer to the scikit-learn documentation on tf-idf term weighting.
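One last detail the snippets above sidestep by passing `norm=None`: TfidfVectorizer defaults to `norm='l2'`, so to reproduce its fully default output you would additionally divide each document's row of count-times-idf values by its Euclidean length. A minimal sketch with a made-up row:

```python
import numpy as np

# Hypothetical unnormalized count * idf row for one document
raw = np.array([3.0, 4.0])

# L2 normalization, as TfidfVectorizer applies with its default norm='l2':
# divide the row by its Euclidean length, so the row has unit L2 norm
l2 = raw / np.sqrt((raw ** 2).sum())

print(l2)
```

After this step the squared entries of each row sum to 1, which is what makes cosine similarity between two such rows reduce to a plain dot product.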
