Is there a way to find the weight of a sentence using TF-IDF in Python?

Question

I have a list:

x=["hello there","hello world","my name is john"]

I have already vectorized it using TF-IDF.

Here is the code and the TF-IDF output:

  from sklearn.feature_extraction.text import TfidfVectorizer

  corpus = ["hello there", "hello world", "my name is john"]
  vectorizer = TfidfVectorizer()

  X = vectorizer.fit_transform(corpus)

  X.toarray()



array([[0.60534851, 0.        , 0.        , 0.        , 0.        ,
      0.79596054, 0.        ],
     [0.60534851, 0.        , 0.        , 0.        , 0.        ,
      0.        , 0.79596054],
     [0.        , 0.5       , 0.5       , 0.5       , 0.5       ,
      0.        , 0.        ]])

Can we find the weight of each sentence (compared to all the documents)?

If so, how?

Answer 1

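I believe that with TF-IDF you can only calculate the weight of individual words within a sentence (or a document, for that matter), which means you cannot use it directly to calculate the weight of a sentence relative to other sentences or documents.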
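To make the per-word weighting concrete, here is a minimal pure-Python sketch that should reproduce the matrix from the question. It assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses plain whitespace splitting instead of the library's tokenizer, which happens to yield the same tokens for this corpus. The helper name `tfidf_rows` is made up for illustration.

```python
import math
from collections import Counter

corpus = ["hello there", "hello world", "my name is john"]


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row maps word -> TF-IDF weight.

    Assumption: smoothed IDF and L2 row normalization, as in
    scikit-learn's TfidfVectorizer defaults.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for words in tokenized for w in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

    rows = []
    for words in tokenized:
        tf = Counter(words)
        raw = {w: tf[w] * idf[w] for w in tf}
        # L2-normalize the row so that its weights have unit length.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vocab, rows


vocab, rows = tfidf_rows(corpus)
for sentence, row in zip(corpus, rows):
    print(sentence, {w: round(v, 8) for w, v in row.items()})
```

Each row holds one weight per word in that sentence, and nothing in the result is a single score for the sentence as a whole.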

However, from this page I learned how TF-IDF works. You can "abuse" the functions they provide by adapting them to what you specifically need. Allow me to demonstrate:

import math

corpus = ["hello there", "hello world"]

# Read the document in which the sentences will be counted.
with open("your_document.txt", "r") as file:
    text = file.read()

def computeTF(sentences, document):
    tfDict = {s: 0 for s in sentences}
    # Number of whitespace-separated words in the document.
    filelen = len(document.split())

    for s in sentences:
        #   Since we're counting a whole sentence (containing >= 1 words) we need to count
        #   that whole sentence as a single word.
        sLength = len(s.split())
        #   Note: str.count matches substrings, so it also counts matches inside longer words.
        tfDict[s] = document.count(s)
        #   Once you know the number of occurrences of the specific sentence s in the
        #   document, you can recalculate the number of words in that document
        #   (considering s as a single word).
        filelen = filelen - tfDict[s] * (sLength - 1)

    for s in sentences:
        #   The final word count is only known after the previous loop, so we need a
        #   separate loop to calculate the actual weight of each sentence.
        tfDict[s] = tfDict[s] / filelen if filelen > 0 else 0

    return tfDict

def computeIDF(tfDict, sentences):
    idfDict = {}
    #   With only one document there is no real document frequency. As a
    #   simplification, N is the number of sentences we are looking for.
    N = len(tfDict)

    for s in sentences:
        if tfDict[s] > 0:
            idfDict[s] = math.log10(N)
        else:
            idfDict[s] = 0

    return idfDict

tfDict = computeTF(corpus, text)
idfDict = computeIDF(tfDict, corpus)

for s in corpus:
    #   TF-IDF is the product of the two values.
    print("Sentence: {}, TF: {}, IDF: {}, TF-IDF: {}".format(
        s, tfDict[s], idfDict[s], tfDict[s] * idfDict[s]))

This code example only looks at a single text file, but you can easily extend it to look at multiple text files.
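One possible way to do that is sketched below. It assumes the documents are already loaded as strings (the `documents` list is made-up sample data standing in for file contents) and uses the common `log10(N / df)` form of IDF, where `df` is the number of documents that contain the sentence. The function names `sentence_tf` and `sentence_tfidf` are hypothetical, and each sentence is counted independently of the others, which is a simplification.

```python
import math

# Made-up sample documents standing in for the contents of several text files.
documents = [
    "hello there my friend hello there",
    "hello world this is a test",
    "nothing relevant in this one",
]
sentences = ["hello there", "hello world"]


def sentence_tf(sentence, document):
    """Frequency of `sentence` in `document`, counting the sentence as one word."""
    # str.count matches substrings, so this is only a rough count.
    count = document.count(sentence)
    # Each occurrence collapses len(sentence.split()) words into a single "word".
    n_words = len(document.split()) - count * (len(sentence.split()) - 1)
    return count / n_words if n_words > 0 else 0.0


def sentence_tfidf(sentences, documents):
    """Return {sentence: [TF-IDF score for each document]}."""
    n_docs = len(documents)
    result = {}
    for s in sentences:
        tfs = [sentence_tf(s, doc) for doc in documents]
        # Document frequency: how many documents contain the sentence at all.
        df = sum(1 for tf in tfs if tf > 0)
        # A sentence that appears nowhere gets an IDF of 0 to avoid dividing by zero.
        idf = math.log10(n_docs / df) if df else 0.0
        result[s] = [tf * idf for tf in tfs]
    return result


scores = sentence_tfidf(sentences, documents)
for s, per_doc in scores.items():
    print(s, [round(v, 4) for v in per_doc])
```

With real files, you would read each file into a string and pass the resulting list as `documents`.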

Solution 1
1 2019-12-10 13:16:11
