Need help getting the desired output from a tf-idf function in Python

Question

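I wrote a function that basically computes the inverse document frequency, i.e. log base 10 of (total number of documents / number of documents containing a particular word).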
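
As a quick illustration of that formula, here is a minimal sketch. The helper name `idf` is hypothetical and is not part of the code in this question; the numbers are the ones from the example further down (2 documents, 'a' in both, 'b' in one).

```python
import math

def idf(total_docs, docs_with_term):
    # Hypothetical helper: inverse document frequency, log base 10.
    return math.log10(total_docs / docs_with_term)

print(idf(2, 2))  # term found in both documents -> 0.0
print(idf(2, 1))  # term found in one of two documents -> 0.3010299956639812
```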

My code:

import math
from collections import defaultdict

def tfidf(docs,doc_freqs):
    res = []
    t = sum(isinstance(i, list) for i in docs)
    for key,val in doc_freqs.items():
        res.append(math.log10(t/val))
    pos = defaultdict(lambda:[])
    for docID, lists in enumerate(docs):
        for element in set(lists):
            pos[element].append([docID] + res)
    return pos

My output:

index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812], [1, 0.0, 0.3010299956639812, 0.3010299956639812]]
index['b']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812]]

Desired output:

index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0], [1, 0.0]]
index['b']
[[0, 0.3010299956639812]]

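So basically I only want to show the docIDs of the documents in which the term occurs, each followed by that term's idf value alone. For example, in the case above the term 'a' occurs in both documents, so its idf value is 0.

Can anyone suggest what changes I need to make to my code so that only the idf value for the term looked up at run time is printed?

Please help! Thanks in advance.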

Answer 1

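Wolf,

Right now you are appending the whole of res to [docID], but you only care about the value associated with that element. I suggest changing res to a dict, as in the following code: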

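import math
from collections import defaultdict

def tfidf(docs, doc_freqs):
    # Map each term to its own idf value instead of keeping a flat list.
    res = {}
    t = sum(isinstance(i, list) for i in docs)  # total number of documents
    for key, val in doc_freqs.items():
        res[key] = math.log10(t / val)
    pos = defaultdict(list)
    for docID, lists in enumerate(docs):
        for element in set(lists):
            # Append only the idf of this term, not the whole table.
            pos[element].append([docID, res[element]])
    return pos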

docs = [['a', 'b', 'a'], ['a']]
doc_freqs = {'a': 2., 'b': 1., 'c': 1.}

index = tfidf(docs, doc_freqs)

This is your output:

index['a']
[[0, 0.0], [1, 0.0]]

index['b']
[[0, 0.3010299956639812]]
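
If you would rather not pass doc_freqs in by hand, a variant along these lines may also work. This is only a sketch under an assumption: that the document frequency of a term is simply the number of lists in docs that contain it. The function name `idf_index` is made up for this illustration.

```python
import math
from collections import Counter, defaultdict

def idf_index(docs):
    # Assumption: document frequency = number of documents containing the term.
    doc_freqs = Counter(term for doc in docs for term in set(doc))
    total = len(docs)
    idf = {term: math.log10(total / df) for term, df in doc_freqs.items()}

    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            # One [docID, idf] pair per document in which the term occurs.
            index[term].append([doc_id, idf[term]])
    return index

index = idf_index([['a', 'b', 'c'], ['a']])
print(index['a'])  # [[0, 0.0], [1, 0.0]]
print(index['b'])  # [[0, 0.3010299956639812]]
```

Deriving the counts from docs means the two inputs cannot disagree, but it is not what the original function signature does, so treat it as an option rather than a drop-in replacement.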

Answer 1 · score 2 · accepted · 2015-02-17 02:37:41