
TF-IDF function in Python: need help getting my desired output

I've written a function that calculates the inverse document frequency: log base 10 of (total number of documents / number of documents that contain a particular word).

My code:

def tfidf(docs,doc_freqs):
    res = []
    t = sum(isinstance(i, list) for i in docs)
    for key,val in doc_freqs.items():
        res.append(math.log10(t/val))
    pos = defaultdict(lambda:[])
    for docID, lists in enumerate(docs):
        for element in set(lists):
            pos[element].append([docID] + res)
    return pos

My output:

index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812], [1, 0.0, 0.3010299956639812, 0.3010299956639812]]
index['b']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812]]

Desired output:

index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0], [1, 0.0]]
index['b']
[[0, 0.3010299956639812]]

So basically, I only want to show each docID in which the term occurs, followed by that term's IDF value alone. For instance, in the above example, term 'a' occurs in both documents, so its IDF value is 0.

Can anyone suggest what modifications I need to make to my code so that it returns only the IDF value corresponding to the term looked up at run time?

Please help! Thanks in advance.

Wolf,

Right now you are appending the entire res list to [docID], but you only need the value associated with that element. I suggest changing res to a dict keyed by term, as in the following code:

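import math
from collections import defaultdict

def tfidf(docs,doc_freqs):
    # Map each term to its idf value instead of collecting all values in a list.
    res = {}
    t = sum(isinstance(i, list) for i in docs)  # total number of documents
    for key,val in doc_freqs.items():
        res[key] = math.log10(t/val)
    pos = defaultdict(list)
    for docID, lists in enumerate(docs):
        # set() ensures each term is recorded once per document.
        for element in set(lists):
            pos[element].append([docID, res[element]])
    return pos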

docs = [['a', 'b', 'a'], ['a']]
doc_freqs = {'a': 2., 'b': 1., 'c': 1.}

index = tfidf(docs, doc_freqs)

This is then your output:

index['a']
[[0, 0.0], [1, 0.0]]

index['b']
[[0, 0.3010299956639812]]
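
As a possible follow-up, here is a minimal sketch of the same idea that derives the document frequencies from docs itself, so they cannot drift out of sync with the documents. The name build_idf_index is hypothetical (it is not from the question), and the sketch assumes Python 3, where / is true division even for integer counts:

```python
import math
from collections import Counter, defaultdict

def build_idf_index(docs):
    """Map each term to [docID, idf] pairs, deriving document frequencies from docs."""
    n_docs = len(docs)
    # For every term, count how many documents contain it at least once.
    doc_freqs = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log10(n_docs / df) for term, df in doc_freqs.items()}
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        # sorted(set(...)) visits each distinct term once, in a stable order.
        for term in sorted(set(doc)):
            index[term].append([doc_id, idf[term]])
    return index

index = build_idf_index([['a', 'b', 'c'], ['a']])
print(index['a'])  # [[0, 0.0], [1, 0.0]]
print(index['b'])  # [[0, 0.3010299956639812]]
```

If you already have doc_freqs computed elsewhere, the dict-based version above is enough; this variant only saves you from passing it in separately.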
