I've written a function that calculates the inverse document frequency: log base 10 of (total number of documents / number of documents that contain a particular word).
My code:
import math
from collections import defaultdict

def tfidf(docs, doc_freqs):
    res = []
    t = sum(isinstance(i, list) for i in docs)
    for key, val in doc_freqs.items():
        res.append(math.log10(t / val))
    pos = defaultdict(lambda: [])
    for docID, lists in enumerate(docs):
        for element in set(lists):
            pos[element].append([docID] + res)
    return pos
My output:
index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812], [1, 0.0, 0.3010299956639812, 0.3010299956639812]]
index['b']
[[0, 0.0, 0.3010299956639812, 0.3010299956639812]]
Desired output:
index = tfidf([['a', 'b', 'c'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0], [1, 0.0]]
index['b']
[[0, 0.3010299956639812]]
So basically I only want to display the docID of each document in which the term occurs, followed by that term's idf value alone. For instance, in the example above, term 'a' occurs in both documents, so its idf value is 0.
Can anyone suggest what modifications I need to make to my code so that it returns only the idf value corresponding to the term specified at run time?
Please help! Thanks in advance.
Wolf,
Right now you are appending the entirety of res to [docID], but you only care about the value associated with that element. I suggest changing res to a dict, as in the following code:
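import math
from collections import defaultdict

def tfidf(docs, doc_freqs):
    # Map each term to its own idf value instead of keeping a flat list
    res = {}
    # Total number of documents (each document is a list of terms)
    t = sum(isinstance(i, list) for i in docs)
    for key, val in doc_freqs.items():
        res[key] = math.log10(t / val)
    pos = defaultdict(lambda: [])
    for docID, lists in enumerate(docs):
        for element in set(lists):
            # Store only the idf of this particular term next to the docID
            pos[element].append([docID, res[element]])
    return pos

docs = [['a', 'b', 'a'], ['a']]
doc_freqs = {'a': 2., 'b': 1., 'c': 1.}
index = tfidf(docs, doc_freqs)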
This is then your output:
index['a']
[[0, 0.0], [1, 0.0]]
index['b']
[[0, 0.3010299956639812]]
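If you would rather not pass doc_freqs in by hand, here is a minimal sketch that derives the document frequencies from docs itself and then builds the same index. The names document_frequencies and idf_index are my own, not part of your code, and the sketch assumes every document is a non-empty list of terms.

```python
import math
from collections import defaultdict


def document_frequencies(docs):
    """Count how many documents contain each term (hypothetical helper)."""
    freqs = defaultdict(int)
    for doc in docs:
        for term in set(doc):  # set() so a repeated term counts once per document
            freqs[term] += 1
    return dict(freqs)


def idf_index(docs):
    """Map each term to [docID, idf] pairs, one per document that contains it."""
    doc_freqs = document_frequencies(docs)
    total = len(docs)
    # idf(term) = log10(total documents / documents containing the term)
    idf = {term: math.log10(total / df) for term, df in doc_freqs.items()}
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc):
            index[term].append([doc_id, idf[term]])
    return index


index = idf_index([['a', 'b', 'c'], ['a']])
print(index['a'])  # [[0, 0.0], [1, 0.0]]
print(index['b'])  # [[0, 0.3010299956639812]]
```

Here 'a' appears in 2 of 2 documents, so its idf is log10(2/2) = 0, while 'b' appears in 1 of 2, giving log10(2/1), which is about 0.301.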