
TF-IDF calculation in Python

I'm new to Python and want to write a function that calculates term frequency-inverse document frequency (tf-idf) from two parameters.

Parameters:
docs: a list of lists, where each sublist contains the tokens for one document.
doc_freqs: a dict mapping each term to its document frequency (the number of documents in which the term appears).

Desired Output:

index = create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
index['a']
[[0, 0.0], [1, 0.0]]
index['b']  
[[0, 0.301...]]

My code to compute doc_freqs (the second parameter of the tf-idf function):

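def count_doc_frequencies(docs):
    """Return a dict mapping each term to the number of documents containing it."""
    doc_freqs = {}
    for tokens in docs:
        # set() ensures a term is counted at most once per document
        for term in set(tokens):
            doc_freqs[term] = doc_freqs.get(term, 0) + 1
    return doc_freqs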

res = count_doc_frequencies([['a', 'b', 'a'], ['a', 'b', 'c'], ['a']])
res['a']
3

Can anyone help me calculate tf-idf from the two parameters described above, so that it produces the output shown?

Please help guys!!!

I would do this with scikit-learn unless you have to write the function yourself for the exam.

Here is a decent tutorial.

The official documentation on this is pretty good too. It demonstrates tokenization and the actual tf-idf calculation.
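If the function does have to be written by hand, here is a minimal sketch that reproduces the desired output from the question. The weighting is an assumption, since the question does not state a formula: each weight is the raw term count in the document multiplied by log10(N / df), where N is the number of documents. A log-scaled variant, (1 + log10(tf)) * log10(N / df), would give the same numbers on this example, so check which one your assignment expects. Note that scikit-learn's TfidfVectorizer defaults (smoothed idf with natural log, L2 normalization) will not produce these exact values.

```python
import math
from collections import Counter, defaultdict


def create_tfidf_index(docs, doc_freqs):
    """Map each term to a list of [doc_id, tf-idf weight] pairs.

    Assumed weighting (not stated in the question):
    raw term count * log10(N / df), with N = number of documents.
    """
    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # Counter gives the raw term frequency of each token in this document.
        for term, tf in sorted(Counter(tokens).items()):
            idf = math.log10(n_docs / doc_freqs[term])
            index[term].append([doc_id, tf * idf])
    return dict(index)


index = create_tfidf_index([['a', 'b', 'a'], ['a']], {'a': 2., 'b': 1., 'c': 1.})
print(index['a'])  # [[0, 0.0], [1, 0.0]]
print(index['b'])  # one pair for document 0, weight roughly 0.301
```

Terms that appear in doc_freqs but in none of the documents (such as 'c' here) simply get no entry in the index.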

Hope this helps some.

