I have a list:
x=["hello there","hello world","my name is john"]
I have vectorized it with TF-IDF.
This is the code and the TF-IDF output:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["hello there", "hello world", "my name is john"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X.toarray()
array([[0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.79596054, 0.        ],
       [0.60534851, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.79596054],
       [0.        , 0.5       , 0.5       , 0.5       , 0.5       ,
        0.        , 0.        ]])
Can we find the weight of every sentence (compared with all documents)?
If yes, then how?
I believe that with TF-idf you can only calculate the weight of single words in a sentence (or document, for that matter). That means you cannot use it directly to calculate the weight of a sentence within other sentences or documents.
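To illustrate that the weights are per word, here is a minimal pure-Python sketch that recomputes the matrix from the question. It assumes scikit-learn's documented default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and uses plain whitespace splitting instead of the library's own tokenizer, so it is an approximation for this small corpus rather than a replacement for TfidfVectorizer. The helper name tfidf_row is made up for this example.

```python
import math
from collections import Counter

corpus = ["hello there", "hello world", "my name is john"]
docs = [doc.split() for doc in corpus]  # assumption: whitespace tokenization
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
df = Counter(word for doc in docs for word in set(doc))

def tfidf_row(doc):
    """Return {word: weight} for one tokenized document (hypothetical helper)."""
    counts = Counter(doc)
    # Raw term count times the smoothed IDF.
    raw = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1)
           for w, c in counts.items()}
    # L2-normalize so that each row has unit length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for sentence, row in zip(corpus, rows):
    print(sentence, {w: round(v, 8) for w, v in row.items()})
```

Every number in the result belongs to one word in one sentence. "hello" appears in two of the three sentences, so it gets a lower weight than "there" or "world", and nothing in the matrix is a weight for a sentence as a whole.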
However, from this page I learned how TF-idf works. You can "abuse" the functions given there by adapting them to what you specifically need. Allow me to demonstrate:
import math

corpus = ["hello there", "hello world"]

# Read the document that the sentences are scored against.
with open("your_document.txt", "r") as file:
    text = file.read()

def computeTF(sentences, document):
    tf = {s: 0 for s in sentences}
    # Number of whitespace-separated words in the document.
    filelen = len(document.split())
    for s in sentences:
        # Since we're counting a whole sentence (containing >= 1 words), we need to
        # count that whole sentence as a single word.
        sLength = len(s.split())
        tf[s] = document.count(s)
        # Once the number of occurrences of s in the document is known, recalculate
        # the number of words in the document (treating s as a single word).
        filelen = filelen - tf[s] * (sLength - 1)
    for s in sentences:
        # The adjusted document length is only known after the previous loop, so the
        # actual weights are calculated in a separate loop.
        tf[s] = tf[s] / filelen if filelen else 0
    return tf

def computeIDF(tf, sentences):
    idfDict = {s: tf[s] for s in sentences}
    # N is the number of candidate sentences; with a single document, every
    # sentence that occurs at least once gets the same value, log10(N).
    N = len(tf)
    for s in sentences:
        if idfDict[s] > 0:
            idfDict[s] = math.log10(N)
        else:
            idfDict[s] = 0
    return idfDict

tf = computeTF(corpus, text)
idfDict = computeIDF(tf, corpus)
for s in corpus:
    # TF-idf is the product of the two values computed above.
    print("Sentence: {}, TF: {}, IDF: {}, TF-idf: {}".format(
        s, tf[s], idfDict[s], tf[s] * idfDict[s]))
This code example only looks at a single text file, but you can easily extend it to look at several text files.
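As a rough sketch of that extension, the version below scores each sentence against several documents. It is not part of the original answer's code: the function names sentence_tf and sentence_idf, the in-memory documents list (standing in for the contents of several text files), and the choice of IDF as log10(number of documents / number of documents containing the sentence) are all assumptions made for illustration. Unlike the loop above, each sentence is also adjusted for independently when counting tokens.

```python
import math

def sentence_tf(sentence, document):
    """Occurrences of `sentence` divided by the document length,
    counting each occurrence of the sentence as a single token."""
    # Note: str.count matches substrings, so "hello there" also matches
    # inside "hello therefore".
    occurrences = document.count(sentence)
    n_words = len(document.split())
    n_tokens = n_words - occurrences * (len(sentence.split()) - 1)
    return occurrences / n_tokens if n_tokens else 0.0

def sentence_idf(sentence, documents):
    """log10(N / number of documents containing the sentence), 0 if none do."""
    containing = sum(1 for d in documents if sentence in d)
    return math.log10(len(documents) / containing) if containing else 0.0

# Assumption: these strings stand in for the contents of several text files.
documents = [
    "hello there my name is john",
    "hello world",
    "hello there hello world",
]
sentences = ["hello there", "hello world"]

scores = {}
for s in sentences:
    idf = sentence_idf(s, documents)
    # One TF-idf value per document for this sentence.
    scores[s] = [sentence_tf(s, d) * idf for d in documents]
    print("Sentence: {}, IDF: {}, TF-idf per document: {}".format(s, idf, scores[s]))
```

With more than one document the IDF term becomes meaningful: a sentence that occurs in every document gets an IDF of 0, while one that occurs in only a few documents is weighted more heavily.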