
Finding tf-idf values in an announcement table

I want to analyze an announcement, so I have to calculate its 'tf' and 'idf' values. However, the values I get do not look realistic. Is there a problem with the code?

The "stemming" column holds the announcements. The first announcement is 'kurs kayıt tarih progra giriş çıkış saat'

tf1 = (train['stemming'][0:1]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()  #Term frequency
tf1.columns = ['words','tf']

for i,word in enumerate(tf1['words']):    #Inverse Document Frequency
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['stemming'].str.contains(word)])))

tf1['tf-idf'] = tf1['tf'] * tf1['idf'] # 3.4 Term Frequency – Inverse Document Frequency (TF-IDF)

For the first word (kurs), the tf value should be 1/7 according to TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). However, that is not the result I get.

The problem is that when you compute tf you are only counting the occurrences of each word. You need to divide that count by the total number of terms in the document (7 for the first announcement, where every word appears once).
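As a rough sketch of that fix, the snippet below normalizes the counts by the length of the first announcement. The `train` DataFrame here is a hypothetical stand-in: only its first row comes from the question, and the other two rows are made up so that idf has something to work with. It also matches whole tokens when counting documents, rather than using `str.contains` as in the question, since `str.contains` does substring (and by default regex) matching; treat that swap as an assumption about what is wanted, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `train` DataFrame. Only the first
# row is taken from the question; the other rows are invented for illustration.
train = pd.DataFrame({
    "stemming": [
        "kurs kayıt tarih progra giriş çıkış saat",
        "kurs kayıt duyuru",
        "sınav tarih saat",
    ]
})

# Term frequency for the first announcement:
# (occurrences of the term) / (total number of terms in the document).
tokens = train["stemming"].iloc[0].split()
counts = pd.Series(tokens).value_counts()
tf1 = (counts / len(tokens)).rename_axis("words").reset_index(name="tf")

# Inverse document frequency: log(N / number of documents containing the term).
# Assumption: whole-token matching replaces the question's str.contains call.
doc_tokens = train["stemming"].str.split().apply(set)
n_docs = len(train)
tf1["idf"] = tf1["words"].apply(
    lambda w: np.log(n_docs / sum(w in toks for toks in doc_tokens))
)

tf1["tf-idf"] = tf1["tf"] * tf1["idf"]
print(tf1)
```

With this change every word in the first announcement gets tf = 1/7, which matches the formula quoted in the question, and the tf values of one document sum to 1.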
