
Finding tf-idf values in an announcement table

I want to analyze an announcement, so I have to calculate the 'tf' and 'idf' values. But I think the values I get are not realistic. Is there a problem with the code?

The 'stemming' column holds the announcements. The first announcement is 'kurs kayıt tarih progra giriş çıkış saat'.

import numpy as np
import pandas as pd

# Term frequency: raw word counts for the first announcement
tf1 = (train['stemming'][0:1]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0).reset_index()
tf1.columns = ['words', 'tf']

# Inverse document frequency
for i, word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(train.shape[0] / len(train[train['stemming'].str.contains(word)]))

# Term frequency - inverse document frequency (TF-IDF)
tf1['tf-idf'] = tf1['tf'] * tf1['idf']

For the first word (kurs), the tf value should be 1/7 according to TF(t) = (number of times term t appears in a document) / (total number of terms in the document). But that is not the result I get.

The problem is that when you compute the tf, you are only counting the occurrences of each word. You need to divide that count by the total number of terms in the document (7 for the first announcement, where every word happens to be different).
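
As a minimal sketch of that fix, the snippet below computes the normalized tf for the first announcement and a log-scaled idf over a tiny corpus, using only the standard library. The function names `term_frequencies` and `inverse_document_frequencies` and the two extra sample announcements are hypothetical, made up for illustration. In the question's pandas code, the equivalent change would presumably be to divide `tf1['tf']` by `tf1['tf'].sum()` before multiplying by the idf.

```python
import math
from collections import Counter


def term_frequencies(document):
    """Return {term: count / total number of terms} for one document."""
    tokens = document.split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_document_frequencies(terms, corpus):
    """Return {term: log(N / number of documents containing the term)}.

    Assumes every term occurs in at least one document of the corpus.
    """
    # Compare whole tokens, so "kurs" does not match a longer word
    # that merely contains it as a substring.
    tokenized = [set(doc.split()) for doc in corpus]
    idf = {}
    for term in terms:
        doc_count = sum(term in tokens for tokens in tokenized)
        idf[term] = math.log(len(corpus) / doc_count)
    return idf


# First announcement from the question, plus two made-up ones (assumption).
corpus = [
    "kurs kayıt tarih progra giriş çıkış saat",
    "kurs kayıt saat",
    "sınav tarih",
]

tf = term_frequencies(corpus[0])
idf = inverse_document_frequencies(tf, corpus)
tf_idf = {term: tf[term] * idf[term] for term in tf}

print(tf["kurs"])  # 1/7, since "kurs" is one of seven terms
```

Note that the sketch compares whole tokens, whereas `str.contains` in the question matches substrings (and treats the word as a regular expression by default), which can inflate the document counts used for the idf.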
