Finding tf-idf values in an announcement table

Question

I want to analyze announcements, so I have to calculate the 'tf' and 'idf' values. But I think the values I get are not realistic. Is there a problem with the code?

The "stemming" column holds the announcements. The first announcement is 'kurs kayıt tarih progra giriş çıkış saat'.

tf1 = (train['stemming'][0:1]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()  #Term frequency
tf1.columns = ['words','tf']

for i,word in enumerate(tf1['words']):    #Inverse Document Frequency
  tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['stemming'].str.contains(word)])))

tf1['tf-idf'] = tf1['tf'] * tf1['idf']  # Term Frequency - Inverse Document Frequency (TF-IDF)

For the first word (kurs), the tf value must be 1/7 according to TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document). But that is not the result I get.

Answer 1

The problem is that when you compute the tf you are only counting the occurrences of each word. You need to divide that count by the total number of words in the document. For the first announcement that is 7 (every word occurs once, so it equals the number of different words), which gives 1/7 for kurs.
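
A minimal sketch of that fix, under stated assumptions: the `train` DataFrame below is a hypothetical stand-in (only its first row comes from the question, the other two rows are made up), and `tf_idf_for_document` is an illustrative helper name, not part of pandas. It also counts document frequency on whole tokens, which is a deliberate change from the question's `str.contains(word)`, since that call also matches a word inside longer words.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `train` DataFrame.
# Only the first row is taken from the question; the rest is made up.
train = pd.DataFrame({
    "stemming": [
        "kurs kayıt tarih progra giriş çıkış saat",
        "kurs kayıt saat",
        "sınav tarih saat",
    ]
})


def tf_idf_for_document(docs: pd.Series, doc_index: int) -> pd.DataFrame:
    """Return words, tf, idf and tf-idf for a single document in `docs`."""
    tokens = docs.iloc[doc_index].split()

    # Term frequency: occurrences of each word divided by the
    # total number of words in this document.
    counts = pd.Series(tokens).value_counts()
    tf = counts / len(tokens)

    # Document frequency on whole tokens (assumption: an exact word match
    # is wanted, rather than the substring match of str.contains).
    token_sets = docs.apply(lambda text: set(text.split()))
    n_docs = len(docs)
    idf = pd.Series({
        word: np.log(n_docs / sum(word in tokens_of_doc for tokens_of_doc in token_sets))
        for word in counts.index
    })

    result = pd.DataFrame({
        "words": counts.index,
        "tf": tf.values,
        "idf": idf[counts.index].values,
    })
    result["tf-idf"] = result["tf"] * result["idf"]
    return result


tf1 = tf_idf_for_document(train["stemming"], 0)
print(tf1)
```

With this normalization the tf values of one document sum to 1, and kurs in the first announcement gets tf = 1/7, as the formula in the question requires.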

Solution 1
0 2019-05-18 12:34:22
