Compute tf-idf with corpus
So, I copied some source code that shows how to build a system that can run tf-idf. Here is the code:
# module imports
from __future__ import division, unicode_literals
import math
import string
import re
import os
# older TextBlob releases exposed this as "from text.blob import TextBlob"
from textblob import TextBlob as tb

# create a new dictionary (not used below)
words = {}

def tf(word, blob):
    # term frequency: share of the document's words that are `word`
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # number of documents containing `word`; test against blob.words so that
    # substrings of longer words are not counted as matches
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # inverse document frequency, with 1 added to the denominator
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

# matches any single punctuation character
regex = re.compile('[%s]' % re.escape(string.punctuation))

with open('D:/article/sport/a.txt', 'r') as f:
    var = f.read()
var = regex.sub(' ', var)
var = var.lower()
document1 = tb(var)

with open('D:/article/food/b.txt', 'r') as f:
    var = f.read()
var = regex.sub(' ', var)  # same cleanup as document 1
var = var.lower()
document2 = tb(var)

bloblist = [document1, document2]
for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:50]:
        print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
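However, the problem is that I want to put all the files in the sport folder into one corpus and the food articles in the food folder into another corpus, so that the system gives a result for each corpus. Right now I can only compare individual files, but I want to compare between corpora. Sorry for asking this question; any help would be appreciated.
Thanks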
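One possible way to get a result per corpus rather than per file is to treat each folder as a single document: pool the words of every file in the folder, then score the pooled word lists against each other. The sketch below is only an illustration under that assumption. The helper names (`tokenize`, `load_corpus`, `tfidf_per_corpus`) are hypothetical, it uses only the standard library instead of TextBlob, and it swaps the idf above for a smoothed variant, log((1 + N) / (1 + n)) + 1, because with only two corpora the formula log(N / (1 + n)) never yields a positive score.

```python
import math
import os
import re
from collections import Counter


def tokenize(text):
    # Lower-case the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def load_corpus(folder):
    # Pool the tokens of every .txt file in `folder` into one list.
    tokens = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), "r") as f:
                tokens.extend(tokenize(f.read()))
    return tokens


def tfidf_per_corpus(corpora):
    # corpora: dict mapping a corpus name to its pooled token list.
    n_corpora = len(corpora)
    counts = {name: Counter(tokens) for name, tokens in corpora.items()}
    # For each word, the number of corpora in which it occurs at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            # tf * smoothed idf (an assumption, not the question's formula)
            word: (n / total)
            * (math.log((1 + n_corpora) / (1 + doc_freq[word])) + 1)
            for word, n in c.items()
        }
    return scores


if __name__ == "__main__":
    # Folder paths taken from the question; missing folders are skipped.
    folders = {"sport": "D:/article/sport", "food": "D:/article/food"}
    corpora = {
        name: load_corpus(path)
        for name, path in folders.items()
        if os.path.isdir(path)
    }
    for name, word_scores in tfidf_per_corpus(corpora).items():
        print("Top words in corpus {}".format(name))
        ranked = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in ranked[:50]:
            print("Word: {}, TF-IDF: {}".format(word, round(score, 5)))
```

Pooling a folder into one document keeps the comparison at the corpus level, at the cost of losing per-file detail. If per-file scores are also needed, the same token lists could be kept separately per file.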
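What I understand is that you want to compute the word frequencies of two files and store them in separate files so you can compare them. You can do that from the terminal. Here is some simple code that computes word frequencies: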
import operator
import string

keywords = []

def removePunctuation(sentence):
    # lower-case the text and drop every punctuation character
    sentence = sentence.lower()
    new_sentence = ""
    for char in sentence:
        if char not in string.punctuation:
            new_sentence = new_sentence + char
    return new_sentence

def wordFrequences(sentence):
    wordFreq = {}
    # split the argument itself rather than relying on a global variable
    split_sentence = sentence.split()
    for word in split_sentence:
        wordFreq[word] = wordFreq.get(word, 0) + 1
    # sort the (word, count) pairs by count, most frequent first
    sorted_x = sorted(wordFreq.items(), key=operator.itemgetter(1), reverse=True)
    print(sorted_x)
    for key, value in sorted_x:
        keywords.append(key)
    print(keywords)

with open('D:/article/sport/a.txt', 'r') as f:
    sentence = f.read()
# sentence = "The first test of the function some some some some"
new_sentence = removePunctuation(sentence)
wordFrequences(new_sentence)
You have to run this code twice, changing the path of the text file each time, and each time run it from the console with a command like this:
python abovecode.py > destinationfile.txt
In your case, that would be:
python abovecode.py > sportfolder/file1.txt
python abovecode.py > foodfolder/file2.txt
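Important: if you want the words together with their frequencies, omit this line:
print(keywords)
Important: if you only need the words, without their frequencies, omit this line instead:
print(sorted_x)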
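If editing the path and running the script twice gets tedious, the same idea can be sketched as a single script that loops over both files and writes each result itself instead of relying on shell redirection. This is only a sketch: `word_frequencies` and `write_frequencies` are hypothetical names, the input and output paths are the ones used above, the output folders are assumed to exist already, and the tab-separated output format is my own choice rather than what the code above prints.

```python
import os
import string
from collections import Counter


def word_frequencies(text):
    # Lower-case, strip ASCII punctuation, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    # most_common() returns (word, count) pairs, highest count first.
    return Counter(words).most_common()


def write_frequencies(source_path, destination_path):
    # Count the words of one file and write "word<TAB>count" lines.
    with open(source_path, "r") as src:
        pairs = word_frequencies(src.read())
    with open(destination_path, "w") as dst:
        for word, count in pairs:
            dst.write("{}\t{}\n".format(word, count))
    return pairs


if __name__ == "__main__":
    jobs = [
        ("D:/article/sport/a.txt", "sportfolder/file1.txt"),
        ("D:/article/food/b.txt", "foodfolder/file2.txt"),
    ]
    for source_path, destination_path in jobs:
        # Skip inputs that are not present on this machine.
        if os.path.isfile(source_path):
            write_frequencies(source_path, destination_path)
```

To keep only the words, write `word` alone in the loop inside `write_frequencies`; to keep only the pairs, the function already returns them for further processing.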