Python: how to calculate the highest tf-idf value of the top 100 words across different tweets

I have tens of thousands of tweets saved in one .txt file, and I want to find the highest tf-idf values of the top 100 words across these tweets. In other words, I want to compare a word's tf-idf value between different tweets. At present, I can only compare words' tf-idf values within the same tweet; I cannot find a way to compare a word's tf-idf value between different tweets.

Please help; I have been stuck on this problem for a long time. /(ㄒoㄒ)/~~

Below is my code (it can only calculate each term's tf-idf value within a single tweet):

# Python 2, with a scikit-learn release that still provides get_feature_names()
import csv
import re
from sklearn.feature_extraction.text import TfidfVectorizer

with open('D:/Data/ows/ows_sample.txt','rb') as f:
    tweet=f.readlines()
# strip NUL bytes before handing the lines to csv.reader
lines = csv.reader((line.replace('\x00','') for line in tweet), delimiter=',', quotechar='"')
# load the stop words once, outside the loop
stopwords = set(open("D:/Data/ows/stopwords.txt", "r").read().split())
wordterm=[]
for i in lines:
    # remove URLs and @mentions from the tweet text (column 1)
    i[1]= re.sub(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+|(?:@[\w_]+)', "", i[1])
    tweets=re.split(r"\W+",i[1])
    tweets=[w.lower() for w in tweets if w!=""]
    terms = [t for t in tweets if t not in stopwords]
    wordterm.append(terms)

# one space-joined string per tweet, which is what the vectorizer expects
word=[' '.join(t) for t in wordterm]
tfidf_vectorizer = TfidfVectorizer(min_df = 1,use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(word)
terms_name = tfidf_vectorizer.get_feature_names()
toarry=tfidf_matrix.todense()

# the code below outputs the tf-idf value of every term for each tweet
for ii in range(0,len(toarry)):
    print "Tweet " + str(ii)
    for jj in range(0,len(terms_name)):
        print terms_name[jj],'-',tfidf_matrix[ii,jj]

Now that I understand your question, I will try to answer it a little better.

Getting the top 100 tf-idf scores in a way that is comparable across all tweets means one of two things: either you let go of the notion that there are distinct tweets, or you want to compare the same words with each other by their tf-idf score in each tweet.

For the first scenario, imagine that all your words are in one 'document'. This essentially eliminates the 'idf' aspect of tf-idf, and what you get is basically a word-count vectorizer: the scores are proportional to the word counts, so they can be compared with one another, and you can get the top 100 words this way.

from sklearn.feature_extraction.text import TfidfVectorizer

words = ['the cat sat on the mat cat cat']
tfidf_vectorizer = TfidfVectorizer(min_df = 1,use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(words)
terms_name = tfidf_vectorizer.get_feature_names()   # cat, mat, on, sat, the
toarry=tfidf_matrix.todense()

toarry:
    matrix([[0.75, 0.25, 0.25, 0.25, 0.5]])
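
If the first scenario is what you want, the pooling can also be done without a vectorizer. A minimal sketch in Python 3 using only the standard library; `top_words_pooled` is a hypothetical helper name, the tokenization copies the question's `re.split(r"\W+", ...)`, and stop-word removal is left out for brevity:

```python
import re
from collections import Counter


def top_words_pooled(tweets, n=100):
    """Pool every tweet into one bag of words and return the n most frequent.

    Hypothetical helper: treats the whole corpus as a single 'document',
    so the result is a plain word count rather than a tf-idf score.
    """
    counts = Counter()
    for text in tweets:
        # Lowercase, split on non-word characters, drop empty strings.
        counts.update(w for w in re.split(r"\W+", text.lower()) if w)
    return counts.most_common(n)


tweets = ['the cat sat on the mat', 'cat cat']
print(top_words_pooled(tweets, n=2))  # [('cat', 3), ('the', 2)]
```

With the real data you would pass the list of tweet texts and keep the default `n=100`.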

The other scenario is that you treat each tweet as a separate document and then compare words by their tf-idf scores. The same word will then have different scores in different tweets, because that is what tf-idf does: it calculates the importance of the word in the document relative to the corpus.

# Python 2 syntax, as in the question
from sklearn.feature_extraction.text import TfidfVectorizer

words = ['the cat sat on the mat cat', 'the fat rat sat on a mat', 'the bat and a rat sat on a mat']
tfidf_vectorizer = TfidfVectorizer(min_df = 1,use_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(words)
terms_name = tfidf_vectorizer.get_feature_names()
toarry=tfidf_matrix.todense()
# one row per document: pair every term with its score in that document
for i in tfidf_matrix.toarray():
    print zip(terms_name, i)

[(u'and', 0.0), (u'bat', 0.0), (u'cat', 0.78800079617844954), (u'fat', 0.0), (u'mat', 0.23270298212286766), (u'on', 0.23270298212286766), (u'rat', 0.0), (u'sat', 0.23270298212286766), (u'the', 0.46540596424573533)]
[(u'and', 0.0), (u'bat', 0.0), (u'cat', 0.0), (u'fat', 0.57989687146162439), (u'mat', 0.34249643393071422), (u'on', 0.34249643393071422), (u'rat', 0.44102651785124652), (u'sat', 0.34249643393071422), (u'the', 0.34249643393071422)]
[(u'and', 0.50165133177159349), (u'bat', 0.50165133177159349), (u'cat', 0.0), (u'fat', 0.0), (u'mat', 0.29628335772067432), (u'on', 0.29628335772067432), (u'rat', 0.38151876810273028), (u'sat', 0.29628335772067432), (u'the', 0.29628335772067432)]
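
To see where these numbers come from, the scores can be reproduced by hand. The sketch below (Python 3, standard library only) assumes the defaults that scikit-learn documents for `TfidfVectorizer`: tokens of two or more word characters (which is why 'a' is missing from the feature list), raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document. `tfidf_by_hand` is a hypothetical name, not part of any library.

```python
import math
import re
from collections import Counter


def tfidf_by_hand(docs):
    """Return one {term: score} dict per document (hypothetical helper)."""
    # Keep tokens of two or more word characters, so 'a' is dropped.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1.
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # L2-normalize; fall back to 1.0 for a document with no tokens.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        scores.append({t: v / norm for t, v in raw.items()})
    return scores


docs = ['the cat sat on the mat cat',
        'the fat rat sat on a mat',
        'the bat and a rat sat on a mat']
for row in tfidf_by_hand(docs):
    print(sorted(row.items()))
```

'cat' occurs twice in the first sentence and in no other, so both its count and its idf are high there; 'the' occurs in every sentence, so its idf factor is only 1.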

As you can see in the results, the same word has a different score in each document, because tf-idf scores a term within a particular document. These are the two approaches available to you; which one is better depends on what you want.
