How to correctly iterate over two files and compare the strings in them against each other

Question

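I'm having trouble running sentiment analysis on tweets (file 1, standard Twitter JSON responses) against a word list (file 2, tab-delimited, two columns) that assigns each word a sentiment score (positive or negative).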

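The problem: the top loop only runs once and then the script ends. What I want is to loop through file 1 and, nested inside that loop, loop through file 2, comparing the words and keeping a running sum of the combined sentiment for each tweet.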

So I have:

import json

def get_sentiments(tweet_file, sentiment_file):


    sent_score = 0
    for line in tweet_file:

        document = json.loads(line)
        tweets = document.get('text')

        if tweets != None:
            tweet = str(tweets.encode('utf-8'))

            #print tweet


            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())

                #print score



                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score



            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')


get_sentiments(file1, file2)

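I've spent the better part of a day trying to figure out why it prints every tweet when the nested for loop over file2 is left out, but with that loop in place it only processes the first tweet and then quits.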

Answer 1

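The reason it only does this once is that the nested for loop has reached the end of the file, so it stops because there are no more lines to read.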

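In other words, the first time your nested loop runs, it steps through the whole file. After that there are no more lines to read (it has reached the end of the file), so it never loops again, and only one line ends up being processed.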
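To see the exhaustion on its own, here is a minimal sketch in Python 3. It assumes `io.StringIO` as a stand-in for an opened text file (both are consumed in a single pass), and the two sample words and scores are made up for illustration.

```python
import io

# io.StringIO stands in for an opened text file; the sample data is hypothetical.
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")

first_pass = [line for line in sentiment_file]   # consumes every line
second_pass = [line for line in sentiment_file]  # nothing left to read

print(len(first_pass), len(second_pass))
```

The second pass is empty because the file position is still at the end after the first pass.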

So, one way to solve this is to "rewind" the file, which you can do with the file object's seek method.
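A rough sketch of the rewind approach in Python 3 might look like the following. The `io.StringIO` object again stands in for the opened sentiment file, the tweets and scores are invented sample data, and matching on whole words via `split()` is an assumption of this sketch rather than what the question's code does.

```python
import io

# Hypothetical sample data; io.StringIO stands in for open('sentimentfile.txt').
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")
tweets = ["a good day", "a bad day"]

totals = []
for tweet in tweets:
    sentiment_file.seek(0)  # rewind to the start before every inner pass
    total = 0               # start a fresh sum for each tweet
    words = tweet.split()
    for line in sentiment_file:
        word, score = line.rstrip("\n").split("\t")
        if word in words:
            total += int(score)
    totals.append(total)

print(totals)
```

This works, but it re-reads the whole sentiment file once per tweet, which gets slow as either file grows.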

If the files aren't large, another way is to read them entirely into a list or a similar structure and then loop over that.
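As an illustration of the list approach, the sketch below (Python 3) reads the sentiment lines once and then reuses the list for every tweet. The `io.StringIO` stand-in, the sample tweets, and the name `pairs` are assumptions made for the example.

```python
import io

# Hypothetical sample data; io.StringIO stands in for open('sentimentfile.txt').
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")

# Read the file once; a list can be looped over as many times as needed.
pairs = []
for line in sentiment_file:
    word, score = line.rstrip("\n").split("\t")
    pairs.append((word, int(score)))

totals = []
for tweet in ["a good day", "a bad day", "good or bad"]:
    words = tweet.split()
    totals.append(sum(score for word, score in pairs if word in words))

print(totals)
```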

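However, since your sentiment score is a simple lookup, the best approach is to build a dictionary of the sentiment scores and then look up each word of the tweet in that dictionary to calculate the tweet's overall sentiment: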

import csv
import json

scores = {}  # empty dictionary to store scores for each word

with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scores[row[0].strip()] = int(row[1].strip()) 


with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
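        # Written for Python 2, where encode() yields a plain str; on Python 3,
        # drop .encode('utf-8') so the words stay str and match the keys in scores.
        text = tweet.get('text','').encode('utf-8')
        if text:
            total_sentiment = sum(scores.get(word,0) for word in text.split())
            print("{}: {}".format(text,total_sentiment))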

The with statement automatically closes the file handle. I'm using the csv module to read the file (it works for tab-delimited files as well).
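Both points can be checked with a small Python 3 sketch. It writes a throwaway tab-delimited file with made-up contents (the temporary file is an assumption of the example, not part of the original setup), reads it back with `csv.reader`, and then inspects the handle.

```python
import csv
import os
import tempfile

# Create a small tab-delimited file with hypothetical contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, newline="") as tmp:
    tmp.write("good\t3\nbad\t-3\n")
    path = tmp.name

with open(path, newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

print(f.closed)  # the handle is closed once the with block ends
print(rows)      # each row is a list of column strings

os.remove(path)  # clean up the temporary file
```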

This line does the calculation:

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way of writing this loop:

tweet_score = []
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])

total_score = sum(tweet_score)

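A dictionary's get method takes an optional second argument: a custom value to return when the key isn't found. If you omit the second argument, it returns None. In my loop, I use it to return 0 when a word has no score.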
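A short Python 3 sketch of that behaviour, using a made-up two-word `scores` dictionary:

```python
# Hypothetical scores, for illustration only.
scores = {"good": 3, "bad": -3}

present = scores.get("good")         # key exists, so its value is returned
missing = scores.get("meh")          # no default given, so this is None
with_default = scores.get("meh", 0)  # falls back to 0 for unknown words

total = sum(scores.get(word, 0) for word in "a good day".split())

print(present, missing, with_default, total)
```

Without the default of 0, `sum` would fail as soon as it hit a `None` for an unscored word.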

Solution 1
3 Accepted 2013-05-06 04:53:23
