How to correctly loop through two files and compare the strings in them against each other

Question

I am having trouble running a sentiment analysis of tweets (file 1, standard Twitter JSON response) against a list of words (file 2, tab-delimited, two columns), each with a sentiment assigned to it (either positive or negative).

The problem is that the top loop only runs once and then the script ends. I am looping through file 1 and, nested within that, looping through file 2, comparing the words and keeping a running sum of the combined sentiment for each tweet.

So I have:

import json

def get_sentiments(tweet_file, sentiment_file):
    sent_score = 0
    for line in tweet_file:
        document = json.loads(line)
        tweets = document.get('text')

        if tweets is not None:
            tweet = str(tweets.encode('utf-8'))
            #print tweet

            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())
                #print score

                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score

            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')

get_sentiments(file1, file2)

I've spent the better part of a day trying to figure out why it prints all the tweets without the nested for loop over file2, but with it, it only processes the first tweet and then exits.

Answer 1

The reason it's only doing it once is that the inner for loop has reached the end of the sentiment file, so it stops since there are no more lines to read.

In other words, the first time your inner loop runs, it steps through the entire sentiment file. A file object does not start over by itself, so for every later tweet there are no more lines to read (it has already reached the end of the file) and the inner loop body never runs again, resulting in only one tweet being scored.
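A minimal sketch of that behavior, using io.StringIO as a stand-in for an open text file. The two sample rows are made up for illustration and are not from the asker's data:

```python
import io

# io.StringIO behaves like an open text file; the contents are made-up sample data.
sentiment_file = io.StringIO("good\t3\nbad\t-3\n")

first_pass = [line for line in sentiment_file]   # reads both lines
second_pass = [line for line in sentiment_file]  # the file is already at its end

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

The second loop is not skipped because of an error; it simply has nothing left to iterate over, which is why the script appears to stop scoring after the first tweet.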

So one way to solve this is to "rewind" the file. You can do that with the seek method of the file object, for example by calling sentiment_file.seek(0) before each pass over it.
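As a rough sketch of the rewind approach, assuming Python 3: the function name score_tweets and the sample data are hypothetical, and io.StringIO stands in for the real sentiment file. Unlike the question's code, this version resets the total for each tweet:

```python
import io

def score_tweets(tweets, sentiment_file):
    """Return one score per tweet, rewinding the sentiment file for each tweet."""
    results = []
    for tweet in tweets:
        sentiment_file.seek(0)  # rewind so the inner loop sees every line again
        total = 0
        for row in sentiment_file:
            word, score = row.rstrip('\n').split('\t')
            if word.strip() in tweet:  # substring test, as in the question
                total += int(score)
        results.append(total)
    return results

# Hypothetical sample data standing in for the two files.
sentiments = io.StringIO("good\t3\nbad\t-3\n")
scores_per_tweet = score_tweets(["a good day", "a bad day", "good and bad"], sentiments)
print(scores_per_tweet)  # [3, -3, 0]
```

This works, but it re-reads and re-parses the whole sentiment file once per tweet, so it gets slow as either file grows.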

If your files aren't big, another approach is to read them all into a list or similar structure and then loop through it.
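A possible sketch of the read-into-a-list approach, again assuming Python 3. The helper name load_sentiments and the sample rows are illustrative only:

```python
import io

def load_sentiments(sentiment_file):
    """Read the whole tab-delimited file into a list of (word, score) tuples."""
    pairs = []
    for row in sentiment_file:
        word, score = row.split('\t')
        pairs.append((word.strip(), int(score.strip())))
    return pairs

# io.StringIO stands in for open('sentimentfile.txt'); the rows are made up.
sentiment_pairs = load_sentiments(io.StringIO("good\t3\nbad\t-3\n"))

# Unlike a file object, a list can be looped over as many times as needed.
for tweet in ["a good day", "a bad day"]:
    total = sum(score for word, score in sentiment_pairs if word in tweet)
    print(tweet, total)
```

The file is parsed only once here, at the cost of holding all of its rows in memory.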

However, since your sentiment score is a simple lookup, the best approach would be to build a dictionary with the sentiment scores, then look up each word in the dictionary to calculate the overall sentiment of the tweet:

import csv
import json

scores = {}  # maps each word to its sentiment score

with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scores[row[0].strip()] = int(row[1].strip())

with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
        # Python 3: json.loads already returns str; on Python 2, add .encode('utf-8')
        text = tweet.get('text', '')
        if text:
            # sum the score of every known word; unknown words count as 0
            total_sentiment = sum(scores.get(word, 0) for word in text.split())
            print("{}: {}".format(text, total_sentiment))

The with statement automatically closes the file when its block ends. I am using the csv module to read the sentiment file (with delimiter='\t' it works for tab-delimited files as well).

This line does the calculation:

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way to write this loop:

tweet_score = []  # scores of the words in this tweet that have an entry
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])

total_sentiment = sum(tweet_score)

The get method of dictionaries takes an optional second argument to return a custom value when the key cannot be found; if you omit this second argument, it returns None. In my code I am using it to return 0 if the word has no score.
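To make the behavior of get concrete, here is a small sketch with made-up scores and a made-up tweet text:

```python
scores = {"good": 3, "bad": -3}  # made-up sample scores

print(scores.get("good"))        # 3
print(scores.get("unknown"))     # None
print(scores.get("unknown", 0))  # 0

text = "good morning, bad traffic, good coffee"
total_sentiment = sum(scores.get(word, 0) for word in text.split())
print(total_sentiment)  # 3
```

Note that text.split() only splits on whitespace, so a token such as "good," with trailing punctuation would not match the key "good". This is also different from the substring test (word in tweet) used in the question, which would match "good" inside "goodbye".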

Answer 1 (3, accepted, 2013-05-06 04:53:23)
