
How to properly loop through two files, comparing strings in both files against each other

I am having trouble doing a sentiment analysis of tweets (file 1, standard Twitter JSON response) against a list of words (file 2, tab-delimited, two columns) with their sentiment assigned to them (either positive or negative).

The problem is that the top loop only runs once and then the script ends. I am looping through file 1 and, nested within that loop, looping through file 2, trying to compare the words and keep a running sum of the combined sentiment for each tweet.

So I have:

def get_sentiments(tweet_file, sentiment_file):


    sent_score = 0
    for line in tweet_file:

        document = json.loads(line)
        tweets = document.get('text')

        if tweets != None:
            tweet = str(tweets.encode('utf-8'))

            #print tweet


            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())

                #print score



                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score



            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')


get_sentiments(file1, file2)

I've spent the better part of a day trying to figure out why it prints out all the tweets when the nested for loop over file2 is removed, but with it, it only processes the first tweet and then exits.

The reason it's only doing it once is that the inner for loop has reached the end of the sentiment file, so it stops since there are no more lines to read.

In other words, the first time your inner loop runs, it steps through the entire file. A file object remembers its position, so for every later tweet the loop starts at the end of the file and has nothing left to iterate over. The result is that only the first tweet is ever compared against the word list.
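The effect can be reproduced without any real files. The following is a minimal sketch that uses io.StringIO as a stand-in for an open file; the two sample words are made up for illustration.

```python
import io

# io.StringIO behaves like an open text file; the contents are made up.
sentiment_file = io.StringIO("good\t1\nbad\t-1\n")

first_pass = [line for line in sentiment_file]   # reads both lines
second_pass = [line for line in sentiment_file]  # position is already at the end

print(len(first_pass), len(second_pass))
```

The second pass finds nothing because nothing moved the file position back to the start.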

So one way to solve this is to "rewind" the file before each pass; you can do that with the seek method of the file object.
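As a rough sketch of the rewind approach, assuming the same tab-delimited layout as the question: the function name score_tweets and the sample data are hypothetical, and io.StringIO again stands in for the real sentiment file. It keeps the question's substring test (word in tweet) and resets the total for each tweet.

```python
import io

def score_tweets(tweets, sentiment_file):
    """Return one score per tweet, rewinding the word list before each pass."""
    results = []
    for tweet in tweets:
        sentiment_file.seek(0)  # move back to the start of the file
        total = 0
        for line in sentiment_file:
            word, score = line.rstrip("\n").split("\t")
            if word in tweet:  # substring match, as in the question
                total += int(score)
        results.append(total)
    return results

# Made-up sample data standing in for the real files.
sentiment_file = io.StringIO("good\t1\nbad\t-1\n")
totals = score_tweets(["a good day", "a bad day", "good and bad"], sentiment_file)
print(totals)
```

This works, but it rereads and reparses the whole word list for every tweet, which gets slow as either file grows.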

If your files aren't big, another approach is to read them all into a list or similar structure and then loop through it.
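A sketch of the list approach might look like the following; the names entries and score_tweet are illustrative, and the sample data is made up. The file is read and parsed once, and the resulting list can be looped over any number of times.

```python
import io

# Made-up sample data standing in for the real sentiment file.
sentiment_file = io.StringIO("good\t1\nbad\t-1\n")

# Parse the file once into a list of (word, score) pairs.
entries = []
for line in sentiment_file:
    word, score = line.rstrip("\n").split("\t")
    entries.append((word, int(score)))

def score_tweet(tweet, entries):
    """Sum the scores of every listed word that occurs in the tweet."""
    return sum(score for word, score in entries if word in tweet)

totals = [score_tweet(t, entries) for t in ["a good day", "a bad day"]]
print(totals)
```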

However, since your sentiment score is a simple lookup, the best approach would be to build a dictionary with the sentiment scores, then look up each word in the dictionary to calculate the overall sentiment of the tweet:

import csv
import json

scores = {}  # maps each word to its sentiment score

# Load the tab-delimited word list once, before reading any tweets.
with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scores[row[0].strip()] = int(row[1].strip())

with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
        # Keep the text as a string (Python 3) so its words match the dictionary keys.
        text = tweet.get('text', '')
        if text:
            total_sentiment = sum(scores.get(word, 0) for word in text.split())
            print("{}: {}".format(text, total_sentiment))

The with statement automatically closes the files when its block ends. I am using the csv module to read the sentiment file (it works for tab-delimited files as well).

This line does the calculation:

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way to write this loop:

tweet_score = []
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])

total_sentiment = sum(tweet_score)

The get method of dictionaries takes an optional second argument, which is returned when the key cannot be found; if you omit this second argument, it will return None. In my loop I am using it to return 0 if the word has no score.
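A small illustration of that behaviour, using a made-up two-word dictionary:

```python
scores = {"good": 1, "bad": -1}  # made-up sample scores

found = scores.get("good")         # key exists, so its value is returned
missing = scores.get("meh")        # no default given, so None is returned
defaulted = scores.get("meh", 0)   # default given, so 0 is returned

# Words without a score contribute 0 to the sum.
total = sum(scores.get(word, 0) for word in "a good but bad good day".split())
print(found, missing, defaulted, total)
```

Without the default of 0, sum would fail on the first unknown word, because None cannot be added to an integer.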


 