
How to properly loop through two files, comparing strings in both files against each other

I am having trouble doing a sentiment analysis of tweets (file 1, a standard Twitter JSON response) against a list of words (file 2, tab-delimited, two columns), each with a sentiment score assigned to it (either positive or negative).

The problem is that the outer loop runs only once and then the script ends. I loop through file 1 and, nested within that loop, I loop through file 2, comparing words and keeping a running sum of the combined sentiment for each tweet.

So I have:

import json

def get_sentiments(tweet_file, sentiment_file):
    sent_score = 0
    for line in tweet_file:
        document = json.loads(line)
        tweets = document.get('text')

        if tweets is not None:
            tweet = str(tweets.encode('utf-8'))
            #print tweet

            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())
                #print score

                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score

            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')

get_sentiments(file1, file2)

I've spent the better half of a day trying to figure out why it prints all the tweets when the nested for loop over file2 is absent, but with it, it processes only the first tweet and then exits.

It only does this once because the inner for loop has reached the end of the sentiment file, so it stops: there are no more lines to read.

In other words, the first time your inner loop runs, it steps through the entire file. A file object is not rewound automatically, so for every later tweet the loop has nothing left to read, and only one tweet is actually scored.
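The exhaustion described above is easy to reproduce in isolation. This is a minimal sketch in Python 3, assuming an in-memory `io.StringIO` as a stand-in for the opened sentiment file; the two sample rows are made up for illustration.

```python
import io

# io.StringIO stands in for open('sentimentfile.txt'); the rows are made up.
sentiment_file = io.StringIO("good\t3\nbad\t-2\n")

# The first pass consumes every line and leaves the position at end of file.
first_pass = [line.rstrip('\n') for line in sentiment_file]

# The second pass starts at end of file, so it yields nothing.
second_pass = [line.rstrip('\n') for line in sentiment_file]

print(len(first_pass))   # 2
print(len(second_pass))  # 0
```

A real file object returned by `open()` behaves the same way when iterated twice.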

One way to solve this is to "rewind" the file before each pass. You can do that with the seek method of the file object, for example sentiment_file.seek(0).
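Here is a rough sketch of the rewind approach in Python 3. The function name `score_tweets`, the sample data, and the use of `io.StringIO` in place of real files are all assumptions for illustration. It keeps the substring test (`word in text`) from the question and resets the total for each tweet.

```python
import io
import json

def score_tweets(tweet_file, sentiment_file):
    """Return one sentiment total per tweet (hypothetical helper)."""
    totals = []
    for line in tweet_file:
        text = json.loads(line).get('text')
        if text is None:
            continue  # not a tweet with text, nothing to score
        sentiment_file.seek(0)  # rewind so the inner loop sees every line again
        total = 0
        for entry in sentiment_file:
            word, score = entry.rstrip('\n').split('\t')
            if word.strip() in text:
                total += int(score)
        totals.append(total)
    return totals

# In-memory stand-ins for tweetsfile.txt and sentimentfile.txt (made-up data).
tweets = io.StringIO('{"text": "good day"}\n{"text": "bad and good"}\n{"id": 1}\n')
sentiments = io.StringIO("good\t3\nbad\t-2\n")
print(score_tweets(tweets, sentiments))  # [3, 1]
```

This works, but it re-reads and re-parses the whole sentiment file for every tweet, which is why the approaches that follow are usually preferable.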

If your files aren't big, another approach is to read them all into a list or similar structure and then loop through it.
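A sketch of the list approach, again in Python 3 with `io.StringIO` standing in for the real file and made-up data. The names `entries`, `tweets`, and `totals` are illustrative only.

```python
import io

# io.StringIO stands in for open('sentimentfile.txt'); the rows are made up.
sentiment_file = io.StringIO("good\t3\nbad\t-2\n")

# Read the file once into a list of (word, score) pairs.
entries = []
for line in sentiment_file:
    word, score = line.rstrip('\n').split('\t')
    entries.append((word.strip(), int(score)))

# Unlike a file object, a list can be looped over as many times as needed.
tweets = ["good day", "bad and good"]
totals = [sum(score for word, score in entries if word in tweet)
          for tweet in tweets]
print(totals)  # [3, 1]
```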

However, since your sentiment score is a simple lookup, the best approach would be to build a dictionary of the sentiment scores, then look up each word of the tweet in the dictionary to calculate its overall sentiment:

import csv
import json

scores = {}  # maps each word to its sentiment score

with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        scores[row[0].strip()] = int(row[1].strip())

with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
        # Assumes Python 3, where json.loads returns str and no encode() is needed
        text = tweet.get('text', '')
        if text:
            # Sum the scores of the whitespace-separated words; unknown words count as 0
            total_sentiment = sum(scores.get(word, 0) for word in text.split())
            print("{}: {}".format(text, total_sentiment))

The with statement automatically closes the files. I am using the csv module to read the sentiment file (it works for tab-delimited files as well).

This line does the calculation:

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way to write this loop:

tweet_score = []  # scores of the words that appear in the dictionary
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])

total_sentiment = sum(tweet_score)

The get method of dictionaries takes an optional second argument to return a custom value when the key cannot be found; if you omit this second argument, it returns None. In my loop I am using it to return 0 if the word has no score.
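A small illustration of that behavior, using a made-up `scores` dictionary and sample sentence:

```python
# Made-up scores for illustration only.
scores = {'good': 3, 'bad': -2}

known = scores.get('good')         # key present: returns its value
missing = scores.get('meh')        # key absent, no default: returns None
defaulted = scores.get('meh', 0)   # key absent, default given: returns 0

# Because unknown words contribute 0, the sum never hits a None.
total = sum(scores.get(word, 0) for word in 'a good but bad day'.split())

print(known, missing, defaulted, total)  # 3 None 0 1
```

Without the default of 0, `sum` would raise a TypeError as soon as it reached a word that is not in the dictionary.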
