
Python: Iterating through lines in files

queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
for query in queries:
    for tweet in tweets:
        query_words = query.split()
        tweet_words = tweet.split()
        for qword in query_words:
            for tword in tweet_words:
                # Comparison

I'm trying to use Python to iterate over two files, each containing multiple lines. I want to split each line in both files into words, and then compare each word in the current line of the "query" file with each word in the current line of the "tweet" file. The code above is what I have so far, but it only works for the first line of the query file and skips the rest of its lines. It does work for every line in the tweet file. Any help?

Edit for the duplicate comment: I understand that after iterating over the queries file, the file handle will be positioned at EOF. But I don't get why it isn't processing the next line in the queries file and instead goes directly to EOF.

Essentially, you go through all the lines in one file while looking only at the first line of the other file. You cannot go through those lines again on the next iteration, because you have already read them.

Do it like this:

queries = open(sys.argv[1],"rU").readlines()
tweets = open(sys.argv[2],"rU").readlines()

for i in range(min(len(queries), len(tweets))):
    tweet = tweets[i]
    query = queries[i]

    # comparison
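As a rough alternative sketch of the same line-by-line pairing, zip can walk both files in step without loading them into lists; like the min(len(...)) bound above, it stops at the shorter input. Here io.StringIO stands in for the two opened files, the function name compare_pairwise is hypothetical, and "comparison" is assumed to mean finding the words the two lines share, which the question does not actually specify.

```python
import io


def compare_pairwise(queries, tweets):
    """Yield (line_number, shared_words) for lines at the same position."""
    for lineno, (query, tweet) in enumerate(zip(queries, tweets), start=1):
        # Assumption: "comparison" means collecting the words both lines contain.
        shared = set(query.split()) & set(tweet.split())
        yield lineno, shared


# io.StringIO objects stand in for the files opened from sys.argv.
queries = io.StringIO("red apple\ngreen pear\n")
tweets = io.StringIO("i ate an apple\nno fruit today\nextra line\n")

results = list(compare_pairwise(queries, tweets))
for lineno, shared in results:
    print(lineno, sorted(shared))
```

The third tweet line is never visited because the queries input runs out first.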

The problem is that after you iterate through every line of a file, you are at EOF. You either have to open it again, or make sure each line is processed as expected (split and compared, in your example) before reading or iterating to the next line. In your example, the file tweets is at EOF after the first iteration over queries, so it looks as if queries "skipped" to EOF from the second iteration on, simply because there are no more tweets to iterate through in the nested loop.
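A small sketch can make the exhaustion visible. It uses io.StringIO objects as stand-ins for the two files (an assumption made only so the example is self-contained) and counts how many times the inner loop body runs for each outer line.

```python
import io

# Stand-ins for the opened files; real file objects behave the same way here.
queries = io.StringIO("q1\nq2\nq3\n")
tweets = io.StringIO("t1\nt2\n")

inner_counts = []
for query in queries:
    count = 0
    for tweet in tweets:  # the first pass consumes tweets up to EOF
        count += 1
    inner_counts.append(count)

# Every outer line is visited, but only the first one sees any tweets.
print(inner_counts)
```

The outer loop still runs once per query line; it is the inner loop that has nothing left to yield after the first pass.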

Also, although garbage collection handles file closing for you, it is still better practice to explicitly close each opened file.
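One common way to get explicit closing is a with block, sketched below. The temporary file and its contents are made up for the illustration. The "U" flag is left out because Python 3.11 removed it, and universal newlines are already the default for text mode in Python 3.

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as handle:
    lines = handle.readlines()

# The handle is closed as soon as the with block exits, even on an exception.
closed_after_block = handle.closed
print(closed_after_block, lines)

os.remove(path)  # clean up the throwaway file
```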

Refer to @Smac89's answer for the modification.

Instead of using for loops like that, use the function file.readline():

queries = open(sys.argv[1],"rU")
tweets = open(sys.argv[2],"rU")
query = queries.readline()
tweet = tweets.readline()
while (query != "" and tweet != ""):
    query_words = query.split()
    tweet_words = tweet.split()
    #comparison
    query = queries.readline()
    tweet = tweets.readline()

mirosval provided an easier answer; use his.

Consider using file.seek:

with open(sys.argv[1],"rU") as queries:
    with open(sys.argv[2],"rU") as tweets:
        for query in queries:
            query_words = query.split()
            for tweet in tweets:
                tweet_words = tweet.split()
                for qword in query_words:
                    for tword in tweet_words:
                        #Comparison
            tweets.seek(0) # go back to the start of the file

You want to iterate over the second file for each line of the first file. But look at what happens:

  • you open both files
  • you start iterating over the first file
  • you get the first line of the first file
  • you iterate over the second file to the end => the pointer of the second file is at EOF
  • you try processing the second line of the first file
  • the pointer of the second file is already at EOF, so you immediately move on to the next line of the first file without any processing

So you have to rewind the second file after each iteration of the first file. There are two ways to do it:

  • load the second file into memory as a list of lines with readlines and iterate through this list. Because it is a list (and not a file), each iteration starts at the first position instead of the current one

     queries = open(sys.argv[1],"rU")
     tweets_file = open(sys.argv[2],"rU")
     tweets = tweets_file.readlines() # tweets is now a list of lines
     for query in queries:
         for tweet in tweets:
             query_words = query.split()
             tweet_words = tweet.split()
             for qword in query_words:
                 for tword in tweet_words:
                     #Comparison

  • explicitly rewind the file with seek

     queries = open(sys.argv[1],"rU")
     tweets = open(sys.argv[2],"rU")
     for query in queries:
         for tweet in tweets:
             query_words = query.split()
             tweet_words = tweet.split()
             for qword in query_words:
                 for tword in tweet_words:
                     #Comparison
         tweets.seek(0) # explicitly rewind tweets

The first solution reads the second file only once but uses more memory. It should be preferred if the second file is small (less than several hundred MB on recent machines). The second solution uses less memory and should be preferred if the second file is huge, or if you have to save memory for any reason (embedded system, lower impact of a script, ...).
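To illustrate that the two solutions visit the same query/tweet combinations, here is a minimal sketch that wraps each one in a function. The names pairs_with_list and pairs_with_seek are hypothetical, io.StringIO replaces the real files, and the word-by-word comparison is reduced to recording which lines get paired.

```python
import io


def pairs_with_list(queries, tweets_file):
    """First solution: read the second file once into a list of lines."""
    tweets = tweets_file.readlines()
    return [(q.strip(), t.strip()) for q in queries for t in tweets]


def pairs_with_seek(queries, tweets_file):
    """Second solution: rewind the second file after each outer line."""
    pairs = []
    for q in queries:
        for t in tweets_file:
            pairs.append((q.strip(), t.strip()))
        tweets_file.seek(0)  # go back to the start for the next query
    return pairs


a = pairs_with_list(io.StringIO("q1\nq2\n"), io.StringIO("t1\nt2\n"))
b = pairs_with_seek(io.StringIO("q1\nq2\n"), io.StringIO("t1\nt2\n"))
print(a == b, a)
```

The list version holds every tweet line in memory at once, while the seek version rereads the input for each query line, which is the trade-off described above.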

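Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you republish them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.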

 