Read a line, store it in a variable, then read another line and come back to the first line (Python 2)

Question

This is a tricky question and I've read a lot of posts about it, but I haven't been able to make it work.

I have a big file. I need to read it line by line, and once I reach a line of the form "Total is: (any decimal number)", take that line and save the number in a variable. If the number is greater than 40.0, I then need to find the fourth line above the Total line (for example, if the Total line were line 39, this would be line 35). That line will be in the format "(number).(space)(substring)". Finally, I need to parse this substring out and do further processing on it.

This is an example of what an input file might look like:

many lines that we don't care about
many lines that we don't care about
...
1. Hi45
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space
...
more lines we don't care about
and then more lines and then
again we get
2. How144
People: bla bla bla bla bla bla
whitespace
bla bla bla bla bla
Total is: (*here there will be a decimal number*)
bla bla
white space

I have tried many things, including using the re.search() method to capture what I need from each line I need to focus on.

Here is my code, which I modified from another Stack Overflow Q&A:

import re
import linecache
number = ""
higher_line = ""
found_line = ""

with open("filename_with_many_lines.txt") as aFile:
    for num, line in enumerate(aFile, 1):
        searchObj = re.search(r'(\bTotal\b)(\s)(\w+)(\:)(\s)(\d+\.\d+)', line)
        if searchObj:
            print "this is on line", line
            print "this is the line number:", num
            var1 = searchObj.group(6)
            print var1
            if float(var1) > 40.0:
                number = num
                higher_line = number - 4
                print number
                print higher_line

                found_line = linecache.getline("filename_with_many_lines.txt", higher_line)
                print "found the", found_line

The expected output would be:

this is on line Total is: 45.5
this is the line number: 14857
14857
14853
found the 1. Hi45
this is on line Total is: 62.1
this is the line number: 14985
14985
14981
found the 2. How144

Answer 1

If the line you need is always four lines above the "Total is:" line, you could keep the previous lines in a bounded deque.

from collections import deque

with open(filename, 'r') as file:
    previous_lines = deque(maxlen=4)
    for line in file:
        if line.startswith('Total is: '):
            try:
                higher_line = previous_lines[-4]
                # store higher_line, do calculations, whatever
                break  # if you only want to do this once; else let it keep going
            except IndexError:
                # we don't have four previous lines yet
                # I've elected to simply skip this total line in that case
                pass
        previous_lines.append(line)

A bounded deque (one with a maximum length) will discard an item from the opposite side if adding a new item would cause it to exceed its maximum length. In this case, we're appending strings to the right side of the deque, so once the length of the deque reaches 4, each new string we append to the right side will cause it to discard one string from the left side. Thus, at the start of each iteration of the for loop (once at least four lines have been read), the deque will contain the four lines prior to the current line, with the oldest line at the far left (index 0).
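The discarding behaviour is easy to check by pushing more than four items through a bounded deque. This is only an illustrative sketch; the strings are made-up stand-ins for lines read from a file.

```python
from collections import deque

# A bounded deque that remembers at most four items.
previous_lines = deque(maxlen=4)

# Hypothetical stand-ins for lines read from a file.
for text in ["line 1", "line 2", "line 3", "line 4", "line 5", "line 6"]:
    previous_lines.append(text)

# Only the four most recent items survive; the oldest sits at index 0.
snapshot = list(previous_lines)
oldest = previous_lines[0]   # same item as previous_lines[-4] once the deque is full
newest = previous_lines[-1]
```

Once the deque is full, index 0 and index -4 refer to the same item, which is why the answer's code can read the line four back with previous_lines[-4].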

In fact, the documentation on collections.deque mentions use cases very similar to ours:

Bounded length deques provide functionality similar to the tail filter in Unix. They are also useful for tracking transactions and other pools of data where only the most recent activity is of interest.

Answer 2

This stores the line which starts with a number and a dot in a variable called prevline. We print prevline only if re.search returns a match object for the Total line and the number on it is greater than 40.0.

import re

with open("file") as aFile:
    prevline = ""
    for num, line in enumerate(aFile, 1):
        # Remember the most recent line that starts with a number and a dot.
        # Assigning (rather than appending) keeps entries from different
        # blocks from running together when a total is 40.0 or less.
        m = re.match(r'\d+\.\s*.*', line)
        if m:
            prevline = m.group()

        # Search for "Total", a word, a colon and a float number.
        # Group 1 is the whole "Total is: N" text, group 2 is the number.
        searchObj = re.search(r'(\bTotal\b\s+\w+:\s+(\d+\.\d+))', line)
        if searchObj:
            score = float(searchObj.group(2))
            # Report only totals greater than 40.0.
            if score > 40.0:
                print "this is on line", searchObj.group(1)
                print "this is the line number: ", num
                print num
                print num - 4
                print "found the", prevline
                prevline = ""

Output:

this is on line Total is: 45.5
this is the line number:  8
8
4
found the 1. Hi45
this is on line Total is: 62.1
this is the line number:  20
20
16
found the 2. How144

Answer 3

I suggested an edit to Blacklight Shining's post that built on its deque solution, but it was rejected with the suggestion that it instead be made into an answer. Below, I show how Blacklight's solution does solve your problem, if you were to just stare at it for a moment.

with open(filename, 'r') as file:
    # Clear: we don't care about checking the first 4 lines for totals.
    # Instead, we just store them for later.
    previousLines = []
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())
    previousLines.append(file.readline())

    # The earliest we should expect a total is at line 5.
    for lineNum, line in enumerate(file, 5):
        if line.startswith('Total is: '):
            prevLine = previousLines[0]
            high_num = prevLine.split()[1] # A: the substring after "(number). "
            score = float(line[len('Total is: '):].strip()) # B: slice off the prefix; str.strip() takes a set of characters, not a prefix

            if score > 40.0:
                # That's all! We've now got everything we need.
                # Display results as shown in example code.
                print "this is the line number : ", lineNum
                print "this is the line ", line.strip('\n')
                print lineNum
                print (lineNum - 4)
                print "found the ", prevLine

        # Critical - remove the oldest line & push the current line onto the list.
        previousLines = previousLines[1:] + [line]

I don't take advantage of deque, but my code accomplishes the same thing imperatively. I don't think it's necessarily a better answer than either of the others; I'm posting it to show how the problem you're trying to solve can be addressed with a very simple algorithm and simple tools. (Compare Avinash's clever 17-line solution with my dumbed-down 18-line solution.)

This simplified approach won't make you look like a wizard to anyone reading your code, but it also won't accidentally match on anything in the intervening lines. If you're dead set on hitting your lines with a regex, then just modify lines A and B. The general solution still works.
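If you do want regexes for lines A and B, one possible shape is sketched here. The helper names parse_total and parse_entry are hypothetical, and the patterns assume the two formats from the question's sample input ("Total is: (number)" and "(number). (substring)"); adjust them if your file differs.

```python
import re

# Assumed formats, taken from the question's sample input.
TOTAL_RE = re.compile(r'^Total is:\s*(\d+(?:\.\d+)?)')
ENTRY_RE = re.compile(r'^\d+\.\s*(.+)$')

def parse_total(line):
    """Line B: return the number on a 'Total is: N' line as a float, else None."""
    m = TOTAL_RE.match(line)
    return float(m.group(1)) if m else None

def parse_entry(line):
    """Line A: return the substring after '(number).', else None."""
    m = ENTRY_RE.match(line.strip())
    return m.group(1) if m else None

total = parse_total("Total is: 45.5\n")
entry = parse_entry("1. Hi45\n")
```

Both helpers return None for lines that do not fit, so the caller can skip a block instead of failing on an unexpected line.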

The point is, an easy way to remember what was on the line 4 lines back is to just store the last four lines in memory.
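As a rough sketch of that idea in reusable form, the generator below works on any iterable of lines (an open file or a list). The function name, the threshold and lookback parameters, and the sample data are all assumptions made for illustration; they are not part of the original answers.

```python
from collections import deque

def lines_before_high_totals(lines, threshold=40.0, lookback=4):
    """Yield (line_number, total, earlier_line) for each total above threshold."""
    prefix = 'Total is: '
    recent = deque(maxlen=lookback)  # the last `lookback` lines seen so far
    for number, line in enumerate(lines, 1):
        if line.startswith(prefix):
            try:
                total = float(line[len(prefix):])
            except ValueError:
                total = None  # text after the prefix is not a number; skip it
            # Only report when a full window of earlier lines is available.
            if total is not None and total > threshold and len(recent) == lookback:
                yield number, total, recent[0].rstrip('\n')
        recent.append(line)

# Made-up sample input shaped like the question's file.
sample = [
    "1. Hi45\n", "People: bla bla\n", "\n", "bla bla\n", "Total is: 45.5\n",
    "bla\n",
    "2. How144\n", "People: bla\n", "\n", "bla\n", "Total is: 12.0\n",
]
results = list(lines_before_high_totals(sample))
```

With a real file you would pass the open file object instead of the sample list; only the last four lines are held in memory at any time.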

3 answers

solution1
2 2015-04-10 01:30:37

solution2
1 ACCEPTED 2015-04-10 01:48:46

solution3
1 2015-04-10 13:45:27
