
Python nested loop - get next N lines

I'm new to Python and trying to do a nested loop. I have a very large file (1.1 million rows), and I'd like to use it to create a file that has each line along with the next N lines, for example with the next 3 lines:

1    2
1    3
1    4
2    3
2    4
2    5

Right now I'm just trying to get the loops working with row numbers instead of the strings, since that's easier to visualize. I came up with this code, but it's not behaving how I want it to:

with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line_a = i
        lower_bound = i + 1
        upper_bound = i + 4
        with open('C:/working_file.txt', mode='r', encoding='utf8') as g:
            for j, line in enumerate(g):
                while j >= lower_bound and j <= upper_bound:
                    line_b = j
                    j = j+1
                    print(line_a, line_b)

Instead of the output I want like above, it's giving me this:

990     991
990     992
990     993
990     994
990     992
990     993
990     994
990     993
990     994
990     994

As you can see the inner loop is iterating multiple times for each line in the outer loop. It seems like there should only be one iteration per line in the outer loop. What am I missing?

EDIT: My question was answered below, here is the exact code I ended up using:

from collections import deque
from itertools import cycle
log = open('C:/example.txt', mode='w', encoding = 'utf8') 
try:
    xrange 
except NameError: # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

for l in window(open('c:/working_file.txt', mode='r', encoding='utf8'),100):
    for a, b in l:
        print(a.strip() + '\t' + b.strip(), file=log)

Based on the window example from the old itertools docs, you can use something like:

from collections import deque
from itertools import cycle

try:
    xrange 
except NameError: # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

Demo:

>>> for l in window([1,2,3,4,5], 4):
...     for l1, l2 in l:
...         print(l1, l2)
...
1 2
1 3
1 4
2 3
2 4
2 5

So basically you can pass your file to window to get the desired result:

window(open('C:/working_file.txt', mode='r', encoding='utf8'), 4)
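
As a minimal sketch of that call, the snippet below feeds a file-like object to window and collects the tab-separated pairs. io.StringIO stands in for the real file here, and the helper name pairs_from_file is made up for illustration; it is not part of the answer above.

```python
import io
from collections import deque
from itertools import cycle


def pack(d):
    # Pair the first item in the window with each of the items after it.
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])


def window(seq, n=2):
    # Same recipe as above, written with range for Python 3.
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)


def pairs_from_file(f, n):
    # Hypothetical helper: one "line<TAB>following line" string per pair.
    return [a.strip() + '\t' + b.strip()
            for group in window(f, n)
            for a, b in group]


# io.StringIO is a stand-in for open('C:/working_file.txt', ...).
fake_file = io.StringIO('a\nb\nc\nd\ne\n')
result = pairs_from_file(fake_file, 4)
print('\n'.join(result))
```

Note that the recipe stops as soon as the input is exhausted, so the final n - 1 lines never show up as the first element of a pair. If the input has fewer than n lines, the deque is padded with None and the call to strip() would fail.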

You can do this with slices. This is easiest if you read the whole file into a list first:

with open('C:/working_file.txt', mode='r', encoding = 'utf8') as f: 
    data = f.readlines()

for i, line_a in enumerate(data):
    for j, line_b in enumerate(data[i+1:i+5], start=i+1):
        print(i, j)

When you change it to print the lines instead of the line numbers, you can drop the second enumerate and just use for line_b in data[i+1:i+5]. Note that a slice includes the item at the start index but not the item at the end index, so the end index needs to be one higher than your current upper bound.
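
For illustration, here is a sketch of the slice version that works on the line contents instead of the indices. The function name pair_with_next and the sample data are assumptions made up for this example.

```python
def pair_with_next(data, n):
    # Pair every item with up to n items that follow it.
    pairs = []
    for i, line_a in enumerate(data):
        # data[i+1:i+1+n] includes index i+1 but excludes index i+1+n,
        # and is simply shorter near the end of the list.
        for line_b in data[i + 1:i + 1 + n]:
            pairs.append((line_a, line_b))
    return pairs


sample = ['1', '2', '3', '4', '5']
pairs = pair_with_next(sample, 3)
for a, b in pairs[:6]:
    print(a + '\t' + b)
```

Because slicing past the end of a list does not raise an error, the last few items are still paired with whatever follows them without any extra bounds checks.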

Based on alko's answer, I would suggest using the window recipe unmodified:

from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

for l in window([1,2,3,4,5], 4):
    for item in l[1:]:
        print(l[0], item)

I think the easiest way to solve this problem would be to read your file into a dictionary...

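my_data = {}
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    # Map each row number to the text of that row.
    for i, line in enumerate(f):
        my_data[i] = line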

After that is done, you can do:

for x in my_data:
    for y in range(1, 4):
        if x + y in my_data:  # skip pairs that would run past the last line
            print(my_data[x], my_data[x + y])

As written, your code re-reads the entire million-line file once for every line, which means about a million passes over the file in total...

Since this is quite a big file, you might not want to load it all into memory at once. To avoid reading any line more than once, you can do the following:

  • Make a list with N elements, where N is the number of following lines to read.

    • When you read the first line, add it to the first item in the list.
    • Add the next line to the first and second items.
    • And so on for each line.
  • When an item in that list reaches a length of N, take it out and append it to the output file. Then add an empty item at the end so you still have a list of N items.

This way you only need to read each line once, and you won't have to load the whole file into memory. At most you hold N items of up to N lines each in memory.
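
The steps above could be sketched roughly as follows. This is one interpretation, not the answerer's own code: it assumes a group is complete once it holds a line plus its N following lines, it grows the list of groups as lines arrive instead of pre-filling it with empty items, and the names stream_groups, sample and groups are hypothetical.

```python
import io


def stream_groups(lines, n):
    # Yield (line, [following lines]) while reading the input only once.
    pending = []  # unfinished groups, oldest first; never more than n + 1
    for line in lines:
        line = line.rstrip('\n')
        # The new line is a "next line" for every unfinished group.
        for group in pending:
            group.append(line)
        # It also starts a group of its own.
        pending.append([line])
        # The oldest group is complete once it has the line plus n followers.
        if len(pending[0]) == n + 1:
            done = pending.pop(0)
            yield done[0], done[1:]
    # Groups still pending at end of input have fewer than n followers.
    for group in pending:
        if len(group) > 1:
            yield group[0], group[1:]


# io.StringIO is a stand-in for the real file object.
sample = io.StringIO('1\n2\n3\n4\n5\n')
groups = list(stream_groups(sample, 3))
for first, followers in groups:
    for other in followers:
        print(first + '\t' + other)
```

In real use the print call would write to the output file instead. Since each finished group is yielded immediately, only the unfinished groups stay in memory.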
