简体   繁体   中英

How to read two non-consecutive lines in a file once in python

I want to read two non-consecutive lines once in a very big file in python, that is, the expected result is the following(the number is the line number):

1 6 (the first time reads the 1st and 6th line)

2 7 (the second time reads the 2nd and 7th line)

3 8 (the third time reads the 3rd and 8th line)

4 9

5 10

6 11

7 12

8 13

9 14

10 15

..

..

..

..

.. (the last time reads the 6th from the last line and the last line)

The line number can be gotten by using the method enumerate(file object). I am a beginner for python, and just know how to read two consecutive lines once. Could you please share with me how to get the above expected result? Thanks for your time!

You can open the file multiple times. Just skip the first couple of lines in the second file handle and use zip to read them both at the same time:

with open(filename, 'r') as handle1, \
     open(filename, 'r') as handle2:
    for _ in range(5):
        handle2.readline()

    for line1, line2 in zip(handle1, handle2):
        print(line1, line2)

You'll have to from itertools import izip as zip in Python 2 to get zip to not read the whole file twice into memory.

An alternative solution would be to use a double-ended queue and keep 5 + 1 lines in memory while reading from the file:

from collections import deque

lines = deque([], maxlen=6)

with open(filename, 'r') as handle:
    for line in handle:
        lines.append(line)

        if len(lines) == lines.maxlen:
            print(lines[0], lines[-1])

This reads the file only once, but if your file is really long and your two lines are separated by more than 5, you'll have to keep that many strings in memory.

If you want single iteration over the file (read the file once), then you could write a variation on the itertools pairwise recipe, advancing the second iterator by 5 instead of 1 (using consume ). This is essentially the same as @Blender's answer that uses a deque, but a functional way about tackling the problem that may be easier to reason about (and prove that the implementation is correct without edge cases).

From python3 itertools recipe book :

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is none, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

Example solution:

def pairwise(iterable, n=1):
    """Return every pair of items, separated by n items, from iterable
    s -> (s[0], s[n]), (s[1], s[n+1]), (s[2], s[2+n]), ..."""
    assert n >= 1
    a, b = tee(iterable)
    # advance leading iterator (b) by n elements
    consume(b, n)
    return zip(a, b)

with open(filename, 'r') as f:
    for a, b in pairwise(f, 5):
        print('{}: {}'.format(a, b))

Both the collections.deque solution and this pairwise solution require storing n+1 items in memory at a time, plus the overhead for the containers. Opening the file twice with two file pointers reduces the memory to just 2 items, but may be significantly slower due to twice the number of disk I/O accesses (assuming the OS doesn't cache reads from the file). Reading from the file twice is a little riskier because the file could be modified while reading, producing inconsistent data. To be fair, modifying a file while iterating over it once or twice will produce questionable results.

It goes without saying that if you're (still) using Python 2.7, replace zip with itertools.izip , unless you want to store the entire file contents in RAM and potentially crash your program or system for large input files.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM