
Concatenate every n-th line from multiple large files in Python

Consider the following files of different sizes:

file1.txt

sad
mad
rad
cad
saf

file2.txt

er
ar
ir
lr
gr
cf

file3.txt

1
2
3
4
5
6
7
8
9

I am looking for a way to concatenate every second line from all the files so the desired output file would be:

sad
er
1
rad
ir
3
saf
gr
5
7
9

I successfully managed to do it using the following script for my test files:

import os

globalList = list()

for file in os.listdir('.'):
    if file.endswith('txt'):
        with open(file, 'r') as inf:
            l = list()
            # keep every second line (indices 0, 2, 4, ...)
            for i, line in enumerate(inf):
                if i % 2 == 0:
                    l.append(line.strip())

            globalList.append(l)

ouf = open('final.txt', 'w')

# interleave: for each index, write that line from every file that has it
for i in range(len(max(globalList, key=len))):
    for x in globalList:
        if i < len(x):
            ouf.write(x[i])
            ouf.write('\n')

ouf.close()

The above script works fine with small test files. However, when I try it with my actual files (hundreds of files with millions of lines each), my computer quickly runs out of memory and the script crashes. Is there a way to overcome this problem, i.e. to avoid storing so much information in RAM and instead write the lines directly to an output file? Thanks!

Try this code in Python 3:

script.py

from itertools import zip_longest
import glob


every_xth_line = 2
# note: on a re-run, output.txt itself would match *.txt, so write the
# output under a different name or directory if you run this repeatedly
files = [open(filename) for filename in glob.glob("*.txt")]

with open('output.txt', 'w') as f:
    trigger = 0
    for lines in zip_longest(*files, fillvalue=''):
        if not trigger:
            for line in lines:
                f.write(line)
        trigger = (trigger + 1) % every_xth_line

for file in files:
    file.close()

output.txt

sad
er
1
rad
ir
3
saf
gr
5
7
9

A file object returned by open can itself be iterated over, yielding one line at a time. zip_longest makes sure that the script will run until the longest file has been exhausted, and the fill values are simply inserted as empty strings.
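
As a minimal sketch of that behaviour, using two of the question's sample files:

from itertools import zip_longest

# a file object is a lazy iterator over its lines; nothing is read up front
with open('file1.txt') as a, open('file3.txt') as b:
    for pair in zip_longest(a, b, fillvalue=''):
        # once file1.txt runs out, its slot in the tuple is filled with ''
        print(pair)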

A trigger must be used to separate even and odd lines; a more general solution can be achieved with the modulo operation by setting every_xth_line to something else.
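
For example, with every_xth_line = 3 the trigger cycles through 0, 1, 2, 0, ... and only writes on 0, i.e. every third line. A small sketch of just the counting logic:

every_xth_line = 3
trigger = 0
for i in range(9):
    if not trigger:
        print('keep line', i)  # fires for i = 0, 3, 6
    trigger = (trigger + 1) % every_xth_line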

As for scalability:

I tried to generate large-ish files:

cat /usr/share/dict/words > file1.txt
cat /usr/share/dict/words > file2.txt
cat /usr/share/dict/words > file3.txt

After some copy-pasting to grow them:

68M Nov  1 13:45 file1.txt
68M Nov  1 13:45 file2.txt
68M Nov  1 13:45 file3.txt

Running it:

time python3 script.py
4.31user 0.14system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 9828maxresident)k
0inputs+206312outputs (0major+1146minor)pagefaults 0swaps

The result:

101M Nov  1 13:46 output.txt

I believe something like this is what you want. Note that I don't store arrays of lines, but lazily read a line when I need one. It helps to save memory.

import os


# note: 'final.txt' also ends with 'txt', so exclude it on re-runs
files = [open(file) for file in os.listdir('.') if file.endswith('txt')]
with open('final.txt', 'w') as f:
    while files:
        for file in list(files):  # iterate over a copy: exhausted files get removed
            line = next(file, None)  # lazily pull the next line
            if line is None:
                file.close()
                files.remove(file)
                continue
            f.write(line)
            next(file, None)  # discard the following line: keep every second one
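
The two-argument form of next is what keeps this loop simple: it returns a default value instead of raising StopIteration. A quick illustration:

it = iter(['a\n', 'b\n'])
print(next(it, None))  # 'a\n'
print(next(it, None))  # 'b\n'
print(next(it, None))  # None: the iterator is exhausted, no exception raised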

Try reading the lines one at a time. If we could figure out how to keep the or from short-circuiting, we could probably get by with None as the return value of get_odd.

#!/usr/bin/env python3

def get_odd(f):
    # print the current line (the 1st, 3rd, 5th, ... of the file), if any
    x = f.readline().strip()
    if x: print(x)
    # consume and return the following line; "" once the file is exhausted
    return f.readline() or ""

with open("file1.txt", 'r') as x:
    with open("file2.txt", 'r') as y:
        with open("file3.txt", 'r') as z:
            # keep looping until a full pass yields only empty strings
            while ("" != (get_odd(x) + get_odd(y) + get_odd(z))):
                pass
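
The three nested with statements fix the number of files at three. A sketch of the same idea for an arbitrary list of files, using contextlib.ExitStack (the file names here are just the question's samples):

#!/usr/bin/env python3
from contextlib import ExitStack

def get_odd(f):
    x = f.readline().strip()
    if x: print(x)
    return f.readline() or ""

names = ["file1.txt", "file2.txt", "file3.txt"]
with ExitStack() as stack:
    handles = [stack.enter_context(open(n)) for n in names]
    # stop once a full pass over every file yields only empty strings
    while "" != "".join(get_odd(h) for h in handles):
        pass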

I would create one generator for the line numbers I want (0, 2, 4, ...). Then get the lines I want and write them to the file. Here's the code:

def numberLine():
    # yield 0, 2, 4, ... forever
    number = -2
    while True:
        number += 2
        yield number

def writeNewFile(files):
    with open("newFile.txt", 'w') as theFile:
        for line in numberLine():
            if files:
                # iterate over a copy so exhausted files can be removed safely
                for file in list(files):
                    try:
                        # note: this reopens and rereads the whole file for
                        # every line, so it is simple but not cheap
                        with open(file) as openFile:
                            theFile.write(openFile.readlines()[line])
                    except IndexError:
                        files.remove(file)
                        continue
            else:
                break

Now all you need to do is pass the list of files into the writeNewFile function: writeNewFile([file for file in os.listdir() if file.endswith('txt')])
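
Incidentally, the hand-rolled numberLine generator is equivalent to itertools.count with a step of 2:

from itertools import count

# count(0, 2) yields 0, 2, 4, ... exactly like numberLine()
for line in count(0, 2):
    print(line)
    if line >= 6:
        break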

This script processes an arbitrary number of files and prints every second line of each file until all files have reached EOF.

#!/usr/bin/env python

import sys

def every_second(files):
    fds = [open(f, 'r') for f in files]

    i = 0
    while fds:
        # one round: read a single line from every still-open file, in order
        for fd in list(fds):
            l = fd.readline()
            if l == "":
                # EOF: close this file and stop polling it
                fd.close()
                fds.remove(fd)
            elif i % 2 == 0:
                sys.stdout.write(l)
        i += 1

if __name__ == '__main__':
    every_second(sys.argv[1:])
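
Assuming the script is saved as every_second.py (the name is just an assumption), it can be run against the sample files with stdout redirected into the desired output file:

python every_second.py file1.txt file2.txt file3.txt > final.txt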
