Consider the following files of different size:
file1.txt
sad
mad
rad
cad
saf
file2.txt
er
ar
ir
lr
gr
cf
file3.txt
1
2
3
4
5
6
7
8
9
I am looking for a way to concatenate every second line from all the files so the desired output file would be:
sad
er
1
rad
ir
3
saf
gr
5
7
9
I managed to do this for my test files with the following script:
import os
globalList = list()
for file in os.listdir('.'):
    if file.endswith('txt'):
        with open(file, 'r') as inf:
            l = list()
            n=0
            for i, line in enumerate(inf):
                if i == n:
                    nline=line.strip()
                    l.append(nline)
                    n+=2
            globalList.append(l)
            inf.close()
ouf = open('final.txt', 'w')
for i in range(len(max(globalList, key=len))):
    for x in globalList:
        if i < len(x):
            ouf.write(x[i])
            ouf.write('\n')
        else:
            pass
ouf.close()
The script above works fine with small test files. However, when I run it on my actual files (hundreds of files with millions of lines), my computer quickly runs out of memory and the script crashes. Is there a way to overcome this problem, i.e. to avoid holding so much data in RAM and instead write the lines directly to the output file? Thanks!
Try this code in python3:
from itertools import zip_longest
import glob
every_xth_line = 2
files = [open(filename) for filename in glob.glob("*.txt")]
with open('output.txt', 'w') as f:
    trigger = 0
    for lines in zip_longest(*files, fillvalue=''):
        if not trigger:
            for line in lines:
                f.write(line)
        trigger = (trigger + 1) % every_xth_line
sad
er
1
rad
ir
3
saf
gr
5
7
9
The file object returned by open can itself be iterated over, line by line. zip_longest makes sure that the script keeps running until the longest file has been exhausted; for files that run out earlier, the fill value (an empty string) is inserted instead.
A trigger is used to separate the even-numbered rows from the odd-numbered ones. A more general solution falls out of the simple modulo operation: set every_xth_line to something other than 2 to keep, say, every third or fourth line.
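The same idea can be wrapped in a function. What follows is only a minimal sketch under a few assumptions: the name interleave_every_nth and its parameters are made up for illustration, contextlib.ExitStack is used so the input files get closed, and a newline is appended when a file's last line lacks one.

```python
from contextlib import ExitStack
from itertools import zip_longest


def interleave_every_nth(paths, out_path, step=2):
    """Write lines 0, step, 2*step, ... of each input, interleaved row by row."""
    with ExitStack() as stack:
        # ExitStack closes every input, even if an error occurs part-way.
        files = [stack.enter_context(open(p)) for p in paths]
        out = stack.enter_context(open(out_path, "w"))
        # zip_longest pulls one line per file per iteration, so only one
        # "row" of lines is held in memory at a time.
        for row_number, row in enumerate(zip_longest(*files, fillvalue="")):
            if row_number % step:
                continue  # not a row we want to keep
            for line in row:
                if line:  # "" marks a file that is already exhausted
                    # Guard against a last line with no trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
```

A call such as interleave_every_nth(sorted(glob.glob("*.txt")), "output.txt") would mirror the script above (with import glob). Note that all inputs stay open at once, so with hundreds of files the per-process limit on open files may need to be checked.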
As for scalability:
I tried to generate large-ish files:
cat /usr/share/dict/words > file1.txt
cat /usr/share/dict/words > file2.txt
cat /usr/share/dict/words > file3.txt
After some copy paste:
68M Nov 1 13:45 file.txt
68M Nov 1 13:45 file2.txt
68M Nov 1 13:45 file3.txt
Running it:
time python3 script.py
4.31user 0.14system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 9828maxresident)k
0inputs+206312outputs (0major+1146minor)pagefaults 0swaps
The result:
101M Nov 1 13:46 output.txt
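The time output above reports peak resident memory for the whole process. As a rough complementary check from inside Python, the standard tracemalloc module can compare loading a file with readlines against streaming it. This is a sketch only: the helper names and the 50,000-line sample file are assumptions for illustration, and tracemalloc counts Python-level allocations, so its numbers are not directly comparable to maxresident.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak traced allocation (in bytes) while func runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def count_by_loading(path):
    # Materialises every line in a list before slicing.
    with open(path) as f:
        return len(f.readlines()[::2])


def count_by_streaming(path):
    # Holds only the current line at any moment.
    with open(path) as f:
        return sum(1 for i, _ in enumerate(f) if i % 2 == 0)


# Hypothetical sample data: 50,000 short lines in a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"word{i}\n" for i in range(50_000))

list_peak = peak_bytes(count_by_loading, tmp.name)
stream_peak = peak_bytes(count_by_streaming, tmp.name)
os.remove(tmp.name)
print(f"readlines: {list_peak:,} bytes, streaming: {stream_peak:,} bytes")
```

The streaming figure should stay roughly constant as the file grows, while the readlines figure grows with the number of lines.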
I believe something like this is what you want. Note that I don't store lists of lines but lazily read a line only when I need one, which saves memory.
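import os
# File objects are lazy line iterators, so nothing is read up front.
# Skip final.txt in case it is left over from an earlier run.
files = [open(name) for name in sorted(os.listdir('.'))
         if name.endswith('txt') and name != 'final.txt']
with open('final.txt', 'w') as f:
    while files:
        for file in list(files):  # copy, because we remove finished files
            line = next(file, None)  # the line to keep
            if line is None:  # this file is exhausted
                file.close()
                files.remove(file)
                continue
            f.write(line if line.endswith('\n') else line + '\n')
            next(file, None)  # skip the following line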
Try reading the lines one at a time. get_odd returns a string so that the three results can be concatenated in the loop condition. If we could chain the calls with or without it short-circuiting, get_odd could probably just return None.
#!/usr/bin/env python3
def get_odd(f):
    x = f.readline().strip()
    if x: print(x)
    return f.readline() or ""
with open("file1.txt", 'r') as x:
    with open("file2.txt", 'r') as y:
        with open("file3.txt", 'r') as z:
            while ("" != (get_odd(x) + get_odd(y) + get_odd(z))):
                pass
I would create a generator that yields the indexes of the lines to keep (0, 2, 4, ...), then fetch those lines from each file and write them to the output file. Here's the code:
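def numberLine():
    # Yields the indexes of the lines to keep: 0, 2, 4, ...
    number = -2
    while True:
        number += 2
        yield number

def writeNewFile(files):
    with open("newFile.txt", 'w') as theFile:
        for line in numberLine():
            if files:
                # Iterate over a copy so that removing a finished file
                # does not make the loop skip the next one.
                for file in list(files):
                    try:
                        with open(file) as openFile:
                            # Note: this re-reads the whole file on every pass.
                            theFile.write(openFile.readlines()[line])
                    except IndexError:
                        files.remove(file)
                        continue
            else:
                break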
Now all you need to do is pass the list of files to the writeNewFile function (this needs import os at the top of the script):
writeNewFile([file for file in os.listdir() if file.endswith('txt')])
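Because the question is about memory, a variant of this generator idea that never calls readlines may be worth considering. The following is a hedged sketch, not the code above: every_other and write_interleaved are hypothetical names, and itertools.islice is swapped in to step through each file two lines at a time while the file stays open.

```python
from itertools import islice


def every_other(path):
    """Yield lines 0, 2, 4, ... of one file, reading it lazily."""
    with open(path) as handle:
        for line in islice(handle, 0, None, 2):
            # Make sure a last line without a newline still ends with one.
            yield line if line.endswith("\n") else line + "\n"


def write_interleaved(paths, out_path):
    # One generator per input; each remembers its own position in its file.
    active = [every_other(p) for p in paths]
    with open(out_path, "w") as out:
        while active:
            still_active = []
            for gen in active:
                line = next(gen, None)
                if line is not None:  # None means this input is finished
                    out.write(line)
                    still_active.append(gen)
            active = still_active
```

Each file is read exactly once, and a generator closes its file as soon as it runs out of lines. It could be called as write_interleaved(sorted(name for name in os.listdir() if name.endswith('txt')), "newFile.txt"), assuming the output file is not itself in that list.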
This script processes an arbitrary number of files and prints every second line of each file until all files have reached EOF.
#!/usr/bin/env python
import sys
def every_second(files):
    fds = [open(f,'r') for f in files]
    i = 0      # index of the current row of lines
    end = 0    # number of files that have reached EOF
    num = len(fds)
    while end < num:
        for fd in fds:
            try:
                l = fd.readline()
            except ValueError:
                # fd was already closed after reaching EOF
                continue
            if l == "":
                end += 1
                fd.close()
            elif i%2 == 0:
                sys.stdout.write(l)
        i += 1
if __name__ == '__main__':
    every_second(sys.argv[1:])