
Concatenate every n-th line from multiple large files in python

Consider the following files of different sizes:

file1.txt

sad
mad
rad
cad
saf

file2.txt

er
ar
ir
lr
gr
cf

file3.txt

1
2
3
4
5
6
7
8
9

I am looking for a way to concatenate every second line from all the files, so the desired output file would be:

sad
er
1
rad
ir
3
saf
gr
5
7
9

I successfully managed to do it using the following script on my test files:

import os

globalList = list()

for file in os.listdir('.'):
    if file.endswith('txt'):
        with open(file, 'r') as inf:
            # collect every second line (indices 0, 2, 4, ...) of this file
            l = list()
            n = 0
            for i, line in enumerate(inf):
                if i == n:
                    nline = line.strip()
                    l.append(nline)
                    n += 2

            globalList.append(l)

ouf = open('final.txt', 'w')

# interleave the collected lines, skipping lists that have run out
for i in range(len(max(globalList, key=len))):
    for x in globalList:
        if i < len(x):
            ouf.write(x[i])
            ouf.write('\n')

ouf.close()

The above script works fine with small test files. However, when I try it with my actual files (hundreds of files with millions of lines), my computer quickly runs out of memory and the script crashes. Is there a way to overcome this problem, i.e. to avoid storing so much information in RAM and somehow write the lines directly to an output file? Thanks!

Try this code in Python 3:

script.py

from itertools import zip_longest
import glob


every_xth_line = 2
files = [open(filename) for filename in glob.glob("*.txt")]

with open('output.txt', 'w') as f:
    trigger = 0
    for lines in zip_longest(*files, fillvalue=''):
        # write only every every_xth_line-th group of lines
        if not trigger:
            for line in lines:
                f.write(line)
        trigger = (trigger + 1) % every_xth_line

output.txt

sad
er
1
rad
ir
3
saf
gr
5
7
9

The file object returned by open can itself be iterated over line by line. zip_longest makes sure that the script runs until the longest file has been exhausted, and the fill values are simply inserted as empty strings.

A trigger must be used to pick out every second group of lines; a more general solution can be achieved with the simple modulo operation by setting every_xth_line to something else.
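
For reference, here is a minimal sketch of how zip_longest pads the shorter iterables with the fill value, using short in-memory lists in place of real file objects (illustration only, not part of the answer's script):

from itertools import zip_longest

# Two unequal "files", represented as lists of lines for illustration.
a = ['sad\n', 'mad\n', 'rad\n']
b = ['1\n', '2\n', '3\n', '4\n', '5\n']

for lines in zip_longest(a, b, fillvalue=''):
    print(lines)
# ('sad\n', '1\n')
# ('mad\n', '2\n')
# ('rad\n', '3\n')
# ('', '4\n')
# ('', '5\n')

Writing an empty string to the output file is a no-op, which is why the padding does not corrupt the result.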

As for scalability:

I tried to generate large-ish files:

cat /usr/share/dict/words > file1.txt
cat /usr/share/dict/words > file2.txt
cat /usr/share/dict/words > file3.txt

After some copy-pasting:

68M Nov  1 13:45 file.txt
68M Nov  1 13:45 file2.txt
68M Nov  1 13:45 file3.txt

Running it:

time python3 script.py
4.31user 0.14system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 9828maxresident)k
0inputs+206312outputs (0major+1146minor)pagefaults 0swaps

The result:

101M Nov  1 13:46 output.txt

I believe something like this is what you want. Note that I don't store arrays of lines but lazily read a line when I need one. It helps to save memory.

import os


files = [open(file) for file in os.listdir('.') if file.endswith('txt')]
with open('final.txt', 'w') as f:
    while 1:
        for file in files:
            try:
                f.write(next(file))    # lazily pull the next line from this file
            except StopIteration:
                break
            if YourCounterFunction:    # placeholder: decide when to skip or stop
                break
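
The snippet above is only a sketch: YourCounterFunction is a placeholder and the loop as written never skips the unwanted lines. One possible completion that keeps the same lazy, line-at-a-time idea (my assumption, not the original answer's code) could look like this:

import os

# Round-robin over the files, writing every second line of each one,
# without ever holding a whole file in memory.
# Excluding 'final.txt' avoids re-reading the output of a previous run.
files = [open(name) for name in os.listdir('.')
         if name.endswith('txt') and name != 'final.txt']

with open('final.txt', 'w') as out:
    while files:
        exhausted = []
        for fh in files:
            line = fh.readline()
            if not line:              # this file has reached EOF
                exhausted.append(fh)
                continue
            out.write(line)
            fh.readline()             # discard the following line (skip every 2nd)
        for fh in exhausted:
            fh.close()
            files.remove(fh)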

Try reading the lines one at a time. If we could figure out how to avoid short-circuiting the or, we could probably get by with None as the return value of get_odd.

#!/usr/bin/env python3

def get_odd(f):
    # print the current line (if any), then consume and return the line after it
    x = f.readline().strip()
    if x: print(x)
    return f.readline() or ""

with open("file1.txt", 'r') as x:
    with open("file2.txt", 'r') as y:
        with open("file3.txt", 'r') as z:
            # keep going until all three files return empty strings
            while ("" != (get_odd(x) + get_odd(y) + get_odd(z))):
                pass
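
The remark about short-circuiting can be seen with a tiny illustration (the noisy helper below is purely hypothetical): with +, every operand is evaluated, so every file is read each round; with or, evaluation stops at the first non-empty result and the remaining files would silently be skipped.

def noisy(name, value):
    print("reading", name)
    return value

# With +, all three calls run:
_ = noisy("x", "a") + noisy("y", "b") + noisy("z", "c")
# prints: reading x / reading y / reading z

# With or, evaluation stops at the first truthy result:
_ = noisy("x", "a") or noisy("y", "b") or noisy("z", "c")
# prints only: reading x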

I would create one generator for the odd-numbered lines (indices 0, 2, 4, ...). Then get the lines I want and write them to the file. Here's the code:

def numberLine():
    # yield the line indices we want: 0, 2, 4, ...
    number = -2
    while True:
        number += 2
        yield number

def writeNewFile(files):
    with open("newFile.txt", 'w') as theFile:
        for line in numberLine():
            if files:
                for file in list(files):      # iterate over a copy, since we remove below
                    try:
                        # re-opens the file and reads all of its lines on every pass
                        with open(file) as openFile:
                            theFile.write(openFile.readlines()[line])
                    except IndexError:
                        files.remove(file)    # this file has no line at this index
            else:
                break

Now all you need to do is pass the list of files into the writeNewFile function:

writeNewFile([file for file in os.listdir() if file.endswith('txt')])

This script processes an arbitrary number of files and prints every second line of each file until all files have reached EOF.

#!/usr/bin/env python

import sys

def every_second(files):
    fds = [open(f, 'r') for f in files]

    i = 0
    end = 0                      # how many files have reached EOF
    num = len(fds)
    while end < num:
        for fd in fds:
            try:
                l = fd.readline()
            except ValueError:   # file was already closed; skip it
                continue
            if l == "":
                end += 1
                fd.close()
            elif i % 2 == 0:     # keep only every second line
                sys.stdout.write(l)
        i += 1

if __name__ == '__main__':
    every_second(sys.argv[1:])
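
Presumably the script is saved under some name and given the input file names as command-line arguments; it can also be exercised directly from Python, e.g. with the sample files from the question (assuming they are in the current directory):

# Prints the interleaved every-second lines of the three sample files to stdout.
every_second(['file1.txt', 'file2.txt', 'file3.txt'])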
