
Concatenate every n-th line from multiple large files in python

Consider the following files of different sizes:

file1.txt

sad
mad
rad
cad
saf

file2.txt

er
ar
ir
lr
gr
cf

file3.txt

1
2
3
4
5
6
7
8
9

I am looking for a way to concatenate every second line from all the files, so the desired output file would be:

sad
er
1
rad
ir
3
saf
gr
5
7
9

I successfully managed to do it using the following script on my test files:

import os

globalList = list()

for file in os.listdir('.'):
    if file.endswith('txt'):
        with open(file, 'r') as inf:
            # collect every second line (indices 0, 2, 4, ...) of this file
            l = list()
            n = 0
            for i, line in enumerate(inf):
                if i == n:
                    nline = line.strip()
                    l.append(nline)
                    n += 2

            globalList.append(l)

ouf = open('final.txt', 'w')

# interleave the collected lines, skipping lists that have run out
for i in range(len(max(globalList, key=len))):
    for x in globalList:
        if i < len(x):
            ouf.write(x[i])
            ouf.write('\n')

ouf.close()

The above script works fine with small test files. However, when I try it with my actual files (hundreds of files with millions of lines), my computer quickly runs out of memory and the script crashes. Is there a way to overcome this problem, i.e. to avoid storing so much information in RAM and somehow write the lines directly to an output file? Thanks!

Try this code in Python 3:

script.py

from itertools import zip_longest
import glob


every_xth_line = 2
files = [open(filename) for filename in glob.glob("*.txt")]

with open('output.txt', 'w') as f:
    trigger = 0
    for lines in zip_longest(*files, fillvalue=''):
        # write only every every_xth_line-th group of lines
        if not trigger:
            for line in lines:
                f.write(line)
        trigger = (trigger + 1) % every_xth_line

output.txt

sad
er
1
rad
ir
3
saf
gr
5
7
9

The file object returned by open can itself be iterated over line by line. zip_longest makes sure that the script runs until the longest file has been exhausted, and the fill values are simply inserted as empty strings.

A trigger must be used to pick out every second group of lines; a more general solution can be achieved with the simple modulo operation by setting every_xth_line to something else.
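
For reference, here is a minimal sketch of how zip_longest pads the shorter iterables with the fill value, using short in-memory lists in place of real file objects (illustration only, not part of the answer's script):

from itertools import zip_longest

# Two unequal "files", represented as lists of lines for illustration.
a = ['sad\n', 'mad\n', 'rad\n']
b = ['1\n', '2\n', '3\n', '4\n', '5\n']

for lines in zip_longest(a, b, fillvalue=''):
    print(lines)
# ('sad\n', '1\n')
# ('mad\n', '2\n')
# ('rad\n', '3\n')
# ('', '4\n')
# ('', '5\n')

Writing an empty string to the output file is a no-op, which is why the padding does not corrupt the result.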

As for scalability:

I tried to generate large-ish files:

cat /usr/share/dict/words > file1.txt
cat /usr/share/dict/words > file2.txt
cat /usr/share/dict/words > file3.txt

After some copy-pasting:

68M Nov  1 13:45 file.txt
68M Nov  1 13:45 file2.txt
68M Nov  1 13:45 file3.txt

Running it:

time python3 script.py
4.31user 0.14system 0:04.46elapsed 99%CPU (0avgtext+0avgdata 9828maxresident)k
0inputs+206312outputs (0major+1146minor)pagefaults 0swaps

The result:

101M Nov  1 13:46 output.txt

I believe something like this is what you want. Note that I don't store arrays of lines but lazily read a line when I need one. It helps to save memory.

import os


files = [open(file) for file in os.listdir('.') if file.endswith('txt')]
with open('final.txt', 'w') as f:
    while 1:
        for file in files:
            try:
                f.write(next(file))    # lazily pull the next line from this file
            except StopIteration:
                break
            if YourCounterFunction:    # placeholder: decide when to skip or stop
                break
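
The snippet above is only a sketch: YourCounterFunction is a placeholder and the loop as written never skips the unwanted lines. One possible completion that keeps the same lazy, line-at-a-time idea (my assumption, not the original answer's code) could look like this:

import os

# Round-robin over the files, writing every second line of each one,
# without ever holding a whole file in memory.
# Excluding 'final.txt' avoids re-reading the output of a previous run.
files = [open(name) for name in os.listdir('.')
         if name.endswith('txt') and name != 'final.txt']

with open('final.txt', 'w') as out:
    while files:
        exhausted = []
        for fh in files:
            line = fh.readline()
            if not line:              # this file has reached EOF
                exhausted.append(fh)
                continue
            out.write(line)
            fh.readline()             # discard the following line (skip every 2nd)
        for fh in exhausted:
            fh.close()
            files.remove(fh)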

Try reading the lines one at a time. If we could figure out how to avoid short-circuiting the or, we could probably get by with None as the return value of get_odd.

#!/usr/bin/env python3

def get_odd(f):
    # print the current line (if any), then consume and return the line after it
    x = f.readline().strip()
    if x: print(x)
    return f.readline() or ""

with open("file1.txt", 'r') as x:
    with open("file2.txt", 'r') as y:
        with open("file3.txt", 'r') as z:
            # keep going until all three files return empty strings
            while ("" != (get_odd(x) + get_odd(y) + get_odd(z))):
                pass
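
The remark about short-circuiting can be seen with a tiny illustration (the noisy helper below is purely hypothetical): with +, every operand is evaluated, so every file is read each round; with or, evaluation stops at the first non-empty result and the remaining files would silently be skipped.

def noisy(name, value):
    print("reading", name)
    return value

# With +, all three calls run:
_ = noisy("x", "a") + noisy("y", "b") + noisy("z", "c")
# prints: reading x / reading y / reading z

# With or, evaluation stops at the first truthy result:
_ = noisy("x", "a") or noisy("y", "b") or noisy("z", "c")
# prints only: reading x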

I would create one generator for the odd-numbered lines (indices 0, 2, 4, ...). Then get the lines I want and write them to the file. Here's the code:

def numberLine():
    # yield the line indices we want: 0, 2, 4, ...
    number = -2
    while True:
        number += 2
        yield number

def writeNewFile(files):
    with open("newFile.txt", 'w') as theFile:
        for line in numberLine():
            if files:
                for file in list(files):      # iterate over a copy, since we remove below
                    try:
                        # re-opens the file and reads all of its lines on every pass
                        with open(file) as openFile:
                            theFile.write(openFile.readlines()[line])
                    except IndexError:
                        files.remove(file)    # this file has no line at this index
            else:
                break

Now all you need to do is pass the list of files into the writeNewFile function:

writeNewFile([file for file in os.listdir() if file.endswith('txt')])

This script processes an arbitrary number of files and prints every second line of each file until all files have reached EOF.

#!/usr/bin/env python

import sys

def every_second(files):
    fds = [open(f, 'r') for f in files]

    i = 0
    end = 0                      # how many files have reached EOF
    num = len(fds)
    while end < num:
        for fd in fds:
            try:
                l = fd.readline()
            except ValueError:   # file was already closed; skip it
                continue
            if l == "":
                end += 1
                fd.close()
            elif i % 2 == 0:     # keep only every second line
                sys.stdout.write(l)
        i += 1

if __name__ == '__main__':
    every_second(sys.argv[1:])
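
Presumably the script is saved under some name and given the input file names as command-line arguments; it can also be exercised directly from Python, e.g. with the sample files from the question (assuming they are in the current directory):

# Prints the interleaved every-second lines of the three sample files to stdout.
every_second(['file1.txt', 'file2.txt', 'file3.txt'])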
