
python script to concatenate all the files in the directory into one file

I have written the following script to concatenate all the files in the directory into one single file.

Can this be optimized, in terms of:

  1. idiomatic Python

  2. time

Here is the snippet:

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

Use shutil.copyfileobj to copy data:

import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

shutil reads from the readfile object in chunks, writing them to the outfile file object directly. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.

Use the same mode for both reading and writing; this is especially important when using Python 3. I've used binary mode for both here.
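In Python 3 a mode mismatch fails immediately, because a text-mode file refuses bytes. A minimal sketch using in-memory streams as stand-ins for open files:

```python
import io

text_out = io.StringIO()      # stands in for a file opened with 'w'
binary_chunk = b"some bytes"  # what a file opened with 'rb' returns

try:
    text_out.write(binary_chunk)  # bytes into a text stream
except TypeError:
    print("Python 3 rejects writing bytes to a text-mode file")
```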

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs.

shutil.copyfileobj(readfile, outfile)

I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of ~2.6 GB. In both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant:

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The outfile had a final size of 2620 MB in normal read mode vs. 2578 MB in binary read mode.
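The numbers above came from large real files; for reference, a small self-contained sketch of how such a timing run could be set up (the file names and sizes here are toy stand-ins, not the benchmark data):

```python
import os
import shutil
import tempfile
import time

def concat_with_shutil(filenames, outfilename, binary=True):
    # Concatenate files with shutil.copyfileobj, using matching
    # read/write modes as recommended above.
    read_mode, write_mode = ('rb', 'wb') if binary else ('r', 'w')
    with open(outfilename, write_mode) as outfile:
        for fname in filenames:
            with open(fname, read_mode) as readfile:
                shutil.copyfileobj(readfile, outfile)

# Toy stand-ins for the large .txt files used in the benchmark.
tmpdir = tempfile.mkdtemp()
filenames = []
for i in range(3):
    path = os.path.join(tmpdir, 'sample_%d.txt' % i)
    with open(path, 'w') as f:
        f.write('line\n' * 1000)
    filenames.append(path)

outpath = os.path.join(tmpdir, 'all.txt')
start = time.time()
concat_with_shutil(filenames, outpath, binary=True)
print('binary mode took %.4f seconds' % (time.time() - start))
```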

No need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

The fileinput module provides a natural way to iterate over multiple files:

import fileinput, glob

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)

I was curious to check more on performance, so I used the answers of Martijn Pieters and Stephen Miller.

I tried binary and text modes, both with and without shutil. I merged 270 files.

Text mode -

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

It looks like shutil performs about the same in both modes, while text mode is faster than binary.

OS: Mac OS 10.14 Mojave. MacBook Air 2017.

You can iterate over the lines of a file object directly, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
