
Python script to concatenate all the files in the directory into one file

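I wrote the following script to concatenate all the files in the directory into one single file.

Can this be optimized, in terms of

  1. idiomatic Python

  2. time

Here is the snippet: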

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

Use shutil.copyfileobj to copy the data:

import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

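shutil reads from the readfile object in chunks and writes them straight to the outfile object. Do not use readline() or iterate over the file line by line here, because you do not need the overhead of finding line endings.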
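To make the "reads in chunks" idea concrete, here is a minimal sketch of a chunked copy loop. The helper name copy_in_chunks and the 64 KiB default are assumptions for illustration, not shutil's actual source; in real code you would just call shutil.copyfileobj.

```python
import io

def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in fixed-size chunks (hypothetical helper, not shutil's code)."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty result means end of file
            break
        dst.write(chunk)

# In-memory streams stand in for files opened in binary mode.
source = io.BytesIO(b"line one\nline two\n" * 1000)
target = io.BytesIO()
copy_in_chunks(source, target, chunk_size=256)
```

Only one chunk is held in memory at a time, and no line endings are searched for, which is the point made above.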

Use the same mode for both reading and writing; this is especially important when using Python 3. I used binary mode for both here.
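A small sketch of why the modes have to match on Python 3. The in-memory streams are stand-ins I picked for files opened with 'wb' and 'w'; the variable names are made up for the example.

```python
import io

binary_out = io.BytesIO()  # stands in for open(path, 'wb')
text_out = io.StringIO()   # stands in for open(path, 'w')

errors = []
# Mismatched writes: str into a binary stream, bytes into a text stream.
for stream, payload in [(binary_out, "text data"), (text_out, b"binary data")]:
    try:
        stream.write(payload)
    except TypeError as exc:
        errors.append(type(exc).__name__)

# Matching writes succeed.
binary_out.write(b"binary data")
text_out.write("text data")
```

Both mismatched writes raise TypeError on Python 3, which is what would happen to a script that opens the output with 'wb' but reads the inputs with 'r'.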

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs.

shutil.copyfileobj(readfile, outfile)

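I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal (text) read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: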

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The outfile had a final size of 2620 MB in normal read mode vs. 2578 MB in binary read mode.

No need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

The fileinput module provides a natural way to iterate over multiple files:

import fileinput, glob

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
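For completeness, a self-contained sketch of the fileinput approach. The function name concat_with_fileinput and the temporary files are assumptions for the demo; the guard for an empty file list is there because fileinput falls back to standard input when it is given no files.

```python
import fileinput
import glob
import os
import tempfile

def concat_with_fileinput(pattern, outfilename):
    """Concatenate files matching pattern into outfilename, line by line."""
    # Exclude the output file so it is never fed back into itself.
    sources = sorted(f for f in glob.glob(pattern) if f != outfilename)
    with open(outfilename, "w") as outfile:
        if sources:  # with no files, fileinput would read from stdin
            with fileinput.input(files=sources) as lines:
                for line in lines:
                    outfile.write(line)
    return sources

# Demo on two small temporary files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name, text in [("a.txt", "alpha\n"), ("b.txt", "beta\n")]:
        with open(os.path.join(tmpdir, name), "w") as f:
            f.write(text)
    out_path = os.path.join(tmpdir, "all.txt")
    used = concat_with_fileinput(os.path.join(tmpdir, "*.txt"), out_path)
    with open(out_path) as f:
        combined = f.read()
```

Note that this works line by line in text mode, so it trades some speed for convenience compared with a chunked binary copy.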

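I was curious to look further into performance, so I used the answers of Martijn Pieters and Stephen Miller.

I tried binary and text modes, with shutil and without shutil. I tried to merge 270 files.

Text mode -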

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

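It looks like shutil performs about the same in both modes, while the plain read/write version is faster in text mode than in binary mode.

OS: Mac OS 10.14 Mojave. MacBook Air 2017.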
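If you want to repeat this kind of comparison on your own files, here is a rough timing harness. It is only a sketch: the names concat_copyfileobj, concat_read_write and best_time are made up for this example, the demo files are tiny, and real numbers will depend heavily on disk, OS cache and file sizes.

```python
import glob
import os
import shutil
import tempfile
import time

def concat_copyfileobj(pattern, outfilename):
    with open(outfilename, "wb") as outfile:
        for filename in sorted(glob.glob(pattern)):
            if filename == outfilename:
                continue  # don't copy the output into itself
            with open(filename, "rb") as readfile:
                shutil.copyfileobj(readfile, outfile)

def concat_read_write(pattern, outfilename):
    with open(outfilename, "wb") as outfile:
        for filename in sorted(glob.glob(pattern)):
            if filename == outfilename:
                continue
            with open(filename, "rb") as readfile:
                outfile.write(readfile.read())

def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

with tempfile.TemporaryDirectory() as tmpdir:
    # Five small sample inputs; replace with real data for meaningful timings.
    for i in range(5):
        with open(os.path.join(tmpdir, "part%d.txt" % i), "wb") as f:
            f.write(b"x" * 100000)
    pattern = os.path.join(tmpdir, "*.txt")
    out_path = os.path.join(tmpdir, "all.txt")
    results, sizes = {}, {}
    for func in (concat_copyfileobj, concat_read_write):
        results[func.__name__] = best_time(func, pattern, out_path)
        sizes[func.__name__] = os.path.getsize(out_path)

for name, seconds in results.items():
    print("%s: %.4f s" % (name, seconds))
```

Taking the minimum over a few repeats reduces noise from other processes, but a single machine and a warm file cache can still skew results, so treat any ranking as indicative only.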

You can iterate over the lines of a file object directly, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
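Put together as a complete function, that might look like the sketch below. The name concat_by_line is an assumption, and adding a newline after a file that lacks a trailing one is my own addition so that the last line of one file does not run into the first line of the next.

```python
import os
import tempfile

def concat_by_line(filenames, outfilename):
    """Concatenate text files line by line, holding one line in memory at a time."""
    with open(outfilename, "w") as outfile:
        for fname in filenames:
            with open(fname, "r") as readfile:
                last = ""
                for line in readfile:
                    outfile.write(line)
                    last = line
                # Assumed extra: terminate files that lack a final newline.
                if last and not last.endswith("\n"):
                    outfile.write("\n")

with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, text in [("a.txt", "one\ntwo"), ("b.txt", "three\n")]:
        path = os.path.join(tmpdir, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)
    out_path = os.path.join(tmpdir, "all.out")
    concat_by_line(paths, out_path)
    with open(out_path) as f:
        merged = f.read()
```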

