
Python script to concatenate all the files in the directory into one file

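I have written the following script to concatenate all the files in a directory into one single file.

Can this be optimized with respect to:

  1. idiomatic Python

  2. time

Here is the snippet: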

import time, glob

outfilename = 'all_' + str((int(time.time()))) + ".txt"

filenames = glob.glob('*.txt')

with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")

Use shutil.copyfileobj to copy the data:

import shutil

with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)

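shutil reads from the readfile object in chunks and writes them straight to the outfile object. Don't use readline() or iterate over the file line by line, because you don't need the overhead of finding line endings.

Use the same mode for both reading and writing; this is especially important when using Python 3, where text-mode files deal in str and binary-mode files in bytes, so mixing the two raises a TypeError. I've used binary mode for both here.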
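As a sketch of the same approach wrapped in a reusable function, assuming Python 3: the name `concat_files` and the `chunk_size` parameter are illustrative and not part of the original answer. `shutil.copyfileobj` takes an optional buffer length as its third argument, which is what `chunk_size` is passed to.

```python
import glob
import os
import shutil


def concat_files(pattern, out_path, chunk_size=1024 * 1024):
    """Concatenate every file matching `pattern` into `out_path` (hypothetical helper)."""
    out_abs = os.path.abspath(out_path)
    # Resolve the inputs up front, skipping the output file itself;
    # sorting gives a stable order, since glob makes no ordering promise.
    sources = sorted(p for p in glob.glob(pattern)
                     if os.path.abspath(p) != out_abs)
    with open(out_path, 'wb') as outfile:
        for path in sources:
            # Same (binary) mode on both sides, copied chunk by chunk
            with open(path, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile, chunk_size)
    return sources
```

Comparing absolute paths rather than bare names keeps the output file out of the input list even when the pattern includes a directory.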

Using Python 2.7, I did some "benchmark" testing of

outfile.write(infile.read())

vs.

shutil.copyfileobj(readfile, outfile)

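I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.

When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: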

outfile.write, binary mode: 43 seconds, on average.

shutil.copyfileobj, normal mode: 27 seconds, on average.

The outfile had a final size of 2620 MB in normal read mode versus 2578 MB in binary read mode.
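The timing harness behind these numbers was not posted. A minimal sketch of how such a comparison could be run on Python 3 follows; `copy_with_read`, `copy_with_shutil`, and `time_concat` are hypothetical names, and results will vary with the disk, the OS file cache, and the file sizes involved.

```python
import shutil
import time


def copy_with_read(readfile, outfile):
    # Reads the whole input into memory, then writes it in one call
    outfile.write(readfile.read())


def copy_with_shutil(readfile, outfile):
    # Copies in fixed-size chunks without loading the whole file
    shutil.copyfileobj(readfile, outfile)


def time_concat(copy_func, filenames, outfilename, binary=True):
    """Return the seconds taken to concatenate `filenames` using `copy_func`."""
    read_mode, write_mode = ('rb', 'wb') if binary else ('r', 'w')
    start = time.perf_counter()
    with open(outfilename, write_mode) as outfile:
        for fname in filenames:
            with open(fname, read_mode) as readfile:
                copy_func(readfile, outfile)
    return time.perf_counter() - start
```

Calling `time_concat` once per combination of copy function and mode, ideally several times each, gives figures comparable in kind to the averages quoted above.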

There is no need to use that many variables.

with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")

The fileinput module provides a natural way to iterate over multiple files:

import fileinput, glob

for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
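A fuller sketch of the fileinput approach, assuming Python 3.2 or later (where `fileinput.input()` can be used as a context manager). The function name `concat_lines` and its return value are illustrative additions, not part of the original answer.

```python
import fileinput
import glob


def concat_lines(pattern, outfilename):
    """Write every line of the files matching `pattern` to `outfilename`."""
    # Resolve the inputs first so the output file is never among them
    filenames = sorted(f for f in glob.glob(pattern) if f != outfilename)
    if not filenames:
        # fileinput.input() reads standard input when given an empty list,
        # so handle that case explicitly by creating an empty output file
        open(outfilename, 'w').close()
        return 0
    count = 0
    with open(outfilename, 'w') as outfile, fileinput.input(filenames) as lines:
        for line in lines:
            outfile.write(line)
            count += 1
    return count
```

Because fileinput works line by line in text mode, this trades some speed for convenience compared with a chunked binary copy.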

I was curious about the performance, so I tried the answers by Martijn Pieters and Stephen Miller.

I tried binary and text modes, both with and without shutil. The test merged 270 files.

Text mode -

def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())

Binary mode -

def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())

Running times for binary mode -

Shutil - 20.161773920059204
Normal - 17.327500820159912

Running times for text mode -

Shutil - 20.47757601737976
Normal - 13.718038082122803

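It looks like shutil performs about the same in both modes (roughly 20.2 versus 20.5), while the plain read()/write() version is faster in text mode (about 13.7) than in binary mode (about 17.3).

OS: macOS 10.14 Mojave, MacBook Air 2017.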

You can iterate over the lines of a file object directly, without reading the whole thing into memory:

with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
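A related shorthand, offered as a sketch rather than as part of the original answer: `writelines()` accepts any iterable of strings, so the file object can be passed to it directly and is still consumed line by line. The helper name `append_lines` is hypothetical.

```python
def append_lines(fname, outfile):
    """Copy `fname` into the already-open text file `outfile`, line by line."""
    with open(fname, 'r') as readfile:
        # A file object is an iterable of lines, so no explicit loop is needed
        outfile.writelines(readfile)
```

Note that `writelines()` does not add line separators; it writes each line exactly as read.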

