Python script to concatenate all the files in the directory into one file
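I wrote the following script to concatenate all the files in a directory into one file.
Can it be optimized with regard to idiomatic Python and running time?
Here is the snippet: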
import time, glob
outfilename = 'all_' + str((int(time.time()))) + ".txt"
filenames = glob.glob('*.txt')
with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
Use shutil.copyfileobj to copy the data:
import shutil
with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
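shutil reads from the readfile object in chunks and writes them directly to the outfile object. Do not use readline() or an iteration buffer, since you do not need the overhead of finding line endings.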
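To make the chunked behavior concrete, here is a minimal sketch of what a chunked copy loop looks like. The function name copy_in_chunks and the 64 KiB chunk size are assumptions chosen for illustration; this is not shutil's actual implementation.

```python
import io

def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy from file-like src to dst in fixed-size chunks (illustrative only)."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty result means end of file
            break
        dst.write(chunk)

# Usage with in-memory binary buffers standing in for real files
source = io.BytesIO(b"line one\nline two\n")
target = io.BytesIO()
copy_in_chunks(source, target, chunk_size=4)
```

Only one chunk is held in memory at a time, and no line endings are searched for.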
Use the same mode for both reading and writing; this is especially important when using Python 3. I've used binary mode for both here.
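A small sketch of why the modes need to match under Python 3, using an in-memory io.BytesIO as a stand-in for a file opened with 'wb' (an assumption made to keep the example self-contained):

```python
import io

binary_out = io.BytesIO()  # behaves like a file opened in binary mode

try:
    binary_out.write("text read in 'r' mode")  # str into a binary stream
except TypeError as exc:
    error = exc  # Python 3 refuses to mix str and bytes
else:
    error = None

binary_out.write(b"bytes read in 'rb' mode")  # matching types work
```

Under Python 2 the str and bytes types were the same, which is why the mismatch in the question's code can go unnoticed there.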
Using Python 2.7, I ran a rough "benchmark" comparing
outfile.write(infile.read())
vs.
shutil.copyfileobj(readfile, outfile)
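I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined file size of about 2.6 GB. With both methods, normal read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.
When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: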
outfile.write, binary mode: 43 seconds, on average.
shutil.copyfileobj, normal mode: 27 seconds, on average.
The final size of the outfile was 2620 MB in normal read mode and 2578 MB in binary read mode.
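The answer does not show how the timings were taken. One possible way to collect averages like these is sketched below; the helper name time_call and the repeat count are hypothetical, and the sketch assumes Python 3 (time.perf_counter), not the Python 2.7 used above.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the average wall-clock seconds over `repeats` calls of func(*args)."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return total / repeats

# Usage: time any callable, here a trivial stand-in for a concatenation function
average_seconds = time_call(sum, range(1000))
```

For file concatenation, results will also depend on disk caching, so repeated runs on the same files may not be directly comparable to a cold first run.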
There is no need to use that many variables.
with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
The fileinput module provides a natural way to iterate over multiple files:
for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
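The snippet above assumes outfile is already open. A self-contained sketch might look like the following; the function name concat_with_fileinput, the sorting of the inputs, and the exclusion of the output file are assumptions added for illustration.

```python
import fileinput
import glob
import os

def concat_with_fileinput(pattern, outfilename):
    """Concatenate files matching `pattern` into `outfilename`, line by line."""
    # Collect the inputs first, skipping the output file in case it matches the pattern.
    filenames = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != os.path.abspath(outfilename)
    )
    with open(outfilename, 'w') as outfile:
        if not filenames:
            return  # an empty list would make fileinput read from standard input
        with fileinput.input(files=filenames) as lines:
            for line in lines:
                outfile.write(line)
```

Calling concat_with_fileinput('*.txt', 'all.txt') would merge the text files in the current directory into all.txt.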
I was curious about the performance, so I benchmarked the answers by Martijn Pieters and Stephen Miller.
I tried binary and text modes, with shutil and without shutil, merging 270 files.
Text mode -
def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())
Binary mode -
def using_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)

def without_shutil_text(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())
Running times for binary mode -
Shutil - 20.161773920059204
Normal - 17.327500820159912
Running times for text mode -
Shutil - 20.47757601737976
Normal - 13.718038082122803
It looks like shutil performs about the same in both modes, while for the plain read/write approach text mode is faster than binary.
OS: Mac OS 10.14 Mojave. MacBook Air 2017.
You can iterate over the lines of a file object directly, without reading the whole thing into memory:
with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
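Put together with the loop over files from the question, this could be sketched as follows. The function name concat_by_lines and its parameters are hypothetical; the "\n\n" separator mirrors the question's script.

```python
import glob
import os

def concat_by_lines(pattern, outfilename, separator="\n\n"):
    """Stream each matching file into outfilename one line at a time."""
    # Resolve the inputs up front and leave out the output file itself.
    filenames = sorted(
        name for name in glob.glob(pattern)
        if os.path.abspath(name) != os.path.abspath(outfilename)
    )
    with open(outfilename, 'w') as outfile:
        for fname in filenames:
            with open(fname, 'r') as readfile:
                for line in readfile:  # one line in memory at a time
                    outfile.write(line)
            outfile.write(separator)  # written once after each input file
```

This keeps memory use low for large files, at the cost of the per-line overhead mentioned in the shutil answer.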