Python script to concatenate all the files in the directory into one file
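I wrote the following script to concatenate all the files in a directory into one file.
Can this be optimized with regard to:
idiomatic Python
time
Here is the snippet: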
import time, glob
outfilename = 'all_' + str((int(time.time()))) + ".txt"
filenames = glob.glob('*.txt')
with open(outfilename, 'wb') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            infile = readfile.read()
            for line in infile:
                outfile.write(line)
            outfile.write("\n\n")
Use shutil.copyfileobj to copy the data:
import shutil
with open(outfilename, 'wb') as outfile:
    for filename in glob.glob('*.txt'):
        if filename == outfilename:
            # don't want to copy the output into the output
            continue
        with open(filename, 'rb') as readfile:
            shutil.copyfileobj(readfile, outfile)
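shutil reads from the readfile object in chunks and writes them directly to the outfile object. Do not use readline() or iterate over a buffer, since you do not need the overhead of finding line endings.
Use the same mode for both reading and writing; this is especially important when using Python 3. I have used binary mode for both here.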
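To make the chunked copying concrete, here is a minimal sketch of roughly what such a copy loop does. The name `copy_in_chunks` and the 64 KiB `chunk_size` are chosen for illustration only; they are assumptions, not shutil's actual internals or defaults.

```python
import io


def copy_in_chunks(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in fixed-size chunks; return the amount copied."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of file
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


# Both objects are binary here, so reading and writing use the same mode.
source = io.BytesIO(b"x" * 200_000)
target = io.BytesIO()
total = copy_in_chunks(source, target)
```

Because no step looks for line endings, the loop does the same amount of work whether the data contains many short lines or none at all.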
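Using Python 2.7, I did some "benchmark" testing of
outfile.write(infile.read())
vs
shutil.copyfileobj(readfile, outfile)
I iterated over 20 .txt files ranging in size from 63 MB to 313 MB, with a combined file size of about 2.6 GB. With both methods, normal (text) read mode performed better than binary read mode, and shutil.copyfileobj was generally faster than outfile.write.
When comparing the worst combination (outfile.write, binary mode) with the best combination (shutil.copyfileobj, normal read mode), the difference was quite significant: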
outfile.write, binary mode: 43 seconds, on average.
shutil.copyfileobj, normal mode: 27 seconds, on average.
The outfile had a final size of 2620 MB in normal read mode and 2578 MB in binary read mode.
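Timings like these can be reproduced with a small harness. The sketch below is one way to do it, assuming Python 3; `time_call` is a hypothetical helper, and the in-memory copies merely stand in for the real file concatenation, so the numbers it prints say nothing about the results above.

```python
import io
import shutil
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def copy_with_shutil(data):
    # Stand-in workload: copy an in-memory buffer with shutil.copyfileobj.
    src, dst = io.BytesIO(data), io.BytesIO()
    shutil.copyfileobj(src, dst)
    return dst.getvalue()


def copy_with_read_write(data):
    # Stand-in workload: read everything at once, then write it.
    src, dst = io.BytesIO(data), io.BytesIO()
    dst.write(src.read())
    return dst.getvalue()


payload = b"line of text\n" * 100_000
shutil_result, shutil_seconds = time_call(copy_with_shutil, payload)
plain_result, plain_seconds = time_call(copy_with_read_write, payload)
print(f"shutil.copyfileobj: {shutil_seconds:.4f} s")
print(f"read() + write():   {plain_seconds:.4f} s")
```

A single run is noisy. With real files, operating system disk caching can favor whichever method runs second, so repeating each measurement and averaging gives a fairer comparison.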
There is no need to use that many variables.
with open(outfilename, 'w') as outfile:
    for fname in filenames:
        with open(fname, 'r') as readfile:
            outfile.write(readfile.read() + "\n\n")
The fileinput module provides a natural way to iterate over multiple files
for line in fileinput.input(glob.glob("*.txt")):
    outfile.write(line)
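For completeness, a self-contained version of the fileinput approach might look like the sketch below. The function name `concat_with_fileinput` and the demo file names are made up for illustration. One detail worth guarding against: when fileinput.input() is given an empty list, it falls back to sys.argv[1:] and then to standard input, so the sketch returns early in that case.

```python
import fileinput
import os
import tempfile


def concat_with_fileinput(paths, out_path):
    """Write the lines of every file in paths to out_path, in order."""
    paths = [p for p in paths if p != out_path]  # skip the output file itself
    with open(out_path, "w") as outfile:
        if not paths:
            # An empty list would make fileinput read sys.argv[1:] or stdin.
            return
        with fileinput.input(files=paths) as lines:
            for line in lines:
                outfile.write(line)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    part_one = os.path.join(tmp, "one.txt")
    part_two = os.path.join(tmp, "two.txt")
    with open(part_one, "w") as f:
        f.write("alpha\n")
    with open(part_two, "w") as f:
        f.write("beta\ngamma\n")
    combined = os.path.join(tmp, "combined.out")
    concat_with_fileinput([part_one, part_two], combined)
    with open(combined) as f:
        combined_text = f.read()
```

Since fileinput works line by line in text mode, it pays the line-splitting overhead that the chunked binary copy avoids; it is convenient rather than fast.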
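I was curious to learn more about the performance, so I used the answers of Martijn Pieters and Stephen Miller.
I tried binary and text modes, with shutil and without shutil. I tried to merge 270 files.
Text mode -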
def using_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                shutil.copyfileobj(readfile, outfile)
def without_shutil_text(outfilename):
    with open(outfilename, 'w') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'r') as readfile:
                outfile.write(readfile.read())
Binary mode -
def using_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                shutil.copyfileobj(readfile, outfile)
def without_shutil_binary(outfilename):
    with open(outfilename, 'wb') as outfile:
        for filename in glob.glob('*.txt'):
            if filename == outfilename:
                # don't want to copy the output into the output
                continue
            with open(filename, 'rb') as readfile:
                outfile.write(readfile.read())
Running times for binary mode -
Shutil - 20.161773920059204
Normal - 17.327500820159912
Running times for text mode -
Shutil - 20.47757601737976
Normal - 13.718038082122803
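It looks like shutil performs about the same in both modes, while for the plain read()/write() approach text mode is faster than binary mode.
OS: Mac OS 10.14 Mojave. MacBook Air 2017.
You can iterate over the lines of a file object directly, without reading the whole thing into memory: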
with open(fname, 'r') as readfile:
    for line in readfile:
        outfile.write(line)
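Put together as a full function, the line-by-line approach could look like the following sketch. `concat_by_line` and its `separator` parameter are illustrative names, and the blank-line separator simply mirrors the "\n\n" written by the script in the question. It passes the file object straight to writelines(), which consumes the same lazy line iterator as the loop above.

```python
import os
import tempfile


def concat_by_line(paths, out_path, separator="\n\n"):
    """Stream each input file into out_path, one line at a time."""
    with open(out_path, "w") as outfile:
        for path in paths:
            if os.path.abspath(path) == os.path.abspath(out_path):
                continue  # never copy the output into itself
            with open(path, "r") as readfile:
                # A file object yields its lines lazily, so the whole
                # file is never loaded into memory at once.
                outfile.writelines(readfile)
            outfile.write(separator)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.txt")
    second = os.path.join(tmp, "b.txt")
    with open(first, "w") as f:
        f.write("one\ntwo\n")
    with open(second, "w") as f:
        f.write("three\n")
    merged = os.path.join(tmp, "merged.out")
    concat_by_line([first, second], merged)
    with open(merged) as f:
        merged_text = f.read()
```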