Read, compress, write with multiprocessing
I'm compressing files. A single process is fine for a few of them, but I'm compressing thousands, which could (and already has) taken days, so I want to speed it up with multiprocessing. From what I've read, I should avoid having multiple processes reading files at the same time, and I'm guessing I shouldn't have multiple processes writing at once either. This is my current method that runs singly:
import tarfile, bz2, os

def compress(folder):
    "compresses a folder into a file"
    bz_file = bz2.BZ2File(folder + '.tbz', 'w')
    with tarfile.open(mode='w', fileobj=bz_file) as tar:
        for fn in os.listdir(folder):
            # read each file in the folder and do some preprocessing
            # that will make the compressed file much smaller than without
            tar.addfile( processed file )
    bz_file.close()
    return
This takes a folder and compresses all of its contents into a single file, which makes them easier to handle and more organized. If I just tossed this into a pool, I'd have several processes reading and writing at the same time, so I want to avoid that. I can rework it so only one process is reading the files, but I'd still have multiple processes writing:
import multiprocessing as mp
import tarfile, bz2, os

def compress(file_list):
    folder = file_list[0]
    bz_file = bz2.BZ2File(folder + '.tbz', 'w')
    with tarfile.open(mode='w', fileobj=bz_file) as tar:
        for i in file_list[1:]:
            # preprocess file data
            tar.addfile( processed data )
    bz_file.close()
    return

cpu_count = mp.cpu_count()
p = mp.Pool(cpu_count)
for subfolder in os.listdir(main_folder):
    # read all files in subfolder into memory, place into file_list
    # place file_list into fld_list until fld_list contains cpu_count
    # file lists, then pass to p.map(compress, fld_list)
This still has many processes writing compressed files at once. Just the act of telling tarfile what kind of compression to use starts writing to the hard drive. I can't read all of the files I need to compress into memory, because I don't have that much RAM to do so, and it also has the problem of me restarting Pool.map many times over.

How can I read and write files in a single process, but have all the compression in several processes, while avoiding restarting multiprocessing.Pool multiple times?
Rather than using multiprocessing.Pool, you should use multiprocessing.Queue and create an inbox and an outbox.

Start a single process to read in the files and put the data into the inbox queue, and put a limit on the size of the queue so you don't end up filling your RAM. The example here compresses single files, but it can be adjusted to process whole folders at once.
import os

def reader(inbox, input_path, num_procs):
    "process that reads in files to be compressed and puts to inbox"
    for fn in os.listdir(input_path):
        path = os.path.join(input_path, fn)
        # read in each file, put data into inbox
        fname = os.path.basename(path)
        with open(path, 'r') as src:
            lines = src.readlines()
        data = [fname, lines]
        inbox.put(data)  # blocks while the queue is full
    # read in everything, add finished notice for all running processes
    for i in range(num_procs):
        inbox.put(None)  # when a compressor sees a None, it will stop
    inbox.close()
    return
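The RAM cap comes from the bounded queue: put blocks once maxsize items are pending, so the reader throttles itself to the pace of the compressors. A minimal sketch of that behavior (the queue size of 2 is arbitrary):

import multiprocessing as mp

q = mp.Queue(2)    # holds at most two items at a time
q.put('a')
q.put('b')
# q.put('c') would now block until a consumer calls q.get()
print(q.get())     # 'a' -- frees a slot, so another put may proceed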
But that's only half the problem; the other part is compressing the files without having to write them to disk. We give a BytesIO object to the compression function instead of an open file; it is passed to tarfile. (A BytesIO buffer is needed rather than a StringIO one, because tarfile writes bytes.) Once compressed, we would put the BytesIO object into the outbox queue.

Except we can't do that, because BytesIO objects can't be pickled, and only picklable objects can go into a queue. However, getvalue can hand back the contents in a picklable form (plain bytes), so grab the contents with getvalue, close the BytesIO object, and then put the contents into the outbox.
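To see why the buffer itself can't ride through the queue, here is a quick illustration (my sketch, not part of the original answer; multiprocessing.Queue pickles everything it transports):

import io, pickle

buf = io.BytesIO(b'compressed bytes')
try:
    pickle.dumps(buf)           # file-like buffers are not picklable
except TypeError as err:
    print(err)                  # e.g. "cannot pickle '_io.BytesIO' object"
payload = buf.getvalue()        # plain bytes, which pickle just fine
assert pickle.loads(pickle.dumps(payload)) == payload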
from io import BytesIO
import tarfile

def compressHandler(inbox, outbox):
    "process that pulls from inbox, compresses and puts to outbox"
    supplier = iter(inbox.get, None)  # two-argument iter: yields until inbox.get returns None
    while True:
        try:
            data = next(supplier)     # grab data from inbox
            pressed = compress(data)  # compress it
            outbox.put(pressed)       # put into outbox
        except StopIteration:
            outbox.put(None)  # finished compressing, inform the writer
            return            # and quit

def compress(data):
    "compress file"
    bz_file = BytesIO()
    fname, lines = data  # see reader def for package order
    raw = ''.join(lines).encode('utf-8')
    with tarfile.open(mode='w:bz2', fileobj=bz_file) as tar:
        info = tarfile.TarInfo(fname)    # store file name
        info.size = len(raw)             # tarfile needs the member size up front
        tar.addfile(info, BytesIO(raw))  # compress
    data = (fname, bz_file.getvalue())   # the writer expects (name, bytes)
    bz_file.close()
    return data
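As a quick sanity check (a hypothetical round trip, using the [fname, lines] packaging from the reader), the bytes that come back open cleanly with tarfile:

import tarfile
from io import BytesIO

name, blob = compress(['example.txt', ['hello\n', 'world\n']])
with tarfile.open(mode='r:bz2', fileobj=BytesIO(blob)) as tar:
    member = tar.extractfile('example.txt')
    print(member.read())  # b'hello\nworld\n'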
Then the writer process pulls content from the outbox queue and writes it to disk. This function needs to know how many compression processes were started, so it knows to stop only once it has heard that every one of them has stopped.
import os

def writer(outbox, output_path, num_procs):
    "single process that writes compressed files to disk"
    num_finished = 0
    while True:
        # all compression processes have finished
        if num_finished >= num_procs:
            break
        tardata = outbox.get()
        # a compression process has finished
        if tardata is None:
            num_finished += 1
            continue
        fn, data = tardata
        name = os.path.join(output_path, fn) + '.tbz'
        with open(name, 'wb') as dst:
            dst.write(data)
    return
Finally, there is the setup that puts it all together:
import multiprocessing as mp
import os

def setup():
    fld = 'file/path'
    # multiprocess setup
    num_procs = mp.cpu_count()
    # inbox and outbox queues
    inbox = mp.Queue(4 * num_procs)  # limit size to cap RAM use
    outbox = mp.Queue()
    # one process to read (named reader_proc so it doesn't shadow the reader function)
    reader_proc = mp.Process(target=reader, args=(inbox, fld, num_procs))
    reader_proc.start()
    # n processes to compress
    compressors = [mp.Process(target=compressHandler, args=(inbox, outbox))
                   for i in range(num_procs)]
    for c in compressors:
        c.start()
    # one process to write
    writer_proc = mp.Process(target=writer, args=(outbox, fld, num_procs))
    writer_proc.start()
    writer_proc.join()  # wait for it to finish; it stops after all compressors report done
    print('done!')
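One caveat worth adding (my note, not in the original answer): on platforms that use the spawn start method (Windows, and macOS since Python 3.8), the entry point must be guarded, or every child process will re-execute the module on import:

if __name__ == '__main__':
    setup()  # required under the 'spawn' start method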