I am writing a function to convert PDF files to PNG images. It looks like this:
import os

from wand.image import Image


def convert_pdf(filename, resolution):
    with Image(filename=filename, resolution=resolution) as img:
        pages_dir = os.path.join(os.path.dirname(filename), 'pages')
        page_filename = os.path.splitext(os.path.basename(filename))[0] + '.png'
        # exist_ok=True avoids an error when 'pages' already exists
        os.makedirs(pages_dir, exist_ok=True)
        img.save(filename=os.path.join(pages_dir, page_filename))
When I try to parallelize it, memory usage keeps growing and I cannot finish processing my PDF files:
import glob

from joblib import Parallel, delayed


def convert(dataset, resolution):
    Parallel(n_jobs=-1, max_nbytes=None)(
        delayed(convert_pdf)(filename, resolution)
        for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )
When I call the function serially, memory usage stays constant.
How does joblib manage memory allocation for each parallel instance?
How can I modify my code so that memory usage stays constant when running in parallel?
Joblib uses serialization to pass data to its workers, and each worker is a separate process with its own memory. Memory use therefore grows with the number of workers.
From the docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
There is no way to process 2 files in parallel with only the memory of 1 (if you really want a speedup)!
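As a rough illustration of that scaling, the sketch below picks the largest worker count whose combined peak memory fits a budget. The helper name `max_parallel_jobs` and the per-job memory figure are hypothetical; the per-job figure is something you would have to measure for your own PDFs and resolution.

```python
import os


def max_parallel_jobs(memory_budget_bytes, per_job_bytes, cpu_count=None):
    """Return the largest worker count whose combined peak memory fits the budget.

    per_job_bytes is an assumed, measured peak for converting one PDF.
    At least one worker is always returned, even if a single job
    exceeds the budget.
    """
    if per_job_bytes <= 0:
        raise ValueError("per_job_bytes must be positive")
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    fits_in_memory = memory_budget_bytes // per_job_bytes
    return max(1, min(cpu_count, fits_in_memory))
```

For example, with an 8 GiB budget and an assumed 3 GiB peak per PDF, only two workers fit, regardless of how many cores the machine has.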
The docs also mention memory maps, which are often used for numerical data and when workers share data (the OS is then responsible for caching). They won't help here because your workers share no data. Since the OS keeps memory maps cache-friendly automatically, out-of-memory crashes should not happen with them, but the extra I/O (as opposed to caching) will cost performance.
So in short: limit the number of workers instead of using n_jobs=-1, for example n_jobs=4.
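A minimal sketch of what a capped worker count could look like, assuming joblib is installed. The name `convert_all` and its `worker` parameter are hypothetical; `worker` stands in for the `convert_pdf` function from the question, and the default of 4 is only an example value.

```python
import glob
import os

from joblib import Parallel, delayed


def convert_all(dataset, resolution, worker, n_jobs=4):
    """Call worker(filename, resolution) for every PDF under dataset.

    n_jobs caps how many worker processes run at once, and therefore
    how many PDFs are held in memory at the same time.
    """
    pattern = os.path.join(dataset, '**', '*.pdf')
    filenames = glob.iglob(pattern, recursive=True)
    # max_nbytes=None is kept from the question's original call.
    return Parallel(n_jobs=n_jobs, max_nbytes=None)(
        delayed(worker)(filename, resolution) for filename in filenames
    )
```

It would be called as `convert_all(dataset, resolution, convert_pdf, n_jobs=4)`. Peak memory should then be roughly bounded by four conversions at a time rather than one per CPU core.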