
How can scipy.weave.inline be used in a MPI-enabled application on a cluster?

If scipy.weave.inline is called inside a massively parallel MPI-enabled application that runs on a cluster whose home directory is shared by all nodes, every instance accesses the same catalog of compiled code: $HOME/.pythonxx_compiled. This is bad for obvious reasons and leads to many error messages. How can this problem be circumvented?

As per the scipy docs, you could store your compiled code in a directory that isn't on the NFS share (such as /tmp or /scratch, or whatever is available on your system). Then you wouldn't have to worry about conflicts. You just need to set the PYTHONCOMPILED environment variable to point somewhere else.
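For example, something along these lines could be put at the start of the program (a minimal sketch; the /tmp path and the use of the USER environment variable are just placeholders for whatever local scratch space your cluster provides):

import os

# example only: pick a node-local path instead of the NFS-shared home directory
compiled_dir = '/tmp/' + os.environ.get('USER', 'user') + '/pythoncompiled'
try:
    os.makedirs(compiled_dir)
except OSError:
    pass  # directory already exists

# scipy.weave.catalog consults PYTHONCOMPILED, so set it before the first call to scipy.weave.inline
os.environ['PYTHONCOMPILED'] = compiled_dir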

My previous thoughts about this problem:

Either scipy.weave.catalog has to be enhanced with a proper locking mechanism in order to serialize access to the catalog, or every instance has to use its own catalog.

I chose the latter. The scipy.weave.inline function uses a catalog which is bound to the module-level name function_catalog of the scipy.weave.inline_tools module. This can be discovered by looking into the code of that module ( https://github.com/scipy/scipy/tree/v0.12.0/scipy/weave ).

The simplest solution is to monkey-patch this name to point to something else at the beginning of the program:

from mpi4py import MPI

import numpy as np

import scipy.weave.inline_tools
import scipy.weave.catalog

import os
import os.path

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# one catalog directory per MPI rank; some_path is a base directory of your choice
catalog_dir = os.path.join(some_path, 'rank' + str(rank))
try:
    os.makedirs(catalog_dir)
except OSError:
    pass  # directory already exists

# monkey-patch the catalog
scipy.weave.inline_tools.function_catalog = scipy.weave.catalog.catalog(catalog_dir)

Now inline works smoothly: each instance has its own catalog inside the shared NFS directory. Of course this naming scheme breaks if two distinct parallel tasks run at the same time, but that would also be the case if the catalog were in /tmp.

Edit: As mentioned in a comment above, this procedure still causes problems if multiple independent jobs are run in parallel. This can be remedied by adding a random UUID to the path name:

import uuid

# rank 0 generates a random UUID and distributes it to every rank
u = None
if rank == 0:
    u = str(uuid.uuid4())

u = comm.scatter([u]*size, root=0)

# '<username>' is a placeholder for the actual user name
catalog_dir = os.path.join('/tmp/<username>/pythoncompiled', u + '-' + str(rank))
os.makedirs(catalog_dir)

# monkey-patch the catalog
scipy.weave.inline_tools.function_catalog = scipy.weave.catalog.catalog(catalog_dir)

Of course it would be nice to delete those files after the computation:

import shutil
shutil.rmtree(catalog_dir)
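If the computation can fail, the cleanup can also be placed in a finally block so that the per-rank directory is removed in any case (a sketch of my own, not part of the original answer):

import shutil

try:
    pass  # ... calculations using scipy.weave.inline here ...
finally:
    # remove this rank's catalog directory even if the computation raised
    shutil.rmtree(catalog_dir, ignore_errors=True)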

Edit: There were some additional problems. The intermediate directory where the .cpp and .o files are stored also had trouble with simultaneous access from different instances, so the above method has to be extended to that directory:

basetmp = some_path
catalog_dir = os.path.join(basetmp, 'pythoncompiled', u + '-' + str(rank))
intermediate_dir = os.path.join(basetmp, 'pythonintermediate', u + '-' + str(rank))

os.makedirs(catalog_dir, mode=0o700)
os.makedirs(intermediate_dir, mode=0o700)

# monkey-patch the catalog and the intermediate directory
scipy.weave.inline_tools.function_catalog = scipy.weave.catalog.catalog(catalog_dir)
scipy.weave.catalog.intermediate_dir = lambda: intermediate_dir

# ... calculations here ...

shutil.rmtree(catalog_dir)
shutil.rmtree(intermediate_dir)

A quick fix is to use a local directory on each node (such as /tmp, as Wesley mentioned), but, if you have the capacity, run only one MPI task per node.
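A sketch of what this could look like (my own illustration, not part of the answer above; the /tmp path is a placeholder), with a per-rank subdirectory so that several MPI tasks on the same node do not clash either:

from mpi4py import MPI
import os

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# node-local base directory plus a per-rank subdirectory
local_dir = os.path.join('/tmp', os.environ.get('USER', 'user'),
                         'pythoncompiled', 'rank' + str(rank))
try:
    os.makedirs(local_dir)
except OSError:
    pass  # directory already exists

# point weave's compiled-code catalog at the node-local directory
os.environ['PYTHONCOMPILED'] = local_dir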
