Sharing extension-allocated data structures across python processes

I am working on a python extension that does some specialized tree things on a large (5GB+) in-memory data structure.

The underlying data structure is already thread-safe (it uses RW locks), and I've written a wrapper around it that exposes it to python. It supports multiple simultaneous readers, only one writer, and I'm using a pthread_rwlock for access synchronization. Since the application is very read-heavy, multiple readers should provide a decent performance improvement (I hope).

However, I cannot determine what the proper solution is to allow the extension data-store to be shared across multiple python processes accessed via the multiprocessing module.

Basically, I'd like something that looks like the current multiprocessing.Value / multiprocessing.Array system, but the caveat here is that my extension is allocating all its own memory in C++.

How can I allow multiple processes to access my shared data structure?

Sources are here (C++ Header-only library), and here (Cython wrapper).

Right now, if I build an instance of the tree and then pass references out to multiple processes, it fails with a serialization error:

Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/queues.py", line 242, in _feed
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.4/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.TextIOWrapper' object

(failing test case)

I'm currently releasing the GIL in my library, but there are some future tasks that would greatly benefit from independent processes, and I'd like to avoid having to implement a RPC system for talking to the BK tree.

If the extension data is going to exist as a single logical mutable object across multiple processes (so a change in Process A will be reflected in the view in Process B), you can't avoid some sort of IPC mechanism. The memory spaces of two processes are separate; Python can't magically share unshared data.

The closest you could get (short of explicitly setting up shared memory at the C layer so the same memory can be mapped into each process) would be a custom subclass of multiprocessing.BaseManager, which just hides the IPC from you: the actual object lives in a single process, and the other processes proxy their calls to that original object. You can see a really simple example in the multiprocessing docs.
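A minimal sketch of the manager approach (BkTree here is a hypothetical pure-Python stand-in for your extension class; the registration and proxying work the same way for the real wrapper):

```python
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class BkTree:
    """Stand-in for the C++-backed tree; the real one lives in the extension."""
    def __init__(self):
        self._items = set()

    def insert(self, item):
        self._items.add(item)

    def contains(self, item):
        return item in self._items

class TreeManager(BaseManager):
    pass

# Register the class; the single real instance will live in the manager process.
TreeManager.register('BkTree', BkTree)

def worker(tree, item):
    # `tree` is a proxy: this call is forwarded to the manager process.
    tree.insert(item)

def run_demo():
    with TreeManager() as manager:
        tree = manager.BkTree()  # returns a proxy, which pickles fine
        procs = [Process(target=worker, args=(tree, i)) for i in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return [tree.contains(i) for i in range(4)]

if __name__ == '__main__':
    print(run_demo())
```

Only method arguments and return values cross process boundaries, so the 5GB structure itself is never pickled; the trade-off is that every call pays IPC overhead.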

The manager approach is simple, but its performance probably won't be great; shared memory at the C layer avoids overhead that the proxying mechanism can't. You'd need to test to verify anything. To my knowledge, making the C++ STL allocate out of shared memory would be a royal pain and probably isn't worth the trouble, so unless the manager approach proves too slow, I'd avoid even attempting that optimization.
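For reference, the OS-level mechanism the shared-memory option relies on can be sketched in a few lines: an anonymous shared mapping created before fork() is the same physical memory in parent and child, so a write in one process is visible in the other with no copying or pickling. (This is only the primitive; your extension would additionally need a custom allocator placing all of the tree's nodes inside such a region, which is the painful part. Unix-only sketch.)

```python
import mmap
import os
import struct

def demo():
    # mmap(-1, n) on Unix creates an anonymous MAP_SHARED region.
    buf = mmap.mmap(-1, 8)
    pid = os.fork()
    if pid == 0:
        # Child: write into the shared region, then exit without cleanup.
        struct.pack_into('q', buf, 0, 42)
        os._exit(0)
    os.waitpid(pid, 0)
    # Parent: the child's write is visible here, no IPC beyond the mapping.
    value, = struct.unpack_from('q', buf, 0)
    return value
```

The contrast with the manager approach is that reads here cost a plain memory access rather than a round trip, which is why it wins for read-heavy workloads, if you can stomach the allocator work.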
