How to use shared memory instead of passing objects via pickling between multiple processes

I am working on a CPU-intensive ML problem which is centered around an additive model. Since addition is the main operation, I can divide the input data into pieces and spawn multiple models, which are then merged by the overridden __add__ method.

The code relating to the multiprocessing looks like this:

import copy
from functools import partial

def pool_worker(filename, doshuffle):
    print(f"Processing file: {filename}")
    with open(filename, 'r') as f:
        partial_model = FragmentModel(order=args.order, indata=f, shuffle=doshuffle)
        return partial_model

def generateModel(is_mock=False, save=True):
    model = None
    with ThreadPool(args.nthreads) as pool:
        partial_models = pool.imap_unordered(partial(pool_worker, doshuffle=is_mock), args.input)
        for i, m in enumerate(partial_models):
            logger.info(f'Starting to merge model {i}')
            if model is None:
                model = copy.deepcopy(m)
            else:
                model += m
            logger.info(f'Done merging...')

    return model

The issue is that the memory consumption scales exponentially as the model order increases, so at order 4 each instance of the model is about 4-5 GB, which causes the thread pool to crash because the intermediate model objects are then not pickleable.

I read about this a bit, and it appears that even if the pickling itself were not an issue, passing data like this would still be extremely inefficient, as noted in the comments on this answer.

There is very little guidance, however, on how to use shared memory for this purpose. Is it possible to avoid this problem without having to change the internals of the model object?

Since Python 3.8, there is multiprocessing.shared_memory that enables direct memory sharing between processes, similar to "real" multi-threading in C or Java. Direct memory sharing can be significantly faster than sharing via files, sockets, or data copy serialization/deserialization.

It works by providing a shared memory buffer on which different processes can claim and declare variables, via a basic SharedMemory class instance or a more advanced SharedMemoryManager class instance. Variables of basic Python data types can be conveniently declared using the built-in ShareableList. Variables of advanced data types such as numpy.ndarray can be shared by specifying the memory buffer when declaring them.
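For basic data types, a minimal ShareableList sketch (the values are illustrative; in a real program the name would be passed to the other process) could look like this:

from multiprocessing import shared_memory

# IN THE MAIN PROCESS
# create a shared list of basic Python values (int, float, bool, str, bytes, None)
sl = shared_memory.ShareableList([1, 2.0, "three", None])
name = sl.shm.name   # pass this name to the other process

# IN ANOTHER PROCESS
# attach to the same list by name
other = shared_memory.ShareableList(name=name)
other[0] = 42        # updates are visible to every attached process
# note: str/bytes slots cannot grow beyond their original byte length
other.shm.close()    # each process closes its own handle

# IN THE MAIN PROCESS
assert sl[0] == 42
sl.shm.close()
sl.shm.unlink()      # the creating process frees the memory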

Example with numpy.ndarray:

import numpy as np
from multiprocessing import shared_memory

# setting up
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
d_shape = (len(data),)
d_type = np.int64
d_size = np.dtype(d_type).itemsize * np.prod(d_shape)

# IN THE MAIN PROCESS
# allocate new shared memory
shm = shared_memory.SharedMemory(create=True, size=d_size)
shm_name = shm.name
# numpy array on shared memory buffer
a = np.ndarray(shape=d_shape, dtype=d_type, buffer=shm.buf)
# copy data into shared memory ndarray once
a[:] = data[:]

# IN ANOTHER PROCESS
# reuse existing shared memory
ex_shm = shared_memory.SharedMemory(name=shm_name)
# numpy array b uses the same memory buffer as a
b = np.ndarray(shape=d_shape, dtype=d_type, buffer=ex_shm.buf)
# changes in b will be reflected in a and vice versa...
ex_shm.close()  # close after using

# IN THE MAIN PROCESS
shm.close()  # close after using
shm.unlink()  # free memory

In the above code, the a and b arrays use the same underlying memory and can update the same memory directly, which can be very useful in machine learning. However, you should beware of concurrent-update problems and decide how to handle them, either by using a Lock, by partitioning accesses, or by accepting stochastic updates (the so-called HogWild style).
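For instance, a minimal sketch of the Lock option (the worker, sizes, and iteration counts are illustrative assumptions):

import numpy as np
from multiprocessing import Process, Lock, shared_memory

def worker(shm_name, shape, dtype, lock, n_iters):
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    for _ in range(n_iters):
        with lock:       # serialize the read-modify-write cycle
            arr += 1     # no other process can interleave here
    shm.close()

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=np.dtype(np.int64).itemsize * 10)
    a = np.ndarray((10,), dtype=np.int64, buffer=shm.buf)
    a[:] = 0
    lock = Lock()
    procs = [Process(target=worker, args=(shm.name, (10,), np.int64, lock, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(a[0])   # 4000 with the lock; without it, updates could be lost
    shm.close()
    shm.unlink()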

Use files!

No, really, use files -- they are efficient (the OS will cache the content), and they allow you to work on much larger problems (the data set doesn't have to fit into RAM).

Use any of the routines in https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.io.html to dump/load numpy arrays to/from files, and pass only the file names between the processes.
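A minimal sketch of this pattern (using np.save/np.load, with illustrative file names and a stand-in worker):

import numpy as np
from multiprocessing import Pool

def worker(path):
    arr = np.load(path)           # the OS page cache makes repeated reads cheap
    result = arr * 2              # stand-in for the real per-chunk work
    out_path = path + '.out.npy'
    np.save(out_path, result)
    return out_path               # only the tiny file name crosses processes

if __name__ == '__main__':
    # dump each input chunk to its own file
    paths = []
    for i, chunk in enumerate(np.array_split(np.arange(1000), 4)):
        path = f'chunk_{i}.npy'
        np.save(path, chunk)
        paths.append(path)

    with Pool(4) as pool:
        out_paths = pool.map(worker, paths)      # file names, not arrays, get pickled
    merged = sum(np.load(p) for p in out_paths)  # combine the partial results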

P.S. Benchmark the serialisation methods: depending on the intermediate array size, the fastest could be "raw" (no conversion overhead), "compressed" (if the file ends up being written to disk), or something else. IIRC, loading "raw" files may require knowing the data format (dimensions, sizes) in advance.

Check out the ray project, a distributed execution framework that makes use of Apache Arrow for serialization. It's especially great if you're working with numpy arrays, and hence is a great tool for ML workflows.

Here's a snippet from the docs on object serialization

In Ray, we optimize for numpy arrays by using the Apache Arrow data format. When we deserialize a list of numpy arrays from the object store, we still create a Python list of numpy array objects. However, rather than copy each numpy array, each numpy array object holds a pointer to the relevant array held in shared memory. There are some advantages to this form of serialization.

  • Deserialization can be very fast.
  • Memory is shared between processes so worker processes can all read the same data without having to copy it.

In my opinion it's even easier to use than the multiprocessing library for parallel execution, especially when looking to use shared memory; there's an intro to usage in the tutorial.
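A minimal sketch of that style of usage (the array and the task body are illustrative):

import numpy as np
import ray

ray.init()

@ray.remote
def partial_sum(arr):     # the array arrives zero-copy from the object store
    return arr.sum()

data = np.arange(10_000_000)
data_ref = ray.put(data)  # stored once in shared memory

# every task reads the same buffer; only the small sums are copied back
futures = [partial_sum.remote(data_ref) for _ in range(4)]
print(ray.get(futures))

ray.shutdown()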

You should use a Manager proxy object for shared editable objects: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-managers The access lock would be handled by that Manager proxy object.

In the Customized managers section there is an example that should suit you:

from multiprocessing.managers import BaseManager

class MathsClass:
    def add(self, x, y):
        return x + y
    def mul(self, x, y):
        return x * y

class MyManager(BaseManager):
    pass

MyManager.register('Maths', MathsClass)

if __name__ == '__main__':
    with MyManager() as manager:
        maths = manager.Maths()
        print(maths.add(4, 3))         # prints 7
        print(maths.mul(7, 8))         # prints 56

After that you have to connect from different processes (as shown in Using a remote manager) to that manager and edit the shared object as you wish.
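A hedged sketch of that remote setup, reusing MathsClass from above; the two halves run as separate scripts, and the address and authkey are placeholders:

from multiprocessing.managers import BaseManager

class MyManager(BaseManager):
    pass

# --- server script ---
# register the class and serve it on a TCP address
MyManager.register('Maths', MathsClass)
manager = MyManager(address=('', 50000), authkey=b'secret')
server = manager.get_server()
server.serve_forever()            # blocks; clients connect from here on

# --- client script (run separately) ---
# register the name only, then connect to the server
MyManager.register('Maths')
manager = MyManager(address=('localhost', 50000), authkey=b'secret')
manager.connect()
maths = manager.Maths()   # proxy to the object living in the server process
print(maths.add(4, 3))    # the call executes in the server process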

Use multithreading.

Try to use multithreading instead of multiprocessing, as threads can share the memory of the main process natively.

If you worry about Python's GIL mechanism, you can resort to the nogil option of numba.
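A minimal sketch, assuming numba is installed (the function body and sizes are illustrative):

import numpy as np
from numba import njit
from concurrent.futures import ThreadPoolExecutor

@njit(nogil=True)        # compiled code releases the GIL while it runs
def partial_work(arr):
    total = 0.0
    for x in arr:        # tight numeric loop, free of the GIL
        total += x * x
    return total

data = np.random.rand(10_000_000)
chunks = np.array_split(data, 4)   # threads share `data` in-process: no pickling

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(partial_work, chunks))
print(sum(results))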
