Avoiding global variables for unpicklable shared state among multiprocessing.Pool workers

Question

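I frequently find myself writing Python programs that construct a large (megabytes) read-only data structure and then use it to analyze a very large (hundreds of megabytes in total) list of small records. Each record can be analyzed in parallel, so a natural pattern is to set up the read-only data structure and assign it to a global variable, then create a multiprocessing.Pool (which implicitly copies the data structure into each worker process, via fork), and then use imap_unordered to process the records in parallel. The skeleton of this pattern tends to look like this: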

import csv
import multiprocessing

classifier = None
def classify_row(row):
    return classifier.classify(row)

def classify(classifier_spec, data_file):
    global classifier
    try:
        classifier = Classifier(classifier_spec)
        with open(data_file, "rt") as fp, \
             multiprocessing.Pool() as pool:
            rd = csv.DictReader(fp)
            yield from pool.imap_unordered(classify_row, rd)
    finally:
        classifier = None

I'm not happy with this because of the global variable and the implicit coupling between classify and classify_row. Ideally, I would like to write

def classify(classifier_spec, data_file):
    classifier = Classifier(classifier_spec)
    with open(data_file, "rt") as fp, \
         multiprocessing.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)

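But this does not work, because the Classifier object usually contains objects that cannot be pickled (they are defined by extension modules whose authors didn't care about pickling). I have also read that it would be very slow even if it did work, because the Classifier object would be copied into the worker processes on every invocation of the bound method.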
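To make the failure concrete, here is a minimal sketch. It assumes a stand-in Classifier whose threading.Lock attribute plays the role of the unpicklable extension object; the class and the try_pickle helper are illustrative names, not part of any library. Pickling a bound method pickles the instance it is bound to, which is why handing classifier.classify to the pool fails.

```python
import pickle
import threading


class Classifier:
    """Stand-in for a classifier that holds an unpicklable member."""

    def __init__(self, spec):
        self.spec = spec
        self.lock = threading.Lock()  # assumption: mimics an extension object

    def classify(self, row):
        return (self.spec, row)


def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    bound = Classifier("spec1").classify
    # The bound method drags the whole instance (and its lock) into the pickle.
    print("bound method picklable:", try_pickle(bound))
```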

Is there a better alternative? I only care about 3.x.

Answer 1

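This was surprisingly tricky. The key is to preserve read access to variables that are available at fork time, without serializing them. Most solutions for sharing memory in multiprocessing end up serializing. I tried using weakref.proxy to pass in a classifier without serialization, but that didn't work because both dill and pickle try to follow and serialize the referent. A module reference, however, works.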
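Everything here depends on the fork start method, which is not the default on every platform (macOS, for example, has defaulted to spawn since Python 3.8). A small sketch of how one might make that dependency explicit; fork_context is a hypothetical helper name, while get_all_start_methods and get_context are the standard multiprocessing calls.

```python
import multiprocessing as mp


def fork_context():
    """Return a multiprocessing context that is guaranteed to fork.

    Raises RuntimeError on platforms where fork is unavailable, so the
    program fails early instead of silently pickling state under spawn.
    """
    if "fork" not in mp.get_all_start_methods():
        raise RuntimeError("this pattern relies on fork-time inheritance")
    return mp.get_context("fork")
```

A pool created with fork_context().Pool() then inherits whatever module-level state exists at the moment the pool is created.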

This organization gets us close:

import multiprocessing as mp
import csv


def classify(classifier, data_file):

    with open(data_file, "rt") as fp, mp.Pool() as pool:
        rd = csv.DictReader(fp)
        yield from pool.imap_unordered(classifier.classify, rd)


def orchestrate(classifier_spec, data_file):
    # construct a classifier from the spec; note that we can
    # even dynamically import modules here, using config values
    # from the spec
    import classifier_module
    classifier_module.init(classifier_spec)
    return classify(classifier_module, data_file)


if __name__ == '__main__':
    list(orchestrate(None, 'data.txt'))

A few changes to note here:

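We add an orchestrate function for some dependency-injection goodness; orchestrate decides how to construct and initialize a classifier and hands it to classify, decoupling the two.
classify only needs to assume that the classifier parameter has a classify method; it doesn't care whether it is an instance or a module.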

For this proof of concept, we provide a classifier that is obviously not serializable:

# classifier_module.py
def _create_classifier(spec):

    # obviously not pickle-able because it's inside a function
    class Classifier():

        def __init__(self, spec):
            pass

        def classify(self, x):
            print(x)
            return x

    return Classifier(spec)


def init(spec):
    global __classifier
    __classifier = _create_classifier(spec)


def classify(x):
    return __classifier.classify(x)

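Unfortunately, there is still a global here, but it is now nicely encapsulated inside a module as a private variable, and the module exports a tight interface consisting of the classify and init functions.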
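The reason the module approach gets past pickle is that a module-level function is pickled by reference (module name plus function name), so the module's private state is never serialized. A minimal sketch using json.dumps as a stand-in for classifier_module.classify; pickled_reference is an illustrative helper name.

```python
import json
import pickle


def pickled_reference(func):
    """Pickle a module-level function; only its import path is stored."""
    return pickle.dumps(func)


if __name__ == "__main__":
    payload = pickled_reference(json.dumps)
    # Unpickling looks the function up again by name in the worker,
    # so the worker uses its own (fork-inherited) copy of the module.
    print(pickle.loads(payload) is json.dumps)
```

Note that the module object itself cannot be pickled; it is the function reference that travels to the worker.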

This design unlocks some possibilities:

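orchestrate can import and initialize different classifier modules, based on what it sees in classifier_spec.
One could also pass an instance of some Classifier class to classify, as long as that instance is serializable and has a classify method with the same signature.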
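The first possibility could be sketched as follows. This assumes classifier_spec is a dict with a "module" key naming the module to load; that key and the load_classifier_module helper are hypothetical, chosen only for illustration.

```python
import importlib


def load_classifier_module(classifier_spec):
    """Import and initialize the classifier module named in the spec.

    Assumption: the spec is a dict with a "module" entry, and the module
    exposes callable init(spec) and classify(row), as in the answer above.
    """
    module = importlib.import_module(classifier_spec["module"])
    for required in ("init", "classify"):
        if not callable(getattr(module, required, None)):
            raise TypeError(
                f"{module.__name__} lacks a callable {required}()"
            )
    module.init(classifier_spec)
    return module
```

orchestrate could then return classify(load_classifier_module(classifier_spec), data_file) instead of hard-coding the import.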

Answer 2

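If you want to use forking, I don't see a way around using a global. But I also don't see why you should feel bad about using a global in this case; you're not manipulating a global list from multiple threads or anything like that.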

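It is possible to cope with the ugliness in your example, though. You want to pass classifier.classify directly, but the Classifier object contains objects that cannot be pickled.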

import os
import csv
import uuid
from threading import Lock
from multiprocessing import Pool
from weakref import WeakValueDictionary

class Classifier:

    def __init__(self, spec):
        self.lock = Lock()  # unpickleable
        self.spec = spec

    def classify(self, row):
        return f'classified by pid: {os.getpid()} with spec: {self.spec}', row

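I suggest subclassing Classifier and defining __getstate__ and __setstate__ to enable pickling. Since you're using forking anyway, the only state that has to be pickled is the information needed to get a reference to the forked global instance. We then update the unpickled object's __dict__ with the __dict__ of the forked instance (which has not gone through the reduction of pickling), and your instance is complete again.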
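Stripped of the naming and cleanup details, the mechanism looks roughly like this. It is a simplified sketch, not the subclass from this answer: _INSTANCES is a plain dict standing in for the global registration, and RegisteredClassifier is an illustrative name.

```python
import pickle
import threading

# Assumption: a module-level registry that forked workers inherit as-is.
_INSTANCES = {}


class RegisteredClassifier:
    def __init__(self, key, spec):
        self.key = key
        self.spec = spec
        self.lock = threading.Lock()  # unpicklable member
        _INSTANCES[key] = self

    def __getstate__(self):
        # Only the lookup key is pickled, never the lock or the spec.
        return {"key": self.key}

    def __setstate__(self, state):
        # Rebuild the full state from the instance already in this process.
        self.__dict__.update(_INSTANCES[state["key"]].__dict__)


if __name__ == "__main__":
    original = RegisteredClassifier("demo", "spec1")
    clone = pickle.loads(pickle.dumps(original))
    print(clone.spec, clone.lock is original.lock)
```

In a forked worker the registry lookup finds the inherited copy of the instance, so nothing large or unpicklable ever crosses the pipe.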

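To achieve this without additional boilerplate, the subclassed Classifier instance has to generate a name for itself and register it as a global variable. This first reference is a weak reference, so the instance can be garbage collected when the user expects it to be. The second reference is created by the user when assigning classifier = Classifier(classifier_spec). This one doesn't have to be global.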
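The effect of registering through a weak reference can be seen in isolation. In this minimal sketch, registry, register and is_registered are illustrative names, and Classifier is an empty placeholder class.

```python
import gc
import weakref


class Classifier:
    """Placeholder; any ordinary class instance can be weakly referenced."""


# Entries vanish automatically once the last strong reference is gone.
registry = weakref.WeakValueDictionary()


def register(name, instance):
    registry[name] = instance


def is_registered(name):
    return name in registry


if __name__ == "__main__":
    classifier = Classifier()
    register("uuid_demo", classifier)
    print(is_registered("uuid_demo"))
    del classifier  # drop the user's strong reference
    gc.collect()
    print(is_registered("uuid_demo"))
```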

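The name in the example below is generated with the help of the standard library's uuid module. A uuid is converted to a string and edited into a valid identifier (it wouldn't have to be one, but it's convenient for debugging in interactive mode).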

class SubClassifier(Classifier):

    def __init__(self, spec):
        super().__init__(spec)
        self.uuid = self._generate_uuid_string()
        self.pid = os.getpid()
        self._register_global()

    def __getstate__(self):
        """Define pickled content."""
        return {'uuid': self.uuid}

    def __setstate__(self, state):
        """Set state in child process."""
        self.__dict__ = state
        self.__dict__.update(self._get_instance().__dict__)

    def _get_instance(self):
        """Get reference to instance."""
        return globals()[self.uuid][self.uuid]

    @staticmethod
    def _generate_uuid_string():
        """Generate id as valid identifier."""
        # return 'uuid_' + '123' # for testing
        return 'uuid_' + str(uuid.uuid4()).replace('-', '_')

    def _register_global(self):
        """Register global reference to instance."""
        weakd = WeakValueDictionary({self.uuid: self})
        globals().update({self.uuid: weakd})

    def __del__(self):
        """Clean up globals when deleted in parent."""
        if os.getpid() == self.pid:
            globals().pop(self.uuid)

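The sweet thing here is that the boilerplate is gone entirely. You don't have to declare and delete globals by hand, since the instance manages everything itself in the background: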

def classify(classifier_spec, data_file, n_workers):
    classifier = SubClassifier(classifier_spec)
    # assert globals()['uuid_123']['uuid_123'] # for testing
    with open(data_file, "rt") as fh, Pool(n_workers) as pool:
        rd = csv.DictReader(fh)
        yield from pool.imap_unordered(classifier.classify, rd)


if __name__ == '__main__':

    PATHFILE = 'data.csv'
    N_WORKERS = 4

    g = classify(classifier_spec='spec1', data_file=PATHFILE, n_workers=N_WORKERS)
    for record in g:
        print(record)

    # assert 'uuid_123' not in globals() # no reference left

Answer 3

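The multiprocessing.sharedctypes module provides functions for allocating ctypes objects from shared memory that can be inherited by child processes; that is, parent and children can access the same shared memory.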

You could use:
1. multiprocessing.sharedctypes.RawArray to allocate a ctypes array from shared memory.
2. multiprocessing.sharedctypes.RawValue to allocate a ctypes object from shared memory.
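A minimal sketch of both calls, assuming the read-only data can be expressed as C-typed values (these objects hold ctypes data, not arbitrary Python objects). The names weights, threshold and weighted_sum are illustrative only.

```python
from multiprocessing.sharedctypes import RawArray, RawValue

# Allocated in shared memory; child processes that inherit these objects
# read the same memory instead of receiving a pickled copy.
weights = RawArray("d", [0.25, 0.5, 0.25])  # array of C doubles
threshold = RawValue("i", 7)                # a single C int


def weighted_sum(values):
    """Combine a record's values with the shared weights."""
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    print(list(weights), threshold.value)
    print(weighted_sum([4, 2, 4]))
```

The Raw variants carry no lock, which suits data that workers only read.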

Dr. Mianzhi Wang has written a very detailed document on this. You can share multiple multiprocessing.sharedctypes objects.

You may find the solution here useful.

Answer 1
5 2018-10-07 04:02:17

Answer 2
2 2018-10-10 07:20:32

Answer 3
-1 2018-10-07 14:46:54
