Python Multiprocessing with shared data source and multiple class instances

My program needs to spawn multiple instances of a class, each processing data that is coming from a streaming data source.

For example:

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class DoStuff:
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

# Here's how this would work with no multiprocessing
instance_1 = DoStuff(parameters[0])
instance_1.run()

Once the instances are running they don't need to interact with each other; they just have to get the data as it comes in (and print error messages, etc.).

I am totally at a loss as to how to make this work with multiprocessing, since I first have to create a new instance of the class DoStuff and then have it run.

This is definitely not the way to do it:

# Let's try multiprocessing
import multiprocessing

for parameter in parameters:
    # target = DoStuff only constructs the instance and never calls run(),
    # and args = (parameter) is missing the comma that would make it a tuple.
    processes = [ multiprocessing.Process(target = DoStuff, args = (parameter)) ]

# Hmm, this doesn't work...

We could try defining a function to spawn classes, but that seems ugly:

import multiprocessing

def spawn_classes(parameter):
    instance = DoStuff(parameter)
    instance.run()

for parameter in parameters:
    processes = [ multiprocessing.Process(target = spawn_classes, args = (parameter,)) ]

# Can't tell if it works -- no output on screen?
# (The list is rebuilt on each iteration and the processes are never started.)

Plus, I don't want to have 3 different copies of the API interface class running; I want that data to be shared between all the processes... and as far as I can tell, multiprocessing creates copies of everything for each new process.
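
A quick way to see that copying behavior (a minimal sketch with illustrative names, not code from the question): each child process increments a module-level counter, but the parent never sees the change.

import multiprocessing

counter = 0  # module-level state: each child gets its own copy

def mutate_and_report():
    global counter
    counter += 1
    print('child sees counter =', counter)  # every child prints 1

if __name__ == '__main__':
    procs = [multiprocessing.Process(target=mutate_and_report) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('parent still sees counter =', counter)  # prints 0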

Ideas?

Edit: I think I may have got it... is there anything wrong with this?

import multiprocessing

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class Worker(multiprocessing.Process):
    def __init__(self, parameter):
        super(Worker, self).__init__()
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data # Some CPU intensive task
        print(output)

streaming_api = FakeStreamingApi()

if __name__ == '__main__':
    jobs = []
    for parameter in parameters:
        p = Worker(parameter)
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()

I came to the conclusion that it would be necessary to use multiprocessing.Queues to solve this. The data source (the streaming API) needs to pass copies of the data to all the different processes, so they can consume it.
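
Here is a minimal sketch of that fan-out (the fixed list of items stands in for the streaming source, and the names are illustrative): the producer puts a copy of every item on one queue per worker, and each worker consumes from its own queue until it sees a None sentinel.

import multiprocessing

parameters = [1, 2, 3]

def worker(parameter, queue):
    # Consume items until the producer sends the None sentinel.
    while True:
        data = queue.get()
        if data is None:
            break
        print(parameter ** 2 + data)  # the CPU-intensive task

if __name__ == '__main__':
    # One queue per worker, so every worker gets its own copy of each item.
    queues = [multiprocessing.Queue() for _ in parameters]
    jobs = [multiprocessing.Process(target=worker, args=(param, q))
            for param, q in zip(parameters, queues)]
    for j in jobs:
        j.start()

    # Stand-in for the streaming API: fan each item out to all the queues.
    for item in (42, 43, 44):
        for q in queues:
            q.put(item)
    for q in queues:
        q.put(None)  # signal end of stream

    for j in jobs:
        j.join()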

There's another way to solve this using multiprocessing.Manager to create a shared dict, but I didn't explore it further, as it looks fairly inefficient and cannot propagate changes to inner values (e.g. if you have a dict of lists, changes to the inner lists will not propagate).
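
For reference, a short sketch of that caveat (illustrative names): appending to a list stored inside a Manager dict mutates a local copy in the child, so the parent never sees the change; only reassigning the key propagates.

import multiprocessing

def append_in_child(shared):
    # This mutates a local copy of the list fetched through the proxy;
    # the change is NOT written back to the manager process.
    shared['values'].append(99)
    # Reassigning the key would propagate:
    # shared['values'] = shared['values'] + [99]

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict()
    shared['values'] = [1, 2, 3]

    p = multiprocessing.Process(target=append_in_child, args=(shared,))
    p.start()
    p.join()

    print(shared['values'])  # still [1, 2, 3]: the in-place append was lost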
