
How to emulate multiprocessing.Pool.map() in AWS Lambda?

Python on AWS Lambda does not support multiprocessing.Pool.map(), as documented in this other question. Please note that the other question was asking why it doesn't work. This question is different: I'm asking how to emulate the functionality given the lack of underlying support.

One of the answers to that other question gave us this code:

# Python 3.6
import multiprocessing
import multiprocessing.connection
from multiprocessing import Pipe, Process

def myWorkFunc(data, connection):
    result = None

    # Do some work and store it in result

    if result:
        connection.send([result])
    else:
        connection.send([None])


def myPipedMultiProcessFunc(iterable):

    # Get number of available logical cores
    plimit = multiprocessing.cpu_count()

    # Setup management variables
    results = []
    parent_conns = []
    pcount = 0

    for data in iterable:
        # Create the pipe for parent-child process communication
        parent_conn, child_conn = Pipe()
        # create the process, pass data to be operated on and connection
        process = Process(target=myWorkFunc, args=(data, child_conn,))
        parent_conns.append(parent_conn)
        process.start()
        pcount += 1

        if pcount == plimit: # There is not currently room for another process
            # Wait until there are results in the Pipes
            finishedConns = multiprocessing.connection.wait(parent_conns)
            # Collect the results and remove the connection as processing
            # the connection again will lead to errors
            for conn in finishedConns:
                results.append(conn.recv()[0])
                parent_conns.remove(conn)
                # Decrement pcount so we can add a new process
                pcount -= 1

    # Ensure all remaining active processes have their results collected
    for conn in parent_conns:
        results.append(conn.recv()[0])
        conn.close()

    # Process results as needed
    return results

Can this sample code be modified to support multiprocessing.Pool.map()?

What I have tried so far

I analysed the above code and I do not see a parameter for the function to be executed or for the data, so I'm inferring that it does not perform the same function as multiprocessing.Pool.map(). It is not clear what the code does, other than demonstrating the building blocks that could be assembled into a solution.
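For concreteness, my guess is that those building blocks would be assembled into something map()-like roughly as follows. This is an untested sketch of mine, not code from that answer; pipe_map, _worker and square are names I made up, and it forks one process per item with no upper limit on parallelism:

```python
from multiprocessing import Pipe, Process

def square(x):
    # Trivial stand-in for real per-item work
    return x * x

def _worker(func, item, conn):
    # Run func on one item and send the result back through the pipe
    conn.send(func(item))
    conn.close()

def pipe_map(func, iterable):
    # Fork one process per item; results come back through Pipes
    conns = []
    procs = []
    for item in iterable:
        parent_conn, child_conn = Pipe()
        p = Process(target=_worker, args=(func, item, child_conn))
        p.start()
        procs.append(p)
        conns.append(parent_conn)
    # Reading the parent ends in submission order preserves input order,
    # the same guarantee Pool.map() gives
    results = [conn.recv() for conn in conns]
    for p in procs:
        p.join()
    return results
```

If that is roughly right, then presumably what remains is bounding the number of concurrent processes.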

Is this a "write my code for me" question?

Yes, to some extent it is. This issue impacts thousands of Python developers, and it would be far more efficient (for the world economy, greenhouse-gas emissions, etc.) if we all shared the same code, instead of forcing every SO user who encounters this to develop their own workaround. I hope I've done my part by distilling this into a clear question, with the presumed building blocks ready to go.

I was able to get this working for my own tests. I've based my code on this link: https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/

NB1: you MUST increase the memory allocation of the Lambda function. With the default minimal amount, there is no performance gain from multiprocessing. With the maximum my account could allocate (3008 MB), the figures below were attained.

NB2: I'm completely ignoring the maximum number of processes to run in parallel here. My usage doesn't have a whole lot of elements to work on.
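If you do need to cap parallelism, a bounded variant could look roughly like this. This is a sketch, not part of my solution below; bounded_map, _capped_worker and times_two are invented names, and note that results come back in completion order, not input order:

```python
import multiprocessing
import multiprocessing.connection
from multiprocessing import Pipe, Process

def times_two(x):
    # Trivial stand-in for real per-item work
    return x * 2

def _capped_worker(func, item, conn):
    conn.send(func(item))
    conn.close()

def bounded_map(func, iterable, limit=None):
    # Never run more than `limit` worker processes at once
    limit = limit or multiprocessing.cpu_count()
    pending = []   # parent ends of pipes for still-running workers
    results = []   # NOTE: completion order, not input order
    for item in iterable:
        if len(pending) == limit:
            # Block until at least one running worker has sent a result
            for conn in multiprocessing.connection.wait(pending):
                results.append(conn.recv())
                conn.close()
                pending.remove(conn)
        parent_conn, child_conn = Pipe()
        Process(target=_capped_worker, args=(func, item, child_conn)).start()
        pending.append(parent_conn)
    # Drain whatever is still in flight
    for conn in pending:
        results.append(conn.recv())
        conn.close()
    return results
```

multiprocessing.connection.wait() is what lets the parent sleep until any child reports back, instead of polling.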

With the code below, usage is:

work = funcmap(yourfunction,listofstufftoworkon)
yourresults = work.run()

Running from my laptop:

jumper@jumperdebian[3333] ~/scripts/tmp  2019-09-04 11:52:30
└─ $ ∙ python3 -c "import tst; tst.lambda_handler(None,None)"
results : [(35, 9227465), (35, 9227465), (35, 9227465), (35, 9227465)]
SP runtime : 9.574460506439209
results : [(35, 9227465), (35, 9227465), (35, 9227465), (35, 9227465)]
MP runtime : 6.422513484954834

Running from AWS:

Function Logs:
START RequestId: 075a92c0-7c4f-4f48-9820-f394ee899a97 Version: $LATEST
results : [(35, 9227465), (35, 9227465), (35, 9227465), (35, 9227465)]
SP runtime : 12.135798215866089
results : [(35, 9227465), (35, 9227465), (35, 9227465), (35, 9227465)]
MP runtime : 7.293526887893677
END RequestId: 075a92c0-7c4f-4f48-9820-f394ee899a97

Here's the test code:

import time
from multiprocessing import Process, Pipe

class funcmap(object):

    def __init__(self, pfunction, plist):
        self.fmfunction = pfunction
        self.fmlist = plist

    def calculation(self, pfunction, pload, conn):
        panswer=pfunction(pload)
        conn.send([pload,panswer])
        conn.close()

    def run(self):
        datalist = self.fmlist
        processes = []
        parent_connections = []
        for datum in datalist:
            parent_conn, child_conn = Pipe()
            parent_connections.append(parent_conn)
            process = Process(target=self.calculation, args=(self.fmfunction, datum, child_conn,))
            processes.append(process)

        pstart=time.time()
        for process in processes:
            process.start()
            #print("starting at t+ {} s".format(time.time()-pstart))
        # Receive results before joining: recv() drains each pipe, so a child
        # blocked on a full pipe buffer can exit (joining first can deadlock
        # when workers send large payloads)
        results = []
        for parent_connection in parent_connections:
            resp = parent_connection.recv()
            results.append((resp[0], resp[1]))

        for process in processes:
            process.join()
            #print("joining at t+ {} s".format(time.time()-pstart))
        return results


def fibo(n):
    if n <= 2 : return 1
    return fibo(n-1)+fibo(n-2)

def lambda_handler(event, context):
    #worklist=[22,23,24,25,26,27,28,29,30,31,32,31,30,29,28,27,26,27,28,29]
    #worklist=[22,23,24,25,26,27,28,29,30]
    worklist=[30,30,30,30]
    #worklist=[30]
    _start = time.time()
    results=[]
    for a in worklist:
        results.append((a,fibo(a)))
    print("results : {}".format(results))
    _end = time.time()
    print("SP runtime : {}".format(_end-_start))

    _mstart = time.time()
    work = funcmap(fibo,worklist)
    results = work.run()
    print("results : {}".format(results))
    _mend = time.time()
    print("MP runtime : {}".format(_mend-_mstart))

Hope it helps.

I had the same issue, and ended up implementing my own simple wrapper around multiprocessing.Pool. Definitely not bulletproof, but enough for simple use cases as a drop-in replacement.

https://stackoverflow.com/a/63633248/158049
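For reference, a minimal version of such a drop-in wrapper might look like this. This is my own rough sketch, not the code from the linked answer; the names PipePool and cube are invented, and the `processes` argument is accepted only for API compatibility:

```python
from multiprocessing import Pipe, Process

def cube(x):
    # Trivial stand-in for real per-item work
    return x ** 3

class PipePool(object):
    """Minimal Pool-like stand-in built on Process + Pipe."""

    def __init__(self, processes=None):
        # Accepted to match the Pool() signature, but ignored in this sketch
        self._processes = processes

    @staticmethod
    def _call(func, item, conn):
        conn.send(func(item))
        conn.close()

    def map(self, func, iterable):
        conns, procs = [], []
        for item in iterable:
            parent_conn, child_conn = Pipe()
            p = Process(target=self._call, args=(func, item, child_conn))
            p.start()
            procs.append(p)
            conns.append(parent_conn)
        # recv() before join() so children blocked on full pipes can exit
        results = [conn.recv() for conn in conns]
        for p in procs:
            p.join()
        return results

    # Support `with PipePool() as pool:` like multiprocessing.Pool does
    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False
```

Usage mirrors the real Pool: `with PipePool() as pool: results = pool.map(cube, worklist)`.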
