
Dask delayed function call with non-passed parameters

I am trying to better understand the following behavior when using dask.delayed to call a function that depends on parameters. The issue seems to arise when the parameters are specified in a parameter file read by configparser. Here is a complete example:

Parameter file:

#zpar.ini: parameter file for configparser

[my pars]
my_zpar = 2.

Parser:

#zippy_parser
import configparser

def read(_rundir):

    global rundir
    rundir = _rundir

    cp = configparser.ConfigParser()
    cp.read(rundir + '/zpar.ini')

    #[my pars]
    global my_zpar
    my_zpar = cp['my pars'].getfloat('my_zpar')

And the main Python file:

# dask test with configparser
import dask
from dask.distributed import Client
import zippy_parser as zpar


def my_func(x, y):

    # print stuff
    print("parameter from main is: {}".format(main_par))
    print("parameter from configparser is: {}".format(zpar.my_zpar))

    # do stuff
    return x + y


if __name__ == '__main__':

    client = Client(n_workers = 4)

    #read parameters from input file
    rundir = '/path/to/parameter/file'
    zpar.read(rundir)

    #test zpar
    print("zpar is {}".format(zpar.my_zpar))

    #define parameter and call my_func
    main_par = 5.
    z = dask.delayed(my_func)(1., 2.)
    z.compute()

    client.close()

The first print statement in my_func() executes just fine, but the second print statement raises an exception. The output is:

zpar is 2.0
parameter from main is: 5.0
distributed.worker - WARNING - Compute Failed
Function: my_func
args: (1.0, 2.0)
kwargs: {}
Exception: AttributeError("module 'zippy_parser' has no attribute 'my_zpar'",)

I am new to dask. I suppose this has something to do with serialization, which I do not understand. Can someone enlighten me and/or point me to relevant documentation? Thanks!

I will try to keep this brief.

When a function is serialised in order to be sent to the workers, Python also sends the variables and functions that the function needs (its "closure"), which is why main_par is available inside my_func. However, modules that the function references are stored by name only; Python does not try to serialise your whole runtime. This means that zippy_parser is imported afresh in each worker, not deserialised. Since the function read has never been called in the worker, the global variable my_zpar is never initialised there.
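To see the "imported, not deserialised" effect without dask, here is a minimal sketch using only the standard library. It is not what dask does internally; a fresh interpreter started with subprocess merely stands in for a worker, and the module name zippy_demo and the helper check_global_visibility are made up for the illustration.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for zippy_parser: read() sets a module-level global.
MODULE_SOURCE = textwrap.dedent("""
    def read(value):
        global my_zpar
        my_zpar = value
""")


def check_global_visibility():
    """Return (seen_in_parent, seen_in_child) for the attribute my_zpar."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "zippy_demo.py"), "w") as fh:
            fh.write(MODULE_SOURCE)

        # Parent process: import the module and call read(), as the main script does.
        sys.path.insert(0, tmp)
        try:
            import zippy_demo
            zippy_demo.read(2.0)
            seen_in_parent = hasattr(zippy_demo, "my_zpar")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("zippy_demo", None)

        # Fresh interpreter (playing the role of a worker): it imports the
        # module by name, but read() has never been called in that process.
        env = dict(os.environ, PYTHONPATH=tmp)
        result = subprocess.run(
            [sys.executable, "-c",
             "import zippy_demo; print(hasattr(zippy_demo, 'my_zpar'))"],
            env=env, capture_output=True, text=True, check=True,
        )
        seen_in_child = result.stdout.strip() == "True"
    return seen_in_parent, seen_in_child


print(check_global_visibility())  # prints (True, False)
```

The parent sees my_zpar because it called read(); the second process only ran the module's top-level code, so the attribute does not exist there, which matches the AttributeError in the question.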

So, you could call read in the workers, as part of your function or otherwise, but the pattern of setting module-global variables from within a function probably isn't great. Dask's delayed mechanism prefers functional purity: the result you get should not depend on the current state of the runtime.

(Note that if you had created the client after calling read in the main script, the workers might have got the in-memory version, depending on how subprocesses are configured to be created on your system.)

I encourage you to pass all parameters to your dask delayed functions explicitly, rather than relying on the global namespace.
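One way this could look, as a rough sketch rather than the only option: the parser returns a plain dict instead of setting globals, and my_func receives every value as an argument. The names load_params and main, and the extra my_zpar / main_par parameters, are my own additions and not part of the original code.

```python
import configparser


def load_params(path):
    """Read the [my pars] section and return plain values (no globals)."""
    cp = configparser.ConfigParser()
    cp.read(path)
    return {"my_zpar": cp["my pars"].getfloat("my_zpar")}


def my_func(x, y, my_zpar, main_par):
    # Everything the function needs arrives through its arguments,
    # so it behaves the same in the main process and in a worker.
    print("parameter from main is: {}".format(main_par))
    print("parameter from configparser is: {}".format(my_zpar))
    return x + y


def main(ini_path):
    # Imported here so the helpers above stay usable without dask installed.
    import dask
    from dask.distributed import Client

    client = Client(n_workers=4)
    try:
        params = load_params(ini_path)
        # The parameter values are serialised together with the task.
        z = dask.delayed(my_func)(1., 2., my_zpar=params["my_zpar"], main_par=5.)
        return z.compute()
    finally:
        client.close()
```

As in the question, main should be called from under an if __name__ == '__main__': guard, with the path to zpar.ini as its argument. Since the values travel with the task, it no longer matters whether read was ever called in the worker process.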
