
Workaround for using __name__=='__main__' in Python multiprocessing

As we all know, we need to protect the main() entry point with if __name__ == '__main__' when running code that uses multiprocessing in Python.

I understand that this is necessary in some cases to give access to functions defined in the main module, but I do not understand why it is necessary in this case:

file2.py

import numpy as np
from multiprocessing import Pool
class Something(object):
    def get_image(self):
        return np.random.rand(64,64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()

main.py

from file2 import Something
s = Something()
s.mp()

All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?

I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure that whoever uses it protects their main module. Isn't there a workaround for Windows? How do packages solve this? I never encountered a problem with any package when not protecting my main, so are they just not using multiprocessing?

Edit: I know that this is because fork() is not implemented on Windows. I was just asking whether there is a hack to let the interpreter start at file2.py instead of main.py, since I can be sure that file2.py is self-sufficient.

When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.

If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.

That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits: documentation tools will be able to process your main module, unit testing is easier, and so on, so you should use it in any case.
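For the code in the question, a minimal sketch of the guarded entry point might look like this (file2.py unchanged; only main.py gains the guard):

main.py

from file2 import Something

if __name__ == '__main__':
    # Only the original process runs this block; the interpreters that
    # "spawn" starts re-import main.py with a different __name__ and skip it.
    s = Something()
    s.mp()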

The if __name__ == '__main__' guard is needed on Windows since Windows doesn't have a "fork" option for processes.

On Linux, for example, you can fork the process, so the parent process is copied and the copy becomes the child process (and it has access to the already imported code you had loaded in the parent process).

Since you can't fork on Windows, Python simply imports, in the child process, all the code that was imported by the parent process. This creates a similar effect, but if you don't use the __name__ trick, this import will execute your code again in the child process (which will make it create its own child, and so on).

So even in your example, main.py will be imported again (since all the files are imported again). Python can't guess which specific Python script the child process should import.

FYI, there are other limitations you should be aware of, like the use of globals; you can read about them here: https://docs.python.org/2/library/multiprocessing.html#windows
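To make the choice of start method explicit in code, here is a minimal sketch, assuming Python 3.4+ where multiprocessing.get_context is available; the "fork" method only exists on POSIX systems and is not part of the original answer:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    ctx = mp.get_context('fork')   # raises ValueError on Windows, where "fork" is unavailable
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3, 4]))

Because the children are forked copies of the parent, they inherit the already-imported modules and never re-import the main script.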

The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side effect in main (e.g. imports, configuration calls with persistent side effects, etc.) might not be properly performed in the child processes.

As such, if they're not protecting their __main__, the code is not multiprocessing-safe (nor is it unittest-safe, import-safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main-module protection.
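A small sketch, not part of the original answer, that makes the re-import visible: under the "spawn" start method (forced here so the effect shows on any platform with Python 3.4+) the module-level print runs once per process, with __name__ set to '__mp_main__' in the children, while the guarded block runs only in the parent:

import os
from multiprocessing import get_context

print('importing this module as', __name__, 'in pid', os.getpid())

def double(x):
    return x * 2

if __name__ == '__main__':
    ctx = get_context('spawn')   # force "spawn" so the re-import happens even on POSIX
    with ctx.Pool(2) as pool:
        print(pool.map(double, [1, 2, 3]))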

As others have mentioned, the "spawn" start method used on Windows will re-import the code for each instance of the interpreter. This import will execute your code again in the child process (and this will make it create its own child, and so on).

A workaround is to pull the multiprocessing script out into a separate file and then use subprocess to launch it from the main script.

I pass variables into the script by pickling them in a temporary directory, and I pass the temporary directory into the subprocess with argparse.

I then pickle the results into the temporary directory, where the main script retrieves them.

Here is an example file_hasher() function that I wrote:

main_program.py

import os, pickle, shutil, subprocess, sys, tempfile

def file_hasher(filenames):
    # Temporary directory used to hand pickled inputs and outputs to the worker script.
    subprocess_directory = tempfile.mkdtemp()
    try:
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        # Launch file_hasher.py as a fresh process with the same interpreter.
        current_path = os.path.dirname(os.path.realpath(__file__))
        file_hasher_script = os.path.join(current_path, 'file_hasher.py')
        python_interpreter = sys.executable
        subprocess.call([python_interpreter, file_hasher_script, subprocess_directory],
                        timeout=60,
                        )
        # Read back the pickled results the worker script left behind.
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist

file_hasher.py

#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool

def file_hasher(input_file):
    # Hash one file; this runs in the worker processes of the Pool below.
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
    hashval = md5_hash.hexdigest()
    return hashval

if __name__ == '__main__':
    # The parent passes the temporary directory holding the pickled inputs.
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory

    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)

    # This script is the __main__ module of the worker process, so the
    # usual guard lives here instead of in main_program.py.
    hashlist = []
    p = Pool()
    for r in p.imap(file_hasher, filenames):
        hashlist.append(r)
    p.close()
    p.join()

    # Pickle the results where main_program.py expects to find them.
    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)
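With that layout, the helper can be called from an unguarded script, since the Pool lives entirely inside file_hasher.py. A hypothetical caller (the file names are placeholders):

from main_program import file_hasher

hashes = file_hasher(['a.txt', 'b.txt'])   # hypothetical input files
print(hashes)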

There must be a better way...
