SLURM task fails when creating an instance of the Dask LocalCluster in an HPC cluster

I am queuing the job with the sbatch command and the following configuration:

#!/bin/bash
#SBATCH --job-name=dask-test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=80G
#SBATCH --time=00:30:00
#SBATCH --tmp=10G
#SBATCH --partition=normal
#SBATCH --qos=normal

python ./dask-test.py

The Python script looks more or less like this:

import pandas as pd
import dask.dataframe as dd
import numpy as np

from dask.distributed import Client, LocalCluster

print("Generating LocalCluster...")
cluster = LocalCluster()
print("Generating Client...")
client = Client(cluster, processes=False)

print("Scaling client...")
client.scale(8)

data = dd.read_csv(
    BASE_DATA_SOURCE + '/Data-BIGDATFILES-*.csv',
    delimiter=';',
)

def get_min_dt():
    min_dt = data.datetime.min().compute()
    print("Min is {}".format())

print("Getting min dt...")
get_min_dt()

The first problem is that the text "Generating LocalCluster..." is printed 6 times, which makes me wonder whether the script is somehow being run multiple times concurrently. Second, after a few minutes in which nothing is printed, I get the following message:

/anaconda3/lib/python3.7/site-packages/distributed/node.py:155: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 37396 instead
  http_address["port"], self.http_server.port

many times over, and finally the following, also repeated many times:

Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /cluster/home/user/anaconda3/lib/python3.7/asyncio/tasks.py:592> exception=RuntimeError('\n        An attempt has been made to start a new process before the\n        current process has finished its bootstrapping phase.\n\n        This probably means that you are not using fork to start your\n        child processes and you have forgotten to use the proper idiom\n        in the main module:\n\n            if __name__ == \'__main__\':\n                freeze_support()\n                ...\n\n        The "freeze_support()" line can be omitted if the program\n        is not going to be frozen to produce an executable.')>
Traceback (most recent call last):
  File "/cluster/home/user/anaconda3/lib/python3.7/asyncio/tasks.py", line 599, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/core.py", line 290, in _
    await self.start()
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 295, in start
    response = await self.instantiate()
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 378, in instantiate
    result = await self.process.start()
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 575, in start
    await self.process.start()
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 34, in _call_and_set_future
    res = func(*args, **kwargs)
  File "/cluster/home/user/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 202, in _start
    process.start()
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/cluster/home/user/anaconda3/lib/python3.7/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I have already tried adding more cores and more memory, setting processes=False when instantiating the Client, and many other things, but I cannot figure out what the problem is.
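A note on that last attempt: as far as I can tell, processes=False only takes effect when it reaches LocalCluster; once the cluster has already been built with the default process-based workers, passing the flag to Client changes nothing. A minimal sketch of a purely threaded setup, which spawns no child processes at all:

from dask.distributed import Client, LocalCluster

# Threaded workers instead of processes: nothing is spawned,
# so the multiprocessing bootstrapping error cannot occur.
cluster = LocalCluster(processes=False)
client = Client(cluster)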

The library/software versions in use are:

  • Python 3.7
  • Pandas 1.0.5
  • Dask 2.19.0
  • SLURM 17.11.7

Am I configuring something incorrectly? Is this way of using the LocalCluster and Client constructs correct?

After some research I was able to work out a solution. I am not entirely sure of the cause, but quite sure that it works. (The likely mechanism is the one the RuntimeError hints at: with the multiprocessing spawn start method, each new worker process re-imports the main module, so any module-level code, including the LocalCluster() call itself, runs again in every child. That would also explain the repeated "Generating LocalCluster..." prints and the port 8787 warnings.)

The instantiation of LocalCluster and Client, and all of the code after them (the code whose execution will be distributed), must not sit at module level in the Python script. Instead, that code must go inside a function or inside the __main__ block, as follows:

import pandas as pd
import dask.dataframe as dd
import numpy as np

from dask.distributed import Client, LocalCluster


if __name__ == "__main__":
    print("Generating LocalCluster...")
    cluster = LocalCluster()
    print("Generating Client...")
    client = Client(cluster, processes=False)

    print("Scaling client...")
    client.scale(8)

    data = dd.read_csv(
        BASE_DATA_SOURCE + '/Data-BIGDATFILES-*.csv',
        delimiter=';',
    )

    def get_min_dt():
        # Compute the global minimum across all CSV partitions
        min_dt = data.datetime.min().compute()
        print("Min is {}".format(min_dt))

    print("Getting min dt...")
    get_min_dt()

This simple change makes all the difference. I found the solution in this issue thread: https://github.com/dask/distributed/issues/2520#issuecomment-470817810
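The same guard also works with the setup wrapped in a function that is only called from the guarded block. A minimal sketch of that variant, with BASE_DATA_SOURCE as a placeholder path just as in the original script:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

BASE_DATA_SOURCE = '/path/to/data'  # placeholder; point at the real CSV directory


def main():
    # All cluster setup and distributed work lives inside this function,
    # so a spawned child re-importing the module executes none of it.
    cluster = LocalCluster()
    client = Client(cluster)
    client.scale(8)

    data = dd.read_csv(
        BASE_DATA_SOURCE + '/Data-BIGDATFILES-*.csv',
        delimiter=';',
    )
    print("Min is {}".format(data.datetime.min().compute()))


if __name__ == "__main__":
    main()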
