Dask DataFrame Coiled KilledWorker read_sql
I'm trying to run a Dask cluster alongside a Dash app to analyze very large data sets. I'm able to run a LocalCluster successfully, and the Dask DataFrame computations complete successfully. The Dash app is started using the following gunicorn command:
Unfortunately, my issues occur when I try to move the cluster to coiled.
import coiled
import dask.dataframe as dd
from distributed import Client

# `table`, `conn_string`, and `log` are defined elsewhere in the app
coiled.create_software_environment(
    name="my-conda-env",
    conda={
        "channels": ["conda-forge", "defaults"],
        "dependencies": ["dask", "dash"],
    },
)
coiled.create_cluster_configuration(
    name="my-cluster-config",
    scheduler_cpu=1,
    scheduler_memory="1 GiB",
    worker_cpu=2,
    worker_memory="1 GiB",
    software="my-conda-env",
)
cluster = coiled.Cluster(n_workers=2)
CLIENT = Client(cluster)

dd_bills_df = dd.read_sql_table(
    table, conn_string, npartitions=10, index_col='DB_BillID'
)
CLIENT.publish_dataset(bills=dd_bills_df)
del dd_bills_df

log.debug(CLIENT.list_datasets())
x = CLIENT.get_dataset('bills').persist()
log.debug(x.groupby('BillType').count().compute())
The cluster is created, the data set is successfully published to the cluster, and the dataset is then successfully pulled by the client into the variable x. The problem occurs during the groupby() calculation.
[2021-12-03 17:40:30 -0600] [78928] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 855, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Users/leowotzak/PenHole/test-containers2/src/application.py", line 61, in <module>
log.debug(x.groupby('BillType').count().compute())
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/dask/base.py", line 288, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/dask/base.py", line 571, in compute
results = schedule(dsk, keys, **kwargs)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 2725, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 1980, in gather
return self.sync(
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 868, in sync
return sync(
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
raise exc.with_traceback(tb)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
result[0] = yield future
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 1845, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('_read_sql_chunk-5519d13b-b80d-468e-afd5-5f072b9adbec', <WorkerState 'tls://10.4.27.61:40673', name: coiled-dask-leowotzc2-75566-worker-6a0538671d, status: closed, memory: 0, processing: 10>)
This is the log output prior to the crash:
DEBUG:application:Dask DataFrame Structure:
BillName BillType ByRequest Congress EnactedAs IntroducedAt BillNumber OfficialTitle PopularTitle ShortTitle CurrentStatus BillSubjectTopTerm URL TextURL DB_LastModDate DB_CreatedDate
npartitions=10
1.0 object object int64 object object datetime64[ns] object object object object object object object object datetime64[ns] datetime64[ns]
2739.9 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24651.1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27390.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: from-delayed, 20 tasks
DEBUG:application:('bills',)
I have tried increasing the memory allocated to each worker and the number of partitions in the Dask DataFrame, to no avail. I'm struggling to figure out what is killing the workers. Has anyone else run into this error?
If the dataset is very large, setting 1 GiB for the workers and scheduler might be very constraining. There are two options to try:

- Set the memory of the workers and scheduler to a level comparable to your local machine.
- Try the coiled version of the code on a fairly small subset of the table.

When doing groupby operations with large results, you can try the following:

- Pass observed=True so that only categories actually appearing in each group will be present in the results. This is a quirk of pandas groupby operations, but it can really blow up results in dask.dataframe.
- Use split_out in your aggregation call if supported; it takes the number of output partitions, e.g. df.groupby(large_set).mean(split_out=8). By default, the result of a groupby operation is returned as a single partition; split_out is significantly slower, but it won't blow up your memory.
- Use df.map_partitions to reduce the size of the data in each partition as a preprocessing step.
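The observed=True point is easy to see with plain pandas (a toy sketch: the column names are borrowed from the question, but the data and categories are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the 'bills' data, with a categorical key
# that declares more categories than actually occur in the rows.
df = pd.DataFrame({
    "BillType": pd.Categorical(
        ["hr", "s", "hr"], categories=["hr", "s", "hjres", "sjres"]
    ),
    "Congress": [117, 117, 116],
})

# With observed=False, every declared category appears in the result,
# even ones with no matching rows -- this is what inflates large
# groupby results on categorical columns.
full = df.groupby("BillType", observed=False).count()

# observed=True keeps only the categories that actually occur.
compact = df.groupby("BillType", observed=True).count()

print(len(full), len(compact))  # 4 2
```

The same keyword is forwarded by dask.dataframe's groupby, so the saving carries over to the distributed case, multiplied across partitions.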
The source of the error stems from misconfigured dask-worker and dask-scheduler software environments, unrelated to coiled and the code sample in the original post.
The dask-scheduler and dask-worker processes were running in docker containers on EC2 instances. To initialize these processes, the following command was used:
sudo docker run -it --net=host daskdev/dask:latest dask-worker <host>:<port>
daskdev/dask is defined as such in the documentation:

"This is a normal debian + miniconda image with the full Dask conda package (including the distributed scheduler), Numpy, and Pandas. This image is about 1GB in size."
The problem is, dask.dataframe.read_sql_table(...) utilizes sqlalchemy and, by extension, a database driver such as pymysql. These are not included in the base image. To solve this, the previous docker run command can be amended as follows:
sudo docker run -it -e EXTRA_PIP_PACKAGES="sqlalchemy pymysql" --net=host daskdev/dask:latest dask-worker <host>:<port>
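Alternatively, the same fix can be baked into a custom image rather than installed at container start (a sketch; the base tag matches the command above, but pinning a specific version is advisable):

```dockerfile
# Extend the stock Dask image with the SQL dependencies that
# read_sql_table needs on every worker and on the scheduler.
FROM daskdev/dask:latest
RUN pip install sqlalchemy pymysql
```

Workers started from such an image skip the pip install at startup, so they come up faster and can't fail because PyPI is unreachable.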