Dask DataFrame Coiled KilledWorker read_sql
I'm trying to run a Dask cluster alongside a Dash app to analyze very large data sets. I'm able to run a LocalCluster successfully, and the Dask DataFrame computations complete successfully. The Dash app is started using the following gunicorn command:
Unfortunately, my issues occur when I try to move the cluster to coiled.
import coiled
import dask.dataframe as dd
from distributed import Client

# `table`, `conn_string`, and `log` are defined elsewhere in the app
coiled.create_software_environment(
    name="my-conda-env",
    conda={
        "channels": ["conda-forge", "defaults"],
        "dependencies": ["dask", "dash"],
    },
)
coiled.create_cluster_configuration(
    name="my-cluster-config",
    scheduler_cpu=1,
    scheduler_memory="1 GiB",
    worker_cpu=2,
    worker_memory="1 GiB",
    software="my-conda-env",
)
cluster = coiled.Cluster(n_workers=2)
CLIENT = Client(cluster)

dd_bills_df = dd.read_sql_table(
    table, conn_string, npartitions=10, index_col='DB_BillID'
)
CLIENT.publish_dataset(bills=dd_bills_df)
del dd_bills_df

log.debug(CLIENT.list_datasets())
x = CLIENT.get_dataset('bills').persist()
log.debug(x.groupby('BillType').count().compute())
The cluster is created, the data set is successfully published to the cluster, and the dataset is then successfully pulled by the client into the variable x. The problem occurs during the groupby() calculation.
[2021-12-03 17:40:30 -0600] [78928] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
worker.init_process()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 134, in init_process
self.load_wsgi()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
self.wsgi = self.app.wsgi()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 855, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Users/leowotzak/PenHole/test-containers2/src/application.py", line 61, in <module>
log.debug(x.groupby('BillType').count().compute())
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/dask/base.py", line 288, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/dask/base.py", line 571, in compute
results = schedule(dsk, keys, **kwargs)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 2725, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 1980, in gather
return self.sync(
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 868, in sync
return sync(
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/utils.py", line 332, in sync
raise exc.with_traceback(tb)
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/utils.py", line 315, in f
result[0] = yield future
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/Users/leowotzak/PenHole/test-containers2/venv/lib/python3.9/site-packages/distributed/client.py", line 1845, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('_read_sql_chunk-5519d13b-b80d-468e-afd5-5f072b9adbec', <WorkerState 'tls://10.4.27.61:40673', name: coiled-dask-leowotzc2-75566-worker-6a0538671d, status: closed, memory: 0, processing: 10>)
This is the log output prior to the crash:
DEBUG:application:Dask DataFrame Structure:
BillName BillType ByRequest Congress EnactedAs IntroducedAt BillNumber OfficialTitle PopularTitle ShortTitle CurrentStatus BillSubjectTopTerm URL TextURL DB_LastModDate DB_CreatedDate
npartitions=10
1.0 object object int64 object object datetime64[ns] object object object object object object object object datetime64[ns] datetime64[ns]
2739.9 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24651.1 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
27390.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: from-delayed, 20 tasks
DEBUG:application:('bills',)
I have tried increasing the memory allocated to each worker and the number of partitions in the Dask DataFrame, to no avail. I'm struggling to figure out what is killing the workers. Has anyone else run into this error?
If the dataset is very large, setting 1 GiB for the workers and scheduler might be very constraining. There are two options to try:

- Set the memory of the workers and scheduler to a level comparable to your local machine.
- Try the coiled version of the code on a fairly small subset of the table.

When doing groupby operations with large results, you can try the following:

- Pass observed=True so that only categories actually appearing in each group will be present in the results. This is a quirk of pandas groupby operations, but it can really blow up results in dask.dataframe.
- Use split_out in your aggregation call if supported; it takes the number of output partitions, e.g. df.groupby(large_set).mean(split_out=8). By default, the result of a groupby operation is returned as a single partition; split_out is significantly slower, but it won't blow up your memory.
- Use df.map_partitions to reduce the size of the data in each partition as a preprocessing step.
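The observed=True point is easy to see with plain pandas (a toy sketch: the column names are borrowed from the question, but the data and categories are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the 'bills' data, with a categorical key
# that declares more categories than actually occur in the rows.
df = pd.DataFrame({
    "BillType": pd.Categorical(
        ["hr", "s", "hr"], categories=["hr", "s", "hjres", "sjres"]
    ),
    "Congress": [117, 117, 116],
})

# With observed=False, every declared category appears in the result,
# even ones with no matching rows -- this is what inflates large
# groupby results on categorical columns.
full = df.groupby("BillType", observed=False).count()

# observed=True keeps only the categories that actually occur.
compact = df.groupby("BillType", observed=True).count()

print(len(full), len(compact))  # 4 2
```

The same keyword is forwarded by dask.dataframe's groupby, so the saving carries over to the distributed case, multiplied across partitions.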
The source of the error stems from misconfigured dask-worker and dask-scheduler software environments, unrelated to coiled and the code sample in the original post.
The dask-scheduler and dask-worker processes were running in docker containers on EC2 instances. To initialize these processes, the following command was used:
sudo docker run -it --net=host daskdev/dask:latest dask-worker <host>:<port>
daskdev/dask is defined as such in the documentation:

"This is a normal debian + miniconda image with the full Dask conda package (including the distributed scheduler), Numpy, and Pandas. This image is about 1GB in size."
The problem is, dask.dataframe.read_sql_table(...) utilizes sqlalchemy and, by extension, a database driver such as pymysql. These are not included in the base image. To solve this, the previous docker run command can be amended as follows:
sudo docker run -it -e EXTRA_PIP_PACKAGES="sqlalchemy pymysql" --net=host daskdev/dask:latest dask-worker <host>:<port>
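Alternatively, the same fix can be baked into a custom image rather than installed at container start (a sketch; the base tag matches the command above, but pinning a specific version is advisable):

```dockerfile
# Extend the stock Dask image with the SQL dependencies that
# read_sql_table needs on every worker and on the scheduler.
FROM daskdev/dask:latest
RUN pip install sqlalchemy pymysql
```

Workers started from such an image skip the pip install at startup, so they come up faster and can't fail because PyPI is unreachable.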