

Why does running compute() on a filtered Dask dataframe take so long?

I'm reading in data using this:

ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)

Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.

Next, I want to filter this Dask dataframe:

ddf2 = ddf1.query('some_col == "converted"')

Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:

ddf3 = ddf2.compute()

However, this is taking very long (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads') and changing the number of partitions, but neither has worked so far. What am I doing wrong?

Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query, and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.

Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.

Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
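To get a feel for partition size, one rough approach is to measure a small sample of rows (for example, fetched with a LIMIT query) and scale up. The function name estimate_partition_bytes is hypothetical, and the estimate ignores the index and any variation in row size.

```python
import pandas as pd


def estimate_partition_bytes(sample: pd.DataFrame, total_rows: int, npartitions: int) -> float:
    """Estimate the in-memory size of one partition from a small sample."""
    # deep=True counts the actual bytes of Python strings in object columns;
    # the index is left out here to keep the estimate simple.
    bytes_per_row = sample.memory_usage(deep=True, index=False).sum() / len(sample)
    # Assumes rows are spread evenly across partitions.
    return bytes_per_row * total_rows / npartitions
```

With several hundred million rows split into 8 partitions, each partition holds tens of millions of rows, so even narrow rows can add up to gigabytes per partition. dd.read_sql_table also takes a bytes_per_chunk argument that can be given instead of npartitions, letting Dask choose the partition count from the size of the first few rows.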

