
Why does running compute() on a filtered Dask dataframe take so long?

I'm reading in data using this: ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)

Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.

Next, I want to filter this Dask dataframe:

ddf2 = ddf1.query('some_col == "converted"')

Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:

ddf3 = ddf2.compute()

However, this is taking a very long time (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads') and changing the number of partitions, but none of these have worked so far. What am I doing wrong?

Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query, and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
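
A minimal sketch of that idea, assuming SQLAlchemy 1.4+ and a Dask version that provides read_sql_query; the table, column, and connection names are taken from the question:

import sqlalchemy as sa
import dask.dataframe as dd

# Reflect the table so the filter can be expressed in SQL and evaluated server-side.
engine = sa.create_engine(conn_string)
table = sa.Table('mytable', sa.MetaData(), autoload_with=engine)

# Only the matching rows (~8000) ever leave the database.
query = sa.select(table).where(table.c.some_col == 'converted')
ddf = dd.read_sql_query(query, conn_string, index_col='id', npartitions=8)
df = ddf.compute()

With only ~8000 matching rows expected, it may be even simpler to skip Dask for this step and run the same filtered query through plain pandas.read_sql.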

Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
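
A sketch of that, assuming you keep the ddf2 from the question and use a local process-based cluster (n_workers is a guess, tune it to your machine):

from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    # One single-threaded worker per process, so GIL-bound DB reads run in parallel.
    cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(cluster)   # becomes the default scheduler for .compute()
    ddf3 = ddf2.compute()      # partitions now execute in separate processes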

Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
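
For instance, once a distributed Client exists you can open the dashboard and get a rough per-partition row count (reusing the client and ddf2 from above):

# URL of the live diagnostic dashboard (task stream, worker CPU/memory, progress)
print(client.dashboard_link)

# Rough check of how many rows each partition holds after the filter
print(ddf2.map_partitions(len).compute())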
