
Why does running compute() on a filtered Dask dataframe take so long?

I'm reading in data using this: ddf1 = dd.read_sql_table('mytable', conn_string, index_col='id', npartitions=8)

Of course, this runs instantaneously due to lazy computation. This table has several hundred million rows.

Next, I want to filter this Dask dataframe:

ddf2 = ddf1.query('some_col == "converted"')

Finally, I want to convert this to a Pandas dataframe. The result should only be about 8000 rows:

ddf3 = ddf2.compute()

However, this is taking a very long time (~1 hour). Can I get any advice on how to substantially speed this up? I've tried using .compute(scheduler='threads') and changing the number of partitions, but none of these have worked so far. What am I doing wrong?

Firstly, you may be able to use sqlalchemy expression syntax to encode your filter clause in the query, and do the filtering server-side. If data transfer is your bottleneck, then that is your best solution, especially if the filter column is indexed.
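
A minimal sketch of that idea, assuming SQLAlchemy 1.4+ and a Dask version that provides read_sql_query; the table, column, and connection names are taken from the question:

import sqlalchemy as sa
import dask.dataframe as dd

# Reflect the table so the filter can be expressed in SQL and evaluated server-side.
engine = sa.create_engine(conn_string)
table = sa.Table('mytable', sa.MetaData(), autoload_with=engine)

# Only the matching rows (~8000) ever leave the database.
query = sa.select(table).where(table.c.some_col == 'converted')
ddf = dd.read_sql_query(query, conn_string, index_col='id', npartitions=8)
df = ddf.compute()

With only ~8000 matching rows expected, it may be even simpler to skip Dask for this step and run the same filtered query through plain pandas.read_sql.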

Depending on your DB backend, sqlalchemy probably does not release the GIL, so your partitions cannot run in parallel in threads. All you are getting is contention between the threads and extra overhead. You should use the distributed scheduler with processes.
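
A sketch of that, assuming you keep the ddf2 from the question and use a local process-based cluster (n_workers is a guess, tune it to your machine):

from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    # One single-threaded worker per process, so GIL-bound DB reads run in parallel.
    cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(cluster)   # becomes the default scheduler for .compute()
    ddf3 = ddf2.compute()      # partitions now execute in separate processes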

Of course, please look at your CPU and memory usage; with the distributed scheduler, you also have access to the diagnostic dashboard. You should also be concerned with how big each partition will be in memory.
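
For instance, once a distributed Client exists you can open the dashboard and get a rough per-partition row count (reusing the client and ddf2 from above):

# URL of the live diagnostic dashboard (task stream, worker CPU/memory, progress)
print(client.dashboard_link)

# Rough check of how many rows each partition holds after the filter
print(ddf2.map_partitions(len).compute())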
