
hvplot taking hours to render image

I'm working with Gaia astrometric data from Data Release 3 and saw hvplot/datashader recommended as the go-to for visualizing large data, due to very fast render times and interactivity. In every example I've seen, rendering an image from hundreds of millions of data points takes a few seconds at most, even on the slow end. However, when I try to use the same code on my data, it takes hours for any image to render at all.

For context, I'm running this code on a very large research computing cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be the issue here. Additionally, I've converted the data I need into a series of parquet files that are read into a dask dataframe with glob. My code is as follows:

...

import dask.dataframe as dd
import hvplot.dask  # registers the .hvplot accessor on dask DataFrames
import colorcet as cc  # needed for cc.fire below
import glob

# read all parquet files, downcast to float32, and keep only the two plot columns
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')
df = df[['col1', 'col2']]
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)

...

Does anybody have any ideas what the issue could be here? Any help would be appreciated.

Edit: I've got the rendering times below an hour now by consolidating the data into a smaller number of larger files (3386 -> 175).
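
For anyone wanting to do the same, a minimal sketch of that consolidation step with dask (the paths and the target partition count of 175 are assumptions based on the edit above):

import dask.dataframe as dd

# read the many small parquet files, then rewrite them as fewer, larger partitions
df = dd.read_parquet('myfiles/*', engine='fastparquet')
df = df.repartition(npartitions=175)
df.to_parquet('myfiles_consolidated/', engine='fastparquet')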

Hard to debug without access to the data, but one quick optimization you can make is to avoid loading all the data and instead select only the specific columns of interest:

df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1', 'col2'])

Unless it's crucial, I'd also avoid the .astype call. It shouldn't be a bottleneck, but the gains from casting to float32 may not matter if memory isn't a constraint.
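
Putting both suggestions together, a minimal sketch of the revised loading/plotting code (file path and column names taken from the question):

import dask.dataframe as dd
import hvplot.dask  # registers the .hvplot accessor on dask DataFrames
import colorcet as cc

# read only the two columns needed for the plot, skip the float32 cast
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1', 'col2'])
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)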
