
hvplot taking hours to render image

I'm working with Gaia astrometric data from Data Release 3 and saw hvplot/datashader as the go-to for visualizing large data due to very fast render times and interactivity. In every example I'm seeing, it takes a few seconds, on the slow end, to render an image from hundreds of millions of data points. However, when I try to use the same code for my data, it takes hours for any image to render at all.

For context, I'm running this code on a very large research computer cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be an issue here. Additionally, I've converted the data I need to a series of parquet files that are read into a dask dataframe with glob. My code is as follows:

... ...

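import glob

import colorcet as cc  # provides the cc.fire colormap used below
import dask.dataframe as dd
import hvplot.dask  # registers the .hvplot accessor on dask dataframes

# Read every parquet file matched by the glob into one dask dataframe
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')  # downcast all columns to save memory
df = df[['col1', 'col2']]  # keep only the two plotted columns
# Rasterize with datashader instead of drawing every point
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)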

... ...

Does anybody have any idea what the issue could be here? Any help would be appreciated.

Edit: I've got the rendering times below an hour now by consolidating the data into a smaller number of larger files (3386 -> 175).

Hard to debug without access to the data, but one quick optimization you can implement is to avoid loading all the data and select only the specific columns of interest:

df=dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1','col2'])

Unless crucial, I'd also avoid doing .astype. It shouldn't be a bottleneck, but the gains from float32 might not be relevant if memory isn't a constraint.

