
How to save dask dataframe to parquet on same machine as dask scheduler/workers?

I'm trying to save my Dask Dataframe to parquet on the same machine as the Dask scheduler/workers are located. However, I am having trouble doing this.

My Dask setup: My Python script is executed on my local machine (a laptop with 16 GB RAM), but the script creates a Dask client to a Dask scheduler running on a remote machine (a server with 400 GB RAM for parallel computations). The Dask scheduler and workers are all located on the same server, so they all share the same file system, which is locally available to them. As this remote Dask scheduler is used by all members of my team, the files we are working on are also located on the same server, giving all members common access to all files through the same Dask cluster.
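For reference, the client is created roughly like this (the scheduler address below is just a placeholder):

from dask.distributed import Client

# The script runs on the laptop, but all computation happens on the
# remote server; "tcp://server:8786" is a placeholder address.
client = Client("tcp://server:8786")
print(client)  # summarises the workers and memory available on the server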

I have tried:

# This saves the parquet files in a folder on my local machine.
ddf.to_parquet(
    '/scratch/dataset_no_dalayed', compression='brotli').compute()

# This delayed call of `ddf.to_parquet` saves the Dask Dataframe chunks
# into individual parquet files (i.e. parts) in the given folder.
# However, I want to persist the Dask dataframe in my workflow, but this
# fails as seen below.
dask.delayed(ddf.to_parquet)(
    '/scratch/dataset_dalayed', compression='brotli').compute()

# If the Dask dataframe is persisted, the `to_parquet` fails with
# a "KilledWorker" error!
ddf = client.persist(ddf)
dask.delayed(ddf.to_parquet)(
    '/scratch/dataset_persist/', compression='brotli').compute()

# In the example below, I can NOT save the Dask dataframe,
# because the delayed function turns the Dask dataframe into a
# Pandas dataframe at runtime. This fails because the path is a
# folder and not a file, as Pandas requires!
@dask.delayed
def save(new_ddf):
    new_ddf.to_parquet('/scratch/dataset_function/', compression='brotli')

save(ddf).compute()

How do I do this correctly?

Usually, to save a dask dataframe as a parquet dataset, people do the following:

df.to_parquet(...)

From your question it sounds like your workers may not all have access to a shared file system like NFS or S3. If this is the case and you store to local drives, then your data will be scattered across various machines without an obvious way to collect them together. In principle, I encourage you to avoid this and invest in a shared file system. They are very helpful when doing distributed computing.
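If your workers do share a file system (as your setup description suggests), a plain to_parquet call with a path on that shared file system is usually all that is needed. A minimal sketch, where the scheduler address and the /scratch output directory are placeholders:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the remote scheduler; the address is a placeholder.
client = Client("tcp://scheduler-host:8786")

# Toy dataframe for illustration; in practice this would be the
# dataframe built earlier in your workflow.
ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# Each worker writes its own partitions into this directory, so the
# path must be on the file system the workers share (e.g. /scratch).
# No dask.delayed wrapper or extra .compute() call is needed.
ddf.to_parquet("/scratch/output_dataset", compression="brotli")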

If you can't do that then I personally would probably write in parallel to local drives and then scp them back to one machine afterwards.
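A rough sketch of that gather step, assuming hypothetical worker hostnames, that each worker wrote its parts into a local /tmp directory, and that passwordless SSH access is available:

import os
import subprocess

# Hypothetical hostnames and paths, purely for illustration.
workers = ["worker-1", "worker-2"]
remote_dir = "/tmp/dataset_parts"
local_dir = "./dataset_parts"

os.makedirs(local_dir, exist_ok=True)

# Copy every worker's locally written parquet parts back to one machine.
for host in workers:
    subprocess.run(
        ["scp", "-r", f"{host}:{remote_dir}", f"{local_dir}/{host}"],
        check=True,
    )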

If your dataset is small enough then you could also call .compute to get back a local Pandas dataframe and then write that using Pandas:

df.compute().to_parquet(...)
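A minimal sketch of that last option, assuming the collected data fits in the laptop's 16 GB of RAM and that the output file name is a placeholder:

import pandas as pd
import dask.dataframe as dd

# Toy Dask dataframe for illustration; replace with your own.
ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# Collect everything into one in-memory Pandas dataframe on the
# local machine, then write a single parquet file with Pandas.
ddf.compute().to_parquet("dataset.parquet", compression="brotli")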
