Re-using intermediate results in Dask (mixing delayed and dask.dataframe)

Based on the answer I had received on an earlier question, I have written an ETL procedure that looks as follows:
```python
import dask
import pandas as pd
from dask import delayed
from dask import dataframe as dd

def preprocess_files(filename):
    """Reads file, collects metadata and identifies lines not containing data."""
    ...
    return filename, metadata, skiprows

def load_file(filename, skiprows):
    """Loads the file into a pandas dataframe, skipping lines not containing data."""
    ...
    return df

def process_errors(filename, skiprows):
    """Calculates error metrics based on the information
    collected in the pre-processing step.
    """
    ...

def process_metadata(filename, metadata):
    """Analyses metadata collected in the pre-processing step."""
    ...

values = [delayed(preprocess_files)(fn) for fn in file_names]
filenames = [value[0] for value in values]
metadata = [value[1] for value in values]
skiprows = [value[2] for value in values]

error_results = [delayed(process_errors)(fn, sr)
                 for fn, sr in zip(filenames, skiprows)]
meta_results = [delayed(process_metadata)(fn, md)
                for fn, md in zip(filenames, metadata)]
dfs = [delayed(load_file)(fn, sr)
       for fn, sr in zip(filenames, skiprows)]
...  # several delayed transformations defined on individual dataframes

# finally: categorize several dataframe columns and write them to HDF5
dfs = dd.from_delayed(dfs, meta=metaframe)
dfs.categorize(columns=[...])  # I would like to delay this
dfs.to_hdf(hdf_file_name, '/data', ...)  # I would also like to delay this

all_operations = error_results + meta_results  # + delayed operations on dask dataframe
# trigger all computation at once,
# allowing re-use of data collected in the pre-processing step
dask.compute(*all_operations)
```
The ETL process goes through several steps. Several of the functions (process_metadata, process_errors, load_file) have a shared data dependency in that they all use information gathered in the pre-processing step. Ideally, the pre-processing step would only be run once and the results shared across processes. The problem I am having is that categorize and to_hdf trigger computation immediately, discarding the metadata and error data which would otherwise be further processed by process_errors and process_metadata.
I have been told that delaying operations on dask.dataframes can cause problems, which is why I would be very interested to know whether it is possible to trigger the entire computation (processing metadata, processing errors, loading dataframes, transforming dataframes and storing them in HDF format) at once, allowing the different processes to share the data collected in the pre-processing phase.
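To make the sharing I am after concrete, here is a minimal sketch (with hypothetical stand-in functions in place of my real ones) of a single delayed pre-processing result fanning out to several consumers, all triggered by one compute call:

```python
import dask
from dask import delayed

@delayed
def preprocess(fn):
    # hypothetical stand-in: pretend each file yields metadata and rows to skip
    return {"metadata": fn.upper(), "skiprows": [0]}

@delayed
def process_metadata(pre):
    return pre["metadata"]

@delayed
def process_errors(pre):
    return len(pre["skiprows"])

pres = [preprocess(fn) for fn in ["a.csv", "b.csv"]]
meta = [process_metadata(p) for p in pres]
errs = [process_errors(p) for p in pres]

# one compute call: each preprocess task runs once,
# feeding both process_metadata and process_errors
results = dask.compute(*meta, *errs)
```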
There are two ways to approach your problem:
The to_hdf call accepts a compute= keyword argument that you can set to False. If False it will hand you back a dask.delayed value that you can compute whenever you feel like it.
The categorize call, however, does need to be computed immediately if you want to keep using dask.dataframe. We're unable to create a consistent dask.dataframe without going through the data more-or-less immediately. Recent improvements in Pandas around unioning categoricals will let us change this in the future, but for now you're stuck. If this is a blocker for you then you'll have to switch down to dask.delayed and handle things manually for a bit with df.to_delayed().
If you use the distributed scheduler you can stage your computation by using the .persist method.
```python
import dask
from dask.distributed import Executor

e = Executor()  # make a local "cluster" on your laptop

delayed_values = e.persist(*delayed_values)

...  # define further computations on delayed values ...

results = dask.compute(results)  # compute as normal
```
This will let you trigger some computations and still let you proceed onwards defining your computation. The values that you persist will stay in memory.