
Re-using intermediate results in Dask (mixing delayed and dask.dataframe)

Based on the answer I had received on an earlier question, I have written an ETL procedure that looks as follows:

import dask
import pandas as pd
from dask import delayed
from dask import dataframe as dd

def preprocess_files(filename):
    """Reads file, collects metadata and identifies lines not containing data.
    """
    ...
    return filename, metadata, skiprows

def load_file(filename, skiprows):
    """Loads the file into a pandas dataframe, skipping lines not containing data."""
    ...
    return df

def process_errors(filename, skiplines):
    """Calculates error metrics based on the information 
    collected in the pre-processing step
    """
    ...

def process_metadata(filename, metadata):
    """Analyses metadata collected in the pre-processing step."""
    ...

values = [delayed(preprocess_files)(fn) for fn in file_names]
filenames = [value[0] for value in values]
metadata = [value[1] for value in values]
skiprows = [value[2] for value in values]

error_results = [delayed(process_errors)(arg[0], arg[1]) 
                 for arg in zip(filenames, skiprows)]
meta_results = [delayed(process_metadata)(arg[0], arg[1]) 
                for arg in zip(filenames, metadata)]

dfs = [delayed(load_file)(arg[0], arg[1]) 
       for arg in zip(filenames, skiprows)]
... # several delayed transformations defined on individual dataframes

# finally: categorize several dataframe columns and write them to HDF5
dfs = dd.from_delayed(dfs, meta=metaframe)
dfs = dfs.categorize(columns=[...])  # I would like to delay this
dfs.to_hdf(hdf_file_name, '/data',...)  # I would also like to delay this

all_operations = error_results + meta_results # + delayed operations on dask dataframe
# trigger all computation at once, 
# allow re-using of data collected in the pre-processing step.
dask.compute(*all_operations)

The ETL process goes through several steps:

  1. Pre-process the files, identify lines which do not contain any relevant data, and parse metadata.
  2. Using the information gathered, process error information and metadata, and load the data lines into pandas dataframes in parallel (re-using the results from the pre-processing step). The operations (process_metadata, process_errors, load_file) have a shared data dependency in that they all use information gathered in the pre-processing step. Ideally, the pre-processing step would only be run once and the results shared across processes.
  3. Eventually, collect the pandas dataframes into a dask dataframe, categorize them, and write them to HDF.

The problem I am having with this is that categorize and to_hdf trigger computation immediately, discarding the metadata and error data which would otherwise be further processed by process_errors and process_metadata.

I have been told that delaying operations on dask.dataframes can cause problems, which is why I would be very interested to know whether it is possible to trigger the entire computation (processing metadata, processing errors, loading dataframes, transforming dataframes and storing them in HDF format) at once, allowing the different processes to share the data collected in the pre-processing phase.

There are two ways to approach your problem:

  1. Delay everything
  2. Compute in stages

Delay Everything

The to_hdf call accepts a compute= keyword argument that you can set to False. If False, it will hand you back a dask.delayed value that you can compute whenever you feel like it.
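
For example, reusing the variable names from the question (dfs, hdf_file_name, error_results, meta_results) and setting the categorize issue aside for the moment, a minimal sketch might look like this:

hdf_write = dfs.to_hdf(hdf_file_name, '/data', compute=False)  # returns a delayed value

# compute the HDF write together with the other delayed results,
# so the pre-processing tasks are shared rather than re-run
all_operations = error_results + meta_results + [hdf_write]
dask.compute(*all_operations)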

The categorize call, however, does need to be computed immediately if you want to keep using dask.dataframe. We're unable to create a consistent dask.dataframe without going through the data more-or-less immediately. Recent improvements in Pandas around unioning categoricals will let us change this in the future, but for now you're stuck. If this is a blocker for you, then you'll have to switch down to dask.delayed and handle things manually for a bit with df.to_delayed().
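
A rough sketch of that manual route, assuming cat_columns is a hypothetical list of column names to convert (note that categorizing each partition independently does not unify the categories across partitions, which is exactly the pass that dask.dataframe's categorize performs for you):

partitions = dfs.to_delayed()  # one delayed pandas DataFrame per partition

def categorize_partition(pdf, columns):
    """Hypothetical helper: categorize columns within a single partition."""
    for col in columns:
        pdf[col] = pdf[col].astype('category')
    return pdf

categorized = [delayed(categorize_partition)(part, cat_columns) for part in partitions]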

Compute in Stages

If you use the distributed scheduler, you can stage your computation by using the .persist method.

import dask
from dask.distributed import Executor
e = Executor()  # make a local "cluster" on your laptop

# persist a list of delayed values; their results stay in distributed memory
delayed_values = e.persist(delayed_values)

# ... define further computations on the persisted delayed values ...

results = dask.compute(results)  # compute as normal

This will let you trigger some computations while still letting you proceed to define the rest of your computation. The values that you persist will stay in memory.
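
Applied to the pipeline in the question, that might look roughly like the following sketch (variable names are taken from the question's code; e is the Executor created above, and persist is assumed to accept a list of delayed values):

# persist the pre-processing results so process_errors, process_metadata and
# load_file all reuse them from memory instead of re-running preprocess_files
values = e.persist([delayed(preprocess_files)(fn) for fn in file_names])

# build error_results, meta_results and dfs from the persisted values exactly
# as before; categorize and to_hdf then no longer repeat the pre-processing work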
