
Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below are some examples of these files:

file 1:
A,B
True,False
False,False

file 2:
A,C
True,False
False,True
True,True

What I am looking to do is to read and concatenate these files in the fastest way possible, obtaining the following result:

A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
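
For reference, here is a minimal sketch of the sequential pandas baseline I am trying to beat; pd.concat already does the column alignment, filling the missing columns with NaN:

import glob
import pandas as pd

# Eager, sequential baseline: concat performs an outer join on columns,
# so B and C are filled with NaN for the files that do not contain them
files = glob.glob('test/*/file.parquet')
print(pd.concat([pd.read_parquet(f) for f in files], ignore_index=True))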

To do that I am using the following code, adapted from ( Reading multiple files with Dask , Dask dataframes: reading multiple files & storing filename in column ):

import glob

import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def read_parquet(path):
    # Plain pandas read; each call becomes one delayed task
    return pd.read_parquet(path)

if __name__ == '__main__':

    files = glob.glob('test/*/file.parquet')

    print('Start dask client...')
    client = Client()

    # Build one single-partition dask dataframe per file from a delayed read
    results = [dd.from_delayed(dask.delayed(read_parquet)(f)) for f in files]

    # Concatenate all of them and materialize the result
    results = dd.concat(results).compute()

    client.close()

This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files; however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:

[screenshot: task-graph dashboard]

The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat, and it is using all of my workers.

Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?

The problem with your code is that you use the pandas version of read_parquet.

Instead use:

  • the dask version of read_parquet,
  • the map and gather methods offered by Client,
  • the dask version of concat.

Something like:

def read_parquet(path):
    # Dask (lazy) version instead of the pandas (eager) one
    return dd.read_parquet(path)

def myRead():
    # Submit one read per file to the cluster, gather the resulting
    # (still lazy) dask dataframes, then concatenate them
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)

result = myRead().compute()
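
As a side note, depending on your dask version you may be able to skip the map/gather step entirely and pass the whole list of paths to a single dd.read_parquet call; a sketch, assuming the engine can reconcile the differing schemas (if it cannot, the per-file approach above is the safe one):

import glob
import dask.dataframe as dd

# One call over all files; whether the mismatched columns (B vs C) are
# tolerated depends on the parquet engine and dask version, so treat this
# as an optimistic shortcut rather than a guaranteed equivalent
result = dd.read_parquet(glob.glob('file_*.parquet')).compute()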

Before that, I created a client, once only. The reason was that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed before.
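
A minimal sketch of that create-once pattern, assuming the same __main__ guard as in the question (the try/finally is just illustrative):

from dask.distributed import Client

if __name__ == '__main__':
    # Create the client exactly once and reuse it everywhere;
    # re-creating one after close() raised errors in my experiments
    client = Client()
    try:
        result = myRead().compute()
    finally:
        client.close()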
