
Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below are some examples of these files:

file 1:
A,B
True,False
False,False

file 2:
A,C
True,False
False,True
True,True

What I am looking to do is to read and concatenate these files in the fastest way possible, obtaining the following result:

A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
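
For reference, here is a minimal sketch of the sequential pandas baseline I am trying to beat; pd.concat already does the column alignment, filling the missing columns with NaN:

import glob
import pandas as pd

# Eager, sequential baseline: concat performs an outer join on columns,
# so B and C are filled with NaN for the files that do not contain them
files = glob.glob('test/*/file.parquet')
print(pd.concat([pd.read_parquet(f) for f in files], ignore_index=True))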

To do that I am using the following code, adapted from ( Reading multiple files with Dask , Dask dataframes: reading multiple files & storing filename in column ):

import glob

import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def read_parquet(path):
    # Plain pandas read; each call becomes one delayed task
    return pd.read_parquet(path)

if __name__ == '__main__':

    files = glob.glob('test/*/file.parquet')

    print('Start dask client...')
    client = Client()

    # Build one single-partition dask dataframe per file from a delayed read
    results = [dd.from_delayed(dask.delayed(read_parquet)(f)) for f in files]

    # Concatenate all of them and materialize the result
    results = dd.concat(results).compute()

    client.close()

This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files; however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:

[screenshot: task-graph dashboard]

The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat, and it is using all of my workers.

Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?

The problem with your code is that you use the pandas version of read_parquet.

Instead use:

  • the dask version of read_parquet,
  • the map and gather methods offered by Client,
  • the dask version of concat.

Something like:

def read_parquet(path):
    # Dask (lazy) version instead of the pandas (eager) one
    return dd.read_parquet(path)

def myRead():
    # Submit one read per file to the cluster, gather the resulting
    # (still lazy) dask dataframes, then concatenate them
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)

result = myRead().compute()
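
As a side note, depending on your dask version you may be able to skip the map/gather step entirely and pass the whole list of paths to a single dd.read_parquet call; a sketch, assuming the engine can reconcile the differing schemas (if it cannot, the per-file approach above is the safe one):

import glob
import dask.dataframe as dd

# One call over all files; whether the mismatched columns (B vs C) are
# tolerated depends on the parquet engine and dask version, so treat this
# as an optimistic shortcut rather than a guaranteed equivalent
result = dd.read_parquet(glob.glob('file_*.parquet')).compute()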

Before that, I created a client, once only. The reason was that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed before.
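
A minimal sketch of that create-once pattern, assuming the same __main__ guard as in the question (the try/finally is just illustrative):

from dask.distributed import Client

if __name__ == '__main__':
    # Create the client exactly once and reuse it everywhere;
    # re-creating one after close() raised errors in my experiments
    client = Client()
    try:
        result = myRead().compute()
    finally:
        client.close()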
