Python and Dask - reading and concatenating multiple files
I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below are some examples of these files:
file 1:
A,B
True,False
False,False
file 2:
A,C
True,False
False,True
True,True
What I am looking to do is to read and concatenate these files in the fastest way possible, obtaining the following result:
A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
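For reference, this column-union behaviour is exactly what pandas concat does on its own; a minimal sketch with the two example files as in-memory frames:

```python
import pandas as pd

# The two example files as in-memory frames (stand-ins for the parquet files)
df1 = pd.DataFrame({"A": [True, False], "B": [False, False]})
df2 = pd.DataFrame({"A": [True, False, True], "C": [False, True, True]})

# pd.concat takes the union of the columns; values missing from a
# frame come out as NaN, matching the expected result above
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```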
To do that I am using the following code, adapted from (Reading multiple files with Dask, Dask dataframes: reading multiple files & storing filename in column):
import glob
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def read_parquet(path):
    return pd.read_parquet(path)

if __name__ == '__main__':
    files = glob.glob('test/*/file.parquet')
    print('Start dask client...')
    client = Client()
    results = [dd.from_delayed(dask.delayed(read_parquet)(path)) for path in files]
    results = dd.concat(results).compute()
    client.close()
This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files; however, in the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:
The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat, and it is using all of my workers.
Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?
The problem with your code is that you use the pandas version of read_parquet. Instead use something like:
import glob
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

def read_parquet(path):
    return dd.read_parquet(path)

def myRead():
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)

result = myRead().compute()
Before that I created a client, once only. The reason was that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed before.
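For comparison, here is a dask-free baseline (an editor's sketch, not part of the original answer): read the files concurrently with plain pandas in a thread pool, then concatenate once. read_all is a hypothetical helper, and the reader is injectable so the function can be exercised without real parquet files.

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def read_all(paths, reader=pd.read_parquet):
    # Read every file concurrently; pyarrow-backed parquet reads release
    # the GIL for much of the work, so threads can overlap I/O and decode
    with ThreadPoolExecutor() as pool:
        frames = list(pool.map(reader, paths))
    # A single concat takes the union of columns, filling gaps with NaN
    return pd.concat(frames, ignore_index=True)
```

Usage would be e.g. read_all(glob.glob('test/*/file.parquet')).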