Union multiple Spark DataFrames

I have about 10,000 different Spark DataFrames that need to be merged using union, but the union takes a very long time.

Below is a brief sample of the code I ran, where dfs is the collection of DataFrames that I'd like to union:

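from functools import reduce
from pyspark.sql import DataFrame

# dfs: the collection of DataFrames to merge.
# reduce applies unionAll pairwise, left to right: ((df0 U df1) U df2) U ...
dfOut = reduce(DataFrame.unionAll, dfs)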

When I union 100-200 DataFrames, it is quite fast. But the running time seems to increase exponentially as I increase the number of DataFrames to merge.

Any suggestions on improving the efficiency? Thanks a lot!

This issue is tracked at https://issues.apache.org/jira/browse/SPARK-12616, which describes it as follows:

Union logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer (or analyzer?) to collapse all adjacent Unions into one.

Note that this problem doesn't exist in the physical plan, because the physical Union already supports an arbitrary number of children.
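To make the ticket's description concrete, here is a toy model in plain Python. It is not Spark's implementation: Leaf, Union, binary_union, depth, and collapse are hypothetical names invented for this illustration. It shows how reducing with a binary union builds a plan whose depth grows with the number of inputs, and how a "collapse adjacent unions" rule could flatten it into a single n-ary node.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Tuple


@dataclass(frozen=True)
class Leaf:
    """Stand-in for one input source (hypothetical, not a Spark class)."""
    name: str


@dataclass(frozen=True)
class Union:
    """Stand-in for a union node that may hold any number of children."""
    children: Tuple[object, ...]


def binary_union(left, right):
    # Mimics a binary union: every call adds one more level of nesting.
    return Union((left, right))


def depth(node):
    # Number of levels in the plan tree, counting leaves as level 1.
    if isinstance(node, Leaf):
        return 1
    return 1 + max(depth(child) for child in node.children)


def collapse(node):
    # Merge directly nested Union nodes into one n-ary Union, keeping input order.
    if isinstance(node, Leaf):
        return node
    flat = []
    for child in node.children:
        child = collapse(child)
        if isinstance(child, Union):
            flat.extend(child.children)
        else:
            flat.append(child)
    return Union(tuple(flat))


leaves = [Leaf(f"df{i}") for i in range(100)]
nested = reduce(binary_union, leaves)  # same shape as reduce(DataFrame.unionAll, dfs)
flat = collapse(nested)

print(depth(nested))       # 100: one extra level per additional input
print(depth(flat))         # 2: a single Union directly above all leaves
print(len(flat.children))  # 100
```

In this model the nested tree gets one level deeper with every extra input, while the collapsed tree stays at two levels regardless of how many inputs there are.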

This was fixed in version 2.0.0. If you have to use a version lower than 2.0.0, union the data using the RDD union function instead.
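A minimal sketch of that workaround, assuming every DataFrame in dfs has the same schema and that sc and sqlContext are the SparkContext and SQLContext already available in your session. The helper name union_via_rdd is made up for this example; it relies only on SparkContext.union, DataFrame.rdd, DataFrame.schema, and SQLContext.createDataFrame.

```python
def union_via_rdd(sc, sql_ctx, dfs):
    """Union many DataFrames with one RDD-level union (hypothetical helper).

    sc      -- an existing SparkContext
    sql_ctx -- an existing SQLContext
    dfs     -- iterable of DataFrames, assumed to share one schema
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("dfs must contain at least one DataFrame")
    # A single n-ary union over the underlying RDDs, so no deeply nested
    # logical plan of binary DataFrame unions is ever built.
    combined = sc.union([df.rdd for df in dfs])
    # Rebuild a DataFrame, reusing the schema of the first input
    # (assumption: all inputs have identical schemas).
    return sql_ctx.createDataFrame(combined, dfs[0].schema)
```

It would be called as dfOut = union_via_rdd(sc, sqlContext, dfs). Because the schema is taken from the first DataFrame, inputs whose columns differ in order or type should be aligned before calling it.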
