
Union multiple Spark DataFrames

I have about 10,000 different Spark DataFrames that need to be merged using union, but the union takes a very long time.

Below is a brief sample of the code I ran; dfs is the collection of DataFrames that I'd like to union:

from functools import reduce
from pyspark.sql import DataFrame

# Fold unionAll pairwise over the whole collection of DataFrames
dfOut = reduce(DataFrame.unionAll, dfs)

It seems that unioning 100-200 DataFrames is quite fast, but the running time grows much faster than linearly as the number of DataFrames to merge increases.

Any suggestions on improving the efficiency? Thanks a lot!

The details of this issue are available at https://issues.apache.org/jira/browse/SPARK-12616:

Union logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer (or analyzer?) to collapse all adjacent Unions into one.

Note that this problem doesn't exist in the physical plan, because the physical Union already supports an arbitrary number of children.
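That fix lives inside Spark itself; a user-side mitigation sometimes suggested in the meantime is to union in a balanced, tree-like order instead of a left-deep chain. The total number of Union nodes stays the same, but the plan depth drops from n to roughly log2(n), which avoids very deep recursion during analysis. A minimal sketch (tree_union is a hypothetical helper, not a Spark API):

def tree_union(frames):
    # Union adjacent pairs in rounds, so the logical plan is a
    # balanced tree of depth ~log2(n) rather than a chain of depth n.
    while len(frames) > 1:
        paired = [
            frames[i].unionAll(frames[i + 1])
            for i in range(0, len(frames) - 1, 2)
        ]
        if len(frames) % 2 == 1:
            paired.append(frames[-1])  # carry the odd frame forward
        frames = paired
    return frames[0]

dfOut = tree_union(list(dfs))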

The issue above was fixed in version 2.0.0. If you have to use a version lower than 2.0.0, union the data with the RDD union function instead.
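A minimal sketch of that RDD-based route for Spark < 2.0, assuming all DataFrames in dfs share the same schema and that sc and sqlContext are the usual pre-2.0 entry points. SparkContext.union accepts a whole list of RDDs in one call, so no deep chain of binary unions is built:

# SparkContext.union merges the full list in a single call, avoiding
# the chain of binary Union nodes in the DataFrame logical plan.
rdd_union = sc.union([df.rdd for df in dfs])

# Rebuild a DataFrame from the merged RDD of Rows, reusing the schema
# of the first frame (all schemas are assumed to match).
dfOut = sqlContext.createDataFrame(rdd_union, dfs[0].schema)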
