
Using Python's reduce() to join multiple PySpark DataFrames

Does anyone know why using Python3's functools.reduce() would lead to worse performance when joining multiple PySpark DataFrames than just iteratively joining the same DataFrames using a for loop? Specifically, this gives a massive slowdown followed by an out-of-memory error:

import functools

def join_dataframes(list_of_join_columns, left_df, right_df):
    return left_df.join(right_df, on=list_of_join_columns)

# fold the whole list of DataFrames into one by joining them pairwise
joined_df = functools.reduce(
    functools.partial(join_dataframes, list_of_join_columns), list_of_dataframes,
)

whereas this one doesn't:

joined_df = list_of_dataframes[0]
joined_df.cache()
for right_df in list_of_dataframes[1:]:
    joined_df = joined_df.join(right_df, on=list_of_join_columns)

Any ideas would be greatly appreciated. Thanks!

As long as you use CPython, it shouldn't matter (different implementations can, but realistically shouldn't, exhibit significantly different behavior in this specific case). If you take a look at the reduce implementation you'll see it is just a for-loop with minimal exception handling.

The core is exactly equivalent to the loop you use:

for element in it:
    value = function(value, element)

and there is no evidence supporting claims of any special behavior.
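
For reference, here is a sketch of a pure-Python equivalent, adapted from the roughly-equivalent code shown in the functools documentation (reduce_sketch is an illustrative name, not the library function):

# Rough pure-Python equivalent of functools.reduce, adapted from the
# roughly-equivalent code in the functools docs; the real CPython version
# is written in C, but has the same shape: a plain for-loop around `function`.
def reduce_sketch(function, iterable, initializer=None):
    it = iter(iterable)
    if initializer is None:
        try:
            # Without an initializer the first element seeds the accumulator.
            value = next(it)
        except StopIteration:
            raise TypeError("reduce() of empty iterable with no initial value") from None
    else:
        value = initializer
    for element in it:
        value = function(value, element)
    return value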

Additionally, simple tests with a number of frames close to the practical limitations of Spark joins (joins are among the most expensive operations in Spark)

# 200 small DataFrames that share a join key column `id`
dfs = [
    spark.range(10000).selectExpr(
        "rand({}) AS id".format(i), "id AS value", "{} AS loop".format(i)
    )
    for i in range(200)
]

show no significant difference in timing between the direct for-loop

def f(dfs):
    df1 = dfs[0]
    for df2 in dfs[1:]:
        df1 = df1.join(df2, ["id"])
    return df1

%timeit -n3 f(dfs)                 
## 6.25 s ± 257 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

and the reduce invocation

from functools import reduce

def g(dfs):
    return reduce(lambda x, y: x.join(y, ["id"]), dfs) 

%timeit -n3 g(dfs)
## 6.47 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

Similarly, overall JVM behavior patterns are comparable between the for-loop

[For loop CPU and Memory Usage - VisualVM]

and reduce

[reduce CPU and Memory Usage - VisualVM]

Finally, both generate identical execution plans:

g(dfs)._jdf.queryExecution().optimizedPlan().equals( 
    f(dfs)._jdf.queryExecution().optimizedPlan()
)
## True

which indicates there is no difference when the plans are evaluated and OOMs are likely to occur.
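
If you prefer to eyeball the plans rather than compare the JVM objects, the standard explain() method works as well (a minimal sketch, reusing the f and g defined above):

# Print the extended (parsed, analyzed, optimized, physical) plans;
# both outputs should match, consistent with the equality check above.
f(dfs).explain(True)
g(dfs).explain(True)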

In other words, correlation doesn't imply causation, and the observed performance problems are unlikely to be related to the method you use to combine DataFrames.

One reason is that a reduce or a fold is usually functionally pure: the result of each accumulation operation is not written to the same part of memory, but rather to a new block of memory.

In principle the garbage collector could free the previous block after each accumulation, but if it doesn't, you'll allocate memory for each updated version of the accumulator.
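
A toy illustration of that accumulation pattern in plain Python (not Spark-specific; it only shows the difference between allocating a fresh accumulator on every step and mutating one in place):

from functools import reduce

data = range(10_000)

# Functionally pure accumulation: acc + [x] builds a brand-new list on
# every step, so each intermediate accumulator is a separate allocation
# that the garbage collector has to reclaim later.
pure = reduce(lambda acc, x: acc + [x], data, [])

# In-place accumulation: a single list is mutated throughout, so no
# per-step accumulator copies are created.
in_place = []
for x in data:
    in_place.append(x)

assert pure == in_place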
