
How can I avoid a collectAsList() when path is read from another data source?

I have parquet paths stored as a table column, and I need to pass that column's values as the input paths to a parquet read.

List<Row> rows = spark.read().<datasource>.select("path").collectAsList();

List<String> paths = <convert the rows to string>;

spark.read().parquet(paths);

collectAsList is an expensive operation because the data is brought to the driver.

Is there a better approach?


No, there is no alternative way.

The code represented by spark.read.parquet will always be executed on the driver. The driver tells each executor which part of which parquet file that executor should load, and the executors then load the data. But coordinating which executor handles which part of a parquet file is the driver's task. So the paths have to be shipped to the driver.

After the bad news, here is the good part: it is true that collectAsList is expensive, but it is not that expensive. collectAsList is expensive when dealing with huge dataframes. Huge in this context means hundreds of millions of rows. I doubt that you are planning to load that many parquet files. As long as the list of paths "only" contains a few tens of thousands of rows, there is nothing wrong with sending this list to the driver. A standard JVM running the driver will easily handle such a list.
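As a minimal sketch of that approach, the path column can be collected directly as strings (the source name "path_table" and the column name "path" are assumptions here; substitute your actual datasource). Using Encoders.STRING() avoids the manual Row-to-String conversion step:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.List;

public class ReadPathsFromTable {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Collect the path column directly as strings on the driver.
        // "path_table" is a hypothetical source; replace with your datasource.
        List<String> paths = spark.read()
                .table("path_table")
                .select("path")
                .as(Encoders.STRING())
                .collectAsList();

        // Pass all collected paths to a single parquet read.
        Dataset<Row> df = spark.read().parquet(paths.toArray(new String[0]));
        df.show();
    }
}
```

Note that spark.read().parquet(...) accepts a varargs of paths, so one call can load every file in the list.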

Another approach is to use a CollectionAccumulator.

CollectionAccumulator<String> accumulator =
        spark.sparkContext().collectionAccumulator("paths");

df.javaRDD().foreachPartition(rows -> {
    // executor code: add values to the accumulator
    rows.forEachRemaining(row -> accumulator.add(row.getString(0)));
});

// Note that this is executed on the driver
List<String> paths = accumulator.value();

