
How can I avoid a collectAsList() when path is read from another data source?

I have parquet paths stored as a table column, and I need to pass that column's values as the input paths to a parquet read.

List<Row> rows = spark.read().<datasource>.select("path").collectAsList();

List<String> paths = <convert the rows to string>;

spark.read().parquet(paths);

collectAsList is an expensive operation because the data is brought to the driver.

Is there a better approach?


No, there is no alternative way.

The code represented by spark.read.parquet will always be executed on the driver. The driver tells each executor which part of which parquet file that executor should load, and the executors then load the data. But coordinating which executor handles which part of a parquet file is the driver's task. So the paths have to be shipped to the driver.

After the bad news, here is the good part: it is true that collectAsList is expensive, but it is not that expensive. collectAsList is expensive when dealing with huge dataframes. Huge in this context means hundreds of millions of rows. I doubt that you are planning to load that many parquet files. As long as the list of paths "only" contains a few tens of thousands of rows, there is nothing wrong with sending this list to the driver. A standard JVM running the driver will easily handle such a list.
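As a minimal sketch of that approach, the path column can be collected directly as strings (the source name "path_table" and the column name "path" are assumptions here; substitute your actual datasource). Using Encoders.STRING() avoids the manual Row-to-String conversion step:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.List;

public class ReadPathsFromTable {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().getOrCreate();

        // Collect the path column directly as strings on the driver.
        // "path_table" is a hypothetical source; replace with your datasource.
        List<String> paths = spark.read()
                .table("path_table")
                .select("path")
                .as(Encoders.STRING())
                .collectAsList();

        // Pass all collected paths to a single parquet read.
        Dataset<Row> df = spark.read().parquet(paths.toArray(new String[0]));
        df.show();
    }
}
```

Note that spark.read().parquet(...) accepts a varargs of paths, so one call can load every file in the list.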

Another approach is to use a CollectionAccumulator.

CollectionAccumulator<String> accumulator =
        spark.sparkContext().collectionAccumulator("paths");

df.javaRDD().foreachPartition(rows -> {
    // executor code: add values to the accumulator
    rows.forEachRemaining(row -> accumulator.add(row.getString(0)));
});

// Note that this is executed on the driver
List<String> paths = accumulator.value();

