
Spark Java Repartition

Java spark2
Is there any difference between these two statements?

Dataset<Row> Data; 


Data.repartition(new Column("key"));

and 

Data.repartition(Data.col("key"));

Doing

Data.repartition(new Column("key"));

is equivalent to

import static org.apache.spark.sql.functions.col;
Data.repartition(col("key"));

In both cases the column is not bound to any particular Dataset, so Spark has to resolve it by name during the analysis phase.

If you instead use

Data.repartition(Data.col("key"));

you tell Spark explicitly which Dataset the column belongs to. This form is mainly useful in joins, where, for example, two datasets may share a column name.
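To make the join case concrete, here is a minimal sketch. The class name, the two tiny datasets, and their "name" and "amount" columns are made up for illustration; only the shared "key" column comes from the question. It assumes Spark SQL is on the classpath and runs in local mode.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class, not part of the original question.
class JoinColumnExample {

    // Builds two small datasets that both have a column named "key" and joins them.
    static Dataset<Row> joinOnKey(SparkSession spark) {
        StructType leftSchema = new StructType()
                .add("key", DataTypes.IntegerType)
                .add("name", DataTypes.StringType);
        StructType rightSchema = new StructType()
                .add("key", DataTypes.IntegerType)
                .add("amount", DataTypes.IntegerType);

        Dataset<Row> left = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, "a"), RowFactory.create(2, "b")),
                leftSchema);
        Dataset<Row> right = spark.createDataFrame(
                Arrays.asList(RowFactory.create(1, 10), RowFactory.create(3, 30)),
                rightSchema);

        // left.col("key") and right.col("key") each name their own Dataset,
        // so the join condition is unambiguous. An unqualified col("key")
        // could match either side here.
        Dataset<Row> joined = left.join(right, left.col("key").equalTo(right.col("key")));

        // The same Dataset-bound columns pick which "key" to keep in the output.
        return joined.select(left.col("key"), left.col("name"), right.col("amount"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("JoinColumnExample")
                .master("local[*]")
                .getOrCreate();
        joinOnKey(spark).show();
        spark.stop();
    }
}
```

Only key 1 appears on both sides, so the inner join keeps a single row.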

In your example the result is the same, so you can use either form.
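If you want to check this yourself, one way is to build all three forms on the same data and compare the plans Spark prints. This is only a sketch: the class name and the sample data (a single "key" column made with spark.range) are assumptions, and it relies on the Column(String) constructor used in the question, as in Spark 2.x.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical example class, not part of the original question.
class RepartitionForms {

    // Returns the same data repartitioned by "key" using the three forms discussed above.
    static List<Dataset<Row>> threeForms(Dataset<Row> data) {
        Dataset<Row> byNewColumn = data.repartition(new Column("key")); // resolved by name
        Dataset<Row> byFunctionsCol = data.repartition(col("key"));     // resolved by name
        Dataset<Row> byDatasetCol = data.repartition(data.col("key"));  // bound to data
        return Arrays.asList(byNewColumn, byFunctionsCol, byDatasetCol);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RepartitionForms")
                .master("local[*]")
                .getOrCreate();

        // Sample data: spark.range yields one column, renamed here to "key".
        Dataset<Row> data = spark.range(0, 100).toDF("key");

        // Print the physical plan of each form so they can be compared side by side.
        for (Dataset<Row> ds : threeForms(data)) {
            ds.explain();
        }
        spark.stop();
    }
}
```

With a single Dataset there is only one "key" to resolve, so the three plans should show the same partitioning by that column.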
