Join two dataframes using Spark Scala
I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
  col("d1.origin_latitude") === col("d2.origin_latitude") &&
  col("d1.origin_longitude") === col("d2.origin_longitude"), "left")
  .filter(col("d2.origin_longitude").isNull)

val c = p_value2.alias("d3").join(o.alias("d4"),
  col("d3.origin_latitude") === col("d4.origin_latitude") &&
  col("d3.origin_longitude") === col("d4.origin_longitude"), "left")
  .filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line:
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea?
Thank you.
You are aliasing the DataFrame, not the columns. The alias is what you use to access/refer to columns in that DataFrame. So the first join results in another DataFrame that contains the same column name twice (origin_latitude as well as origin_longitude). As soon as you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
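To make this concrete, here is a minimal, self-contained sketch (with made-up coordinate data; the local DataFrames stand in for the question's p_value and t_d) showing that the alias-based join keeps both copies of the join columns in the result:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AmbiguityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ambiguity-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for p_value and t_d
    val left  = Seq((40.7, -74.0, "a")).toDF("origin_latitude", "origin_longitude", "l_val")
    val right = Seq((40.7, -74.0, "b")).toDF("origin_latitude", "origin_longitude", "r_val")

    val joined = left.alias("d1").join(
      right.alias("d2"),
      col("d1.origin_latitude") === col("d2.origin_latitude") &&
        col("d1.origin_longitude") === col("d2.origin_longitude"),
      "left")

    // Both sides keep their own copy of the join columns, so
    // origin_latitude and origin_longitude each appear twice here:
    println(joined.columns.mkString(", "))
    // Referring to col("origin_latitude") on `joined` would now raise
    // the same AnalysisException about an ambiguous reference.
    spark.stop()
  }
}
```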
So you need to make sure that the DataFrame contains each column only once. You can rewrite the first join as below:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(t_d.col("origin_longitude").isNull)
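As an aside, the left join + isNull filter pattern is exactly what Spark's left_anti join type expresses directly: it keeps only the left-side rows that have no match, and never materialises the right-side columns, so no ambiguity can arise. A sketch with hypothetical data (assuming the question's filters were meant to keep only unmatched rows):

```scala
import org.apache.spark.sql.SparkSession

object AntiJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("anti-join").getOrCreate()
    import spark.implicits._

    // Made-up stand-ins for the question's p_value and t_d
    val p_value = Seq((40.7, -74.0), (51.5, -0.1)).toDF("origin_latitude", "origin_longitude")
    val t_d     = Seq((40.7, -74.0)).toDF("origin_latitude", "origin_longitude")

    // Equivalent to: left join on the two columns, then keep rows where the
    // right side is null -- expressed directly as an anti join.
    val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")
    o.show()  // only the (51.5, -0.1) row remains
    spark.stop()
  }
}
```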