Spark [Scala]: Ambiguous groupBy column name
So when testing, I get this error message:
org.apache.spark.sql.AnalysisException: Reference 'from' is ambiguous, could be: from, from.;
This happens when testing, but not when running the same code in the spark-shell...?
I'm doing a cross join on the following DataFrames:
scala> timeSpanDF.show
+----------+----------+
| from| to|
+----------+----------+
|2018-01-01|2018-02-01|
|2018-01-01|2018-03-01|
|2018-02-01|2018-03-01|
+----------+----------+
scala> df.show
+------------+----------+--------+-----+--------------------+
|pressroom_id| month|category|event| email|
+------------+----------+--------+-----+--------------------+
| 1|2017-01-01| contact| open|somebody@example.com|
| 1|2018-01-01| contact| open| me1@example.com|
| 1|2018-02-01| contact| open| me1@example.com|
| 1|2018-02-01| contact| open| me1@example.com|
| 1|2018-01-01| contact| open| you@example.com|
| 1|2018-03-01| contact| open| etc@example.com|
| 1|2018-02-01| contact| open| me2@example.com|
| 1|2018-02-01| contact| open| me2@example.com|
| 2|2018-01-01| contact| open| me1@example.com|
+------------+----------+--------+-----+--------------------+
So I do this:
val joinedDF = timeSpansDF
  .crossJoin(df)
  .filter(
    df("month") >= timeSpansDF("from")
      && df("month") < timeSpansDF("to")
  )
  .distinct
and get this:
scala> joinedDF.show
+----------+----------+------------+----------+--------+-----+---------------+
| from| to|pressroom_id| month|category|event| email|
+----------+----------+------------+----------+--------+-----+---------------+
|2018-01-01|2018-03-01| 2|2018-01-01| contact| open|me1@example.com|
|2018-02-01|2018-03-01| 1|2018-02-01| contact| open|me1@example.com|
|2018-02-01|2018-03-01| 1|2018-02-01| contact| open|me2@example.com|
|2018-01-01|2018-03-01| 1|2018-01-01| contact| open|me1@example.com|
|2018-01-01|2018-02-01| 1|2018-01-01| contact| open|me1@example.com|
|2018-01-01|2018-03-01| 1|2018-02-01| contact| open|me2@example.com|
|2018-01-01|2018-02-01| 2|2018-01-01| contact| open|me1@example.com|
|2018-01-01|2018-03-01| 1|2018-01-01| contact| open|you@example.com|
|2018-01-01|2018-03-01| 1|2018-02-01| contact| open|me1@example.com|
|2018-01-01|2018-02-01| 1|2018-01-01| contact| open|you@example.com|
+----------+----------+------------+----------+--------+-----+---------------+
Then, later on, I want to aggregate the table like this, and this is where I get the strange message:
joinedDF
  .where(col("category") === lit(category) && col("event") === lit("open"))
  .groupBy("pressroom_id", "from", "to")
  .agg(count("email").cast("integer").as("something"))
The error points at the groupBy. The strange thing is that this works in the shell, but when these operations are put into a function and tested with ScalaTest, they error out. What's going on here, Doc?
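For reference, here is a minimal, self-contained sketch of the kind of test meant above, assuming ScalaTest's AnyFunSuite and a local SparkSession (the suite name, sample rows, and final assertion are illustrative, not from the original post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.scalatest.funsuite.AnyFunSuite

class JoinedAggregationSpec extends AnyFunSuite {

  // Local session for the test run.
  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("joined-aggregation-test")
    .getOrCreate()

  import spark.implicits._

  test("counts opens per pressroom and time span") {
    val timeSpansDF = Seq(
      ("2018-01-01", "2018-02-01"),
      ("2018-01-01", "2018-03-01")
    ).toDF("from", "to")

    val df = Seq(
      (1, "2018-01-01", "contact", "open", "me1@example.com"),
      (1, "2018-02-01", "contact", "open", "me1@example.com")
    ).toDF("pressroom_id", "month", "category", "event", "email")

    val joinedDF = timeSpansDF
      .crossJoin(df)
      .filter(df("month") >= timeSpansDF("from") && df("month") < timeSpansDF("to"))
      .distinct

    // The aggregation from the question; this is where the
    // AnalysisException about 'from' surfaces under test.
    val result = joinedDF
      .where(col("category") === lit("contact") && col("event") === lit("open"))
      .groupBy("pressroom_id", "from", "to")
      .agg(count("email").cast("integer").as("something"))

    assert(result.columns.contains("something"))
  }
}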
Since I don't have the code that generates joinedDF, I prepared a DataFrame myself to stand in for joinedDF. I have tested it with ScalaTest and it works fine. Please update your code as below.
val df = Seq(
  ("2018-01-01", "2018-03-01", 2, "contact", "open", "me1@example.com"),
  ("2018-02-01", "2018-03-01", 1, "contact", "open", "me1@example.com"),
  ("2018-01-01", "2018-03-01", 1, "contact", "open", "you@example.com"),
  ("2018-02-01", "2018-03-01", 1, "contact", "open", "me1@example.com"),
  ("2018-01-01", "2018-02-01", 1, "contact", "open", "me1@example.com"),
  ("2018-01-01", "2018-02-01", 1, "contact", "open", "you@example.com")
).toDF("from", "to", "pressroom_id", "category", "event", "email")
df.show()
+----------+----------+------------+--------+-----+---------------+
| from| to|pressroom_id|category|event| email|
+----------+----------+------------+--------+-----+---------------+
|2018-01-01|2018-03-01| 2| contact| open|me1@example.com|
|2018-02-01|2018-03-01| 1| contact| open|me1@example.com|
|2018-01-01|2018-03-01| 1| contact| open|you@example.com|
|2018-02-01|2018-03-01| 1| contact| open|me1@example.com|
|2018-01-01|2018-02-01| 1| contact| open|me1@example.com|
|2018-01-01|2018-02-01| 1| contact| open|you@example.com|
+----------+----------+------------+--------+-----+---------------+
val df1 = df
  .where(col("category") === lit("contact") && col("event") === lit("open"))
  .groupBy("pressroom_id", "from", "to")
  .agg(count("email").cast("integer").as("something"))
df1.show()
+------------+----------+----------+---------+
|pressroom_id| from| to|something|
+------------+----------+----------+---------+
| 2|2018-01-01|2018-03-01| 1|
| 1|2018-01-01|2018-03-01| 1|
| 1|2018-02-01|2018-03-01| 2|
| 1|2018-01-01|2018-02-01| 2|
+------------+----------+----------+---------+
I have added the required import statement to the code:
import org.apache.spark.sql.functions._
Hope this helps!
I'm not a Scala expert, but I am a database administrator. I suspect your problem stems from using the SQL reserved word from as a column name, since the stack trace shows the exception originates in the Spark SQL module: org.apache.spark.sql.AnalysisException.
Either try changing the column name to a non-reserved word, or fully qualify the column name as joinedDF.from.
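For example, a minimal sketch of the renaming approach (the span_from/span_to names are illustrative assumptions, not from the original post):

// Rename the reserved-word columns up front so nothing downstream
// has to reference an ambiguous "from".
val spans = timeSpansDF
  .withColumnRenamed("from", "span_from")
  .withColumnRenamed("to", "span_to")

val joinedDF = spans
  .crossJoin(df)
  .filter(df("month") >= spans("span_from") && df("month") < spans("span_to"))
  .distinct

joinedDF
  .where(col("category") === lit("contact") && col("event") === lit("open"))
  .groupBy("pressroom_id", "span_from", "span_to")
  .agg(count("email").cast("integer").as("something"))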
Note: your second code snippet references a DataFrame named timeSpanDF, while your third calls it timeSpansDF (plural).
Edit: As a new member of the community I don't have enough reputation to post a comment on @KZapagol's answer, but I believe the essence of his answer is that there is a typo in the original poster's joinedDF.where clause: col("category") === lit(category) should be col("category") === lit("contact").