Using a defined value in a filter function concerning a DataFrame in Scala Spark
I tried to find a similar question but didn't find anything related. I'm new to Spark and Scala and I'm having trouble with a specific case.
I have a DataFrame like the following:
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  0|238|        41.0|
|  0|159|        46.0|
|238| 12|        36.0|
|  1|235|        44.0|
|  2|139|        50.0|
+---+---+------------+
My problem is: (1) I want to get the destination with the lowest value of "relationship" for src = 0, and (2) reuse this value. I'm able to get something for (1) using both
val j = orderedSrc.filter("src == 0").orderBy("relationship").select("dst").take(5)
and
val h = j(0)(0)
In my example it would return
j: Array[org.apache.spark.sql.Row] = Array([238], [159])
and
h: Any = 238
My question concerns (2):
How can I use this h value inside the previous query? Something that would look like
val j = orderedSrc.filter("src==h").orderBy("relationship").select("dst").take(5)
which would return
Array[org.apache.spark.sql.Row] = Array([12])
Thanks in advance if you can help :-)!
As @Lamanus wrote, the solution was:
orderedSrc.filter($"src" === h).orderBy("relationship").select($"dst").take(5)
Many thanks!
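For anyone landing here later: in the question, h comes out as Any because j(0)(0) is untyped Row access. A minimal sketch of the same two steps with a typed value is below. It assumes the src and dst columns are integers and builds a local SparkSession just for the demo; the object and method names (ReuseFilterValue, demo) are made up for illustration. It also shows that the string form from the question works once h is interpolated into the expression.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper; names are illustrative, not from the original post.
object ReuseFilterValue {
  def demo(): (Int, Seq[Int], Seq[Int]) = {
    // Assumption: a local session for the demo; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reuse-filter-value")
      .getOrCreate()
    import spark.implicits._

    val orderedSrc = Seq(
      (0, 238, 41.0),
      (0, 159, 46.0),
      (238, 12, 36.0),
      (1, 235, 44.0),
      (2, 139, 50.0)
    ).toDF("src", "dst", "relationship")

    // Step (1): typed extraction, so h is an Int rather than Any.
    // head() throws if no row matches src = 0.
    val h: Int = orderedSrc
      .filter($"src" === 0)
      .orderBy("relationship")
      .select("dst")
      .head()
      .getInt(0)

    // Step (2a): Column expression, as in the accepted solution.
    val viaColumn = orderedSrc
      .filter($"src" === h)
      .orderBy("relationship")
      .select("dst")
      .take(5)
      .map(_.getInt(0))
      .toSeq

    // Step (2b): SQL string with h interpolated by Scala's s"..." syntax.
    val viaString = orderedSrc
      .filter(s"src == $h")
      .orderBy("relationship")
      .select("dst")
      .take(5)
      .map(_.getInt(0))
      .toSeq

    spark.stop()
    (h, viaColumn, viaString)
  }
}
```

The original filter("src==h") does not work because Spark parses the whole string as a SQL expression and looks for a column named h; the s"..." interpolator substitutes the Scala value before Spark sees the string.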
You don't need to use take in the intermediate steps (this won't scale); use a join instead:
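import org.apache.spark.sql.functions.{min, struct}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  (0, 238, 41.0),
  (0, 159, 46.0),
  (238, 12, 36.0),
  (1, 235, 44.0),
  (2, 139, 50.0)
).toDF("src", "dest", "relationship") // note: the column is called dest here, not dst

// min over a struct compares fields left to right, so this keeps
// the (relationship, dest) pair with the lowest relationship for src = 0
val h = df.where($"src" === 0)
  .select(min(struct($"relationship", $"dest")).as("min"))

// left semi join: keep only the rows of df whose src equals the selected dest
df
  .join(h, df("src") === h("min.dest"), "leftsemi")
  .show()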
+---+----+------------+
|src|dest|relationship|
+---+----+------------+
|238|  12|        36.0|
+---+----+------------+
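Or the same with window functions:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, struct, when}

df
  // an empty partitionBy() treats the whole DataFrame as a single window
  .withColumn("selector", min(when($"src" === 0, struct($"relationship", $"dest"))).over(Window.partitionBy()))
  .where($"src" === $"selector.dest")
  .drop($"selector")
  .show()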
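Both variants rely on min over struct($"relationship", $"dest") comparing the fields in order: relationship first, dest only as a tie-breaker. A rough plain-Scala sketch of that selection logic, with no Spark involved, may make it easier to follow. Everything here (Edge, StructMinSketch, closest, nextHop) is a hypothetical stand-in invented for illustration, not part of the Spark API.

```scala
object StructMinSketch {
  // Hypothetical stand-in for one row of the DataFrame.
  final case class Edge(src: Int, dest: Int, relationship: Double)

  val edges: Seq[Edge] = Seq(
    Edge(0, 238, 41.0),
    Edge(0, 159, 46.0),
    Edge(238, 12, 36.0),
    Edge(1, 235, 44.0),
    Edge(2, 139, 50.0)
  )

  // Field-by-field comparison in the same order as struct(relationship, dest).
  val byRelationshipThenDest: Ordering[Edge] = new Ordering[Edge] {
    def compare(a: Edge, b: Edge): Int = {
      val c = java.lang.Double.compare(a.relationship, b.relationship)
      if (c != 0) c else Integer.compare(a.dest, b.dest)
    }
  }

  // Counterpart of where(src === ...) followed by min(struct(...)).
  def closest(src: Int): Option[Edge] = {
    val candidates = edges.filter(_.src == src)
    if (candidates.isEmpty) None
    else Some(candidates.min(byRelationshipThenDest))
  }

  // Counterpart of the left semi join: rows whose src equals the selected dest.
  def nextHop(src: Int): Seq[Edge] =
    closest(src).toSeq.flatMap(c => edges.filter(_.src == c.dest))
}
```

With the sample data, StructMinSketch.nextHop(0) selects the edge to 238 first and then returns the single row starting at 238, which matches the one-row result shown above.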