Using a defined value in a filter function on a DataFrame in Scala Spark

Question

I tried to find a similar issue but didn't find anything related. I'm new to Spark and Scala and I'm having trouble with a specific case.

I have a DataFrame like the following:

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  0|238|        41.0|
|  0|159|        46.0|
|238| 12|        36.0|
|  1|235|        44.0|
|  2|139|        50.0|
+---+---+------------+

My problem is: (1) I want to get the destination with the lowest value of "relationship" for src = 0, and (2) reuse this value. I'm able to get something for (1) using both val j = orderedSrc.filter("src == 0").orderBy("relationship").select("dst").take(5) and val h = j(0)(0).

In my example it returns j: Array[org.apache.spark.sql.Row] = Array([238], [159]) and h: Any = 238.

My question concerns (2):

How can I use this h value inside the previous query? Something that would look like val j = orderedSrc.filter("src==h").orderBy("relationship").select("dst").take(5), which would return Array[org.apache.spark.sql.Row] = Array([12])?

Thanks in advance if you can help :-)!

Answer 1

As @Lamanus wrote, the solution was:

orderedSrc.filter($"src" === h).orderBy("relationship").select($"dst").take(5)
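For context, the attempt filter("src==h") most likely fails because the string is parsed as a SQL expression, so h is read as a column name rather than as the Scala value. The sketch below is only an illustration, not part of the original answer: it rebuilds the question's table under the name orderedSrc, assumes a local SparkSession, and shows two ways to pass the value in (a Column expression and an interpolated string). Reading h with getInt instead of j(0)(0) is an added suggestion so that h is an Int rather than Any.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("filter-with-value").getOrCreate()
import spark.implicits._

// Same rows as the table in the question.
val orderedSrc = Seq(
  (0, 238, 41.0), (0, 159, 46.0), (238, 12, 36.0), (1, 235, 44.0), (2, 139, 50.0)
).toDF("src", "dst", "relationship")

// Step (1): dst with the lowest relationship for src = 0, read as a typed Int.
// head() throws if no row matches, so real code may want headOption-style handling.
val h: Int = orderedSrc
  .filter($"src" === 0)
  .orderBy("relationship")
  .select("dst")
  .head()
  .getInt(0)

// Step (2), option A: Column expression; h is passed as a Scala value.
val viaColumn = orderedSrc.filter($"src" === h).orderBy("relationship").select("dst").take(5)

// Step (2), option B: SQL string; the s-interpolator splices the value of h into the text.
val viaString = orderedSrc.filter(s"src = $h").orderBy("relationship").select("dst").take(5)
```

Both variants should yield the single row [12] for this data. The Column form is usually the safer choice, since it avoids building SQL text by hand.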

Many thanks!

Answer 2

You don't need to use take in the intermediate steps (this won't scale); use a join instead:

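import org.apache.spark.sql.functions.{min, struct}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  (0, 238, 41.0),
  (0, 159, 46.0),
  (238, 12, 36.0),
  (1, 235, 44.0),
  (2, 139, 50.0)
).toDF("src", "dest", "relationship")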


val h = df.where($"src"===0)
  .select(min(struct($"relationship",$"dest")).as("min"))

df
  .join(h,df("src")===h("min.dest"),"leftsemi")
  .show()

+---+----+------------+
|src|dest|relationship|
+---+----+------------+
|238|  12|        36.0|
+---+----+------------+
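The join above relies on min over a struct: Spark compares structs field by field in declaration order, so putting relationship first makes min return the (relationship, dest) pair with the lowest relationship. The following is a minimal sketch, assuming a local SparkSession, that pulls that intermediate value out on its own; the names minRow, lowestRelationship and closestDest are introduced here purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{min, struct}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("min-struct").getOrCreate()
import spark.implicits._

// Same data as in the answer (note the column is named dest here, not dst).
val df = Seq(
  (0, 238, 41.0), (0, 159, 46.0), (238, 12, 36.0), (1, 235, 44.0), (2, 139, 50.0)
).toDF("src", "dest", "relationship")

// Aggregate the src = 0 rows to the struct with the smallest relationship,
// then flatten the struct back into two ordinary columns.
val minRow = df
  .where($"src" === 0)
  .select(min(struct($"relationship", $"dest")).as("min"))
  .select($"min.relationship", $"min.dest")
  .head()

val lowestRelationship: Double = minRow.getDouble(0)
val closestDest: Int = minRow.getInt(1)
```

For this data the pair is (41.0, 238), which is why the semi-join keeps only the row with src = 238. Ties on relationship would be broken by the smaller dest, because dest is the second field of the struct.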

Or the same with window functions:

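import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, struct, when}

// An empty partitionBy() puts every row in one window, so the minimum is global.
df
  .withColumn("selector", min(when($"src" === 0, struct($"relationship", $"dest"))).over(Window.partitionBy()))
  .where($"src" === $"selector.dest")
  .drop($"selector")
  .show()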

Answer 1
0 2019-08-16 15:38:25

Answer 2
0 Accepted 2019-08-16 18:48:06
