How to get the two nearest values in a Spark Scala DataFrame
Hi everyone, I'm new to Spark with Scala. I want to find the nearest values within each partition. My input looks something like this:
For example, in the first group (id 1), value1 is 3, which lies between 2 and 7 in the value2 column.
+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |1         |
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |2         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
|3       |5         |7         |
|3       |5         |8         |
+--------+----------+----------+
My output should look like this:
+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
+--------+----------+----------+
Can someone guide me on how to solve this, please?
Since you appear to want to learn, I've provided pseudocode and references instead of a code answer, so that you can find the answers for yourself.
- Group the rows by id and value1, aggregating value2 with collect_list, so that all the value2 values are collected into an array.
- Select id and the array with value1 added to it (concat), then sort the array.
- Find the position of value1 in the array (array_position).
- Slice the array (slice), retrieving the value before and the value after the position returned by array_position.
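The pseudocode steps above can be sketched in Scala roughly as follows. This is an illustration under assumptions rather than the answerer's own code: the object and method names (`NearestViaArrays`, `sample`, `nearest`) are hypothetical, the array functions require Spark 2.4 or later, and `element_at` is used in place of slicing to pick out the two neighbours. It also assumes value1 does not itself occur in value2; if it does, `array_position` returns the first occurrence and a neighbour may equal value1.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object NearestViaArrays {
  // Sample rows copied from the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, 3, 1), (1, 3, 2), (1, 3, 7),
      (2, 4, 2), (2, 4, 3), (2, 4, 8),
      (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8)
    ).toDF("id", "value1", "value2")
  }

  def nearest(df: DataFrame): DataFrame = {
    val sorted = col("sorted")
    val pos = col("pos")
    df.groupBy("id", "value1")
      // Step 1: collect every value2 of the group into one array.
      .agg(collect_list("value2").as("values"))
      // Step 2: append value1 to the array and sort it ascending.
      .withColumn("sorted", array_sort(concat(col("values"), array(col("value1")))))
      // Step 3: 1-based position of value1; cast because element_at expects an int index.
      .withColumn("pos", array_position(sorted, col("value1")).cast("int"))
      // Step 4: take the neighbours, guarding both ends of the array.
      .withColumn("before", when(pos > 1, element_at(sorted, pos - 1)))
      .withColumn("after", when(pos < size(sorted), element_at(sorted, pos + 1)))
      // Back to one row per neighbour, in the shape of the expected output.
      .select(col("id"), col("value1"),
        explode(array(col("before"), col("after"))).as("value2"))
      .filter(col("value2").isNotNull)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("nearest-via-arrays").getOrCreate()
    nearest(sample(spark)).orderBy("id", "value2").show()
    spark.stop()
  }
}
```

The `when` guards leave `before` or `after` null when value1 is the smallest or largest element of the sorted array, and the final filter drops those rows.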
You will need window functions for this.
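import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, lag, lead, min}

// Unordered window: aggregates such as min see the whole group.
val group = Window.partitionBy("id", "value1")
// Ordered window: lag and lead need a row order within the group.
val window = group.orderBy(asc("value2"))

val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2") - col("prev"))
  .withColumn("dist_next", col("next") - col("value2"))
  // Over the ordered window, min would be a running minimum, so use group here.
  .withColumn("min", min(col("dist_prev")).over(group))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")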
I haven't tested it, so think of it as an illustration of the concept rather than a working, ready-to-use example.
Here is what's going on:
- First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
- Next, add prev and next columns to the dataframe that contain the value of the value2 column from the previous and next row within the group, respectively. (prev will be null for the first row in the group, and next will be null for the last row; that is ok.)
- Add dist_prev and dist_next to contain the distance between value2 and the prev and next values, respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row.)
- Find the minimum value of dist_prev within each group, and add it as the min column. (Note that the minimum value of dist_next is the same by construction, so we only need one column here.)
- Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev.
This finds the tightest pair unless there are multiple rows with the same distance from each other. That case was not accounted for in your question, so we don't know what kind of behavior you want there. This implementation will simply return all of these rows.
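One point worth checking against the question: the tightest-pair logic compares consecutive value2 values with each other, so for id 1 (value2 = 1, 2, 7) the closest pair is 1 and 2, whereas the expected output lists 2 and 7, the values on either side of value1. If the intent is "the nearest value2 below value1 and the nearest value2 above it" (an assumption read off the expected output, not something the question states outright), a window-based sketch could look like the following. The names `NearestAroundValue1`, `sample`, `nearest`, `below` and `above` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, when}

object NearestAroundValue1 {
  // Sample rows copied from the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, 3, 1), (1, 3, 2), (1, 3, 7),
      (2, 4, 2), (2, 4, 3), (2, 4, 8),
      (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8)
    ).toDF("id", "value1", "value2")
  }

  def nearest(df: DataFrame): DataFrame = {
    // No orderBy, so each aggregate is computed over the whole group.
    val group = Window.partitionBy("id", "value1")
    df
      // Largest value2 strictly below value1 (null if there is none).
      .withColumn("below",
        max(when(col("value2") < col("value1"), col("value2"))).over(group))
      // Smallest value2 strictly above value1 (null if there is none).
      .withColumn("above",
        min(when(col("value2") > col("value1"), col("value2"))).over(group))
      // Keep only the rows holding one of the two neighbours.
      .filter(col("value2") === col("below") || col("value2") === col("above"))
      .drop("below", "above")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("nearest-around-value1").getOrCreate()
    nearest(sample(spark)).orderBy("id", "value2").show()
    spark.stop()
  }
}
```

In this sketch, rows where value2 equals value1 are never selected, and if the nearest value occurs several times in a group, every such row is kept. Both choices are assumptions, since the question does not cover those cases.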