How to get the two nearest values in a Spark Scala DataFrame

Question

Hi everyone, I'm new to Spark with Scala. I want to find the nearest values within each partition. My input looks like this:

For example, in the first group value1 (3) lies between 2 and 7 in the value2 column.

+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |1         |
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |2         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
|3       |5         |7         |
|3       |5         |8         |
+--------+----------+----------+

My output should look like this:

+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
+--------+----------+----------+

Can someone show me how to solve this, please?

Answer 1

Since you appear to want to learn, I've given you pseudocode and function names to look up instead of a finished code answer, so you can work out the solution yourself.

Group the rows by id and value1 and aggregate value2 with collect_list, so that all value2 entries of a group end up in one array.
Select id and add value1 to that array with concat, then sort the array.
Find value1 in the sorted array with array_position.
Slice the array to retrieve the value just before and the value just after the position returned by array_position.
If the array has fewer than 3 elements, do error handling.
The first and the last value in the sliced array are then your "closest numbers".
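
The steps above can be sketched roughly as follows. This is a minimal, hedged sketch rather than the answerer's own code: it assumes Spark 3.x, builds the sample data from the question by hand, and uses element_at with guards in place of an explicit slice. The column names vals, sorted, pos, below and above are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("nearest-values").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  (1, 3, 1), (1, 3, 2), (1, 3, 7),
  (2, 4, 2), (2, 4, 3), (2, 4, 8),
  (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8)
).toDF("id", "value1", "value2")

val neighbours = df
  // Collect every value2 of a group into one array.
  .groupBy("id", "value1")
  .agg(collect_list("value2").as("vals"))
  // Add value1 to the array and sort it.
  .withColumn("sorted", sort_array(concat(col("vals"), array(col("value1")))))
  // 1-based position of the first occurrence of value1 in the sorted array.
  .withColumn("pos", array_position(col("sorted"), col("value1")).cast("int"))
  // Neighbour before and after; the guards leave null when value1 sits at either end.
  .withColumn("below", when(col("pos") > 1, element_at(col("sorted"), col("pos") - 1)))
  .withColumn("above",
    when(col("pos") < size(col("sorted")), element_at(col("sorted"), col("pos") + 1)))

// Back to one row per value, matching the layout in the question.
val result = neighbours
  .select(col("id"), col("value1"), explode(array(col("below"), col("above"))).as("value2"))
  .filter(col("value2").isNotNull)
  .orderBy("id", "value2")

result.show()
```

Note that if value2 already contains a value equal to value1, one of the neighbours found this way may be that equal value, so that case needs its own handling.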

Answer 2

You will need window functions for this.

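import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, lag, lead, min}

// Rows of each (id, value1) group, sorted by value2
val window = Window
  .partitionBy("id", "value1")
  .orderBy(asc("value2"))

// Same groups without ordering, so min() covers the whole group, not a running frame
val groupWindow = Window.partitionBy("id", "value1")

val result = df
  .withColumn("prev", lag("value2", 1).over(window))
  .withColumn("next", lead("value2", 1).over(window))
  .withColumn("dist_prev", col("value2") - col("prev"))
  .withColumn("dist_next", col("next") - col("value2"))
  .withColumn("min", min(col("dist_prev")).over(groupWindow))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")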

I haven't tested it, so treat it as an illustration of the concept rather than a working, ready-to-use example.

Here is what is going on:

First, create a window that describes your grouping rule: we want the rows grouped by the first two columns and sorted by the third one within each group.
Next, add prev and next columns to the dataframe; they contain the value of the value2 column from the previous and the next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row; that is ok.)
Add dist_prev and dist_next to hold the distance between value2 and the prev and next values respectively. (Note that dist_prev for each row has the same value as dist_next for the previous row.)
Find the minimum value of dist_prev within each group and add it as the min column (the minimum of dist_next is the same by construction, so one column is enough). The minimum has to cover the whole group: a window with an orderBy defaults to a frame that ends at the current row, which would give a running minimum instead.
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless several pairs of rows are the same distance apart. That case was not covered in your question, so we don't know what behavior you want there; this implementation simply returns all of those rows.
Finally, drop all the extra columns that were added to the dataframe to return it to its original shape.
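
The minimum step is the easiest one to get wrong, so here is a small hedged sketch of how the window definition changes what min returns. It assumes Spark 3.x and a made-up single group whose value2 gaps are 4, 4 and 1. The names ordered, wholeGroup, running_min and group_min are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for the sketch; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("window-frames").getOrCreate()
import spark.implicits._

// Hypothetical group: consecutive value2 entries are 4, 4 and 1 apart.
val df = Seq((1, 3, 1), (1, 3, 5), (1, 3, 9), (1, 3, 10)).toDF("id", "value1", "value2")

// With an ordering, the default frame runs from the start of the group to the current row.
val ordered = Window.partitionBy("id", "value1").orderBy("value2")
// Without an ordering, the frame is the whole group.
val wholeGroup = Window.partitionBy("id", "value1")

val compared = df
  .withColumn("dist_prev", col("value2") - lag("value2", 1).over(ordered))
  // Running minimum: only sees the rows up to the current one.
  .withColumn("running_min", min("dist_prev").over(ordered))
  // Group minimum: sees every row of the group.
  .withColumn("group_min", min("dist_prev").over(wholeGroup))
  .orderBy("value2")

compared.show()
```

Here running_min comes out as null, 4, 4, 1 down the rows while group_min is 1 on every row, so filtering against the running value would also keep the rows with value2 5 and 9, not just the tightest pair 9 and 10.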
