
How to get the two nearest values in a Spark Scala DataFrame

Hi everyone, I'm new to Spark and Scala. I want to find the nearest values within each partition using Spark Scala. My input looks something like this:

For example, in the first group (id 1), value1 is 3, which lies between 2 and 7 in the value2 column.

+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |1         |
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |2         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
|3       |5         |7         |
|3       |5         |8         |
+--------+----------+----------+

My output should look like this:

+--------+----------+----------+
|id      |value1    |value2    |
+--------+----------+----------+
|1       |3         |2         |
|1       |3         |7         |
|2       |4         |3         |
|2       |4         |8         |
|3       |5         |3         |
|3       |5         |6         |
+--------+----------+----------+

Can someone guide me on how to solve this, please?

Since you appear to want to learn, I've provided pseudocode and references rather than a code answer, so that you can find the answers for yourself.

  1. Group the rows by id and value1 (groupBy), aggregating value2 with collect_list so that all value2 entries of a group end up in one array.
  2. Select id and the array with value1 appended to it (concat), then sort the array.
  3. Find the position of value1 in the sorted array (array_position).
  4. Splice the array: retrieve the value just before and the value just after the position returned by array_position.
  5. If the array has fewer than 3 elements, do error handling.
  6. The first and the last value of the spliced array are now your "closest numbers".
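
The steps above could be sketched roughly as follows. This is a minimal sketch, not the answerer's own code: it assumes Spark 2.4 or later on the classpath, and the names NearestByArray, nearest, values, sorted, pos, below and above are hypothetical. It also assumes that value1 lies strictly between the smallest and largest value2 of its group and does not itself occur in value2, so the error handling from step 5 is left out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object NearestByArray {
  // Returns, per (id, value1) group, the value2 just below and just above value1.
  // Assumption: value1 is strictly inside the range of value2 and not contained in it.
  def nearest(df: DataFrame): DataFrame =
    df.groupBy("id", "value1")
      .agg(collect_list("value2").as("values"))                      // step 1
      .withColumn("sorted",
        sort_array(concat(col("values"), array(col("value1")))))     // step 2
      .withColumn("pos",
        array_position(col("sorted"), col("value1")).cast("int"))    // step 3, 1-based
      .withColumn("below", element_at(col("sorted"), col("pos") - 1)) // step 4
      .withColumn("above", element_at(col("sorted"), col("pos") + 1))
      // Turn the two neighbours back into one row each, as in the expected output.
      .select(col("id"), col("value1"),
        explode(array(col("below"), col("above"))).as("value2"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("nearest-by-array").getOrCreate()
    import spark.implicits._

    // Sample input from the question.
    val df = Seq(
      (1, 3, 1), (1, 3, 2), (1, 3, 7),
      (2, 4, 2), (2, 4, 3), (2, 4, 8),
      (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8)
    ).toDF("id", "value1", "value2")

    nearest(df).orderBy("id", "value2").show()
    spark.stop()
  }
}
```

For the sample input in the question, this should yield the six rows listed in the expected output. If value1 can be the smallest or largest element, or can repeat a value2 entry, the two element_at lookups would need guarding first.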

You will need window functions for this.

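import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, lead, min}

// Rows of each (id, value1) group, ordered by value2, for lag/lead.
val ordered = Window
  .partitionBy("id", "value1")
  .orderBy(col("value2").asc)

// Same grouping without ordering, so that min spans the whole group
// rather than a running frame up to the current row.
val wholeGroup = Window.partitionBy("id", "value1")

val result = df
  .withColumn("prev", lag("value2", 1).over(ordered))
  .withColumn("next", lead("value2", 1).over(ordered))
  .withColumn("dist_prev", col("value2") - col("prev"))
  .withColumn("dist_next", col("next") - col("value2"))
  .withColumn("min", min(col("dist_prev")).over(wholeGroup))
  .filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
  .drop("prev", "next", "dist_prev", "dist_next", "min")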

I haven't tested it, so treat it as an illustration of the concept rather than a working, ready-to-use example.

Here is what is going on:

  • First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
  • Next, add prev and next columns to the dataframe, containing the value of the value2 column from the previous and the next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row; that is OK.)
  • Add dist_prev and dist_next to hold the distance between value2 and the prev and next values respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row.)
  • Find the minimum value of dist_prev within each group, and add it as the min column. (Note that the minimum value of dist_next is the same by construction, so we only need one column here.)
  • Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless several pairs of rows are the same distance apart. That case was not covered in your question, so we don't know what behavior you want there; this implementation will simply return all of those rows.
  • Finally, drop all the extra columns that were added to the dataframe to return it to its original shape.
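
Note that the approach described above targets the pair of neighbouring value2 entries that are closest to each other; for id 1 in the question's input that would be 1 and 2, not the 2 and 7 shown in the expected output. If what is wanted is the value2 just below and just above value1, a variation along these lines might work. It is a sketch only, assuming Spark 2.x or later; the names NearestByWindow, nearestAround, below and above are hypothetical, and it swaps lag/lead for conditional aggregates over an unordered window.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, when}

object NearestByWindow {
  // Keeps, per (id, value1) group, the rows whose value2 is the largest value
  // below value1 or the smallest value above value1.
  def nearestAround(df: DataFrame): DataFrame = {
    // No orderBy: the aggregates see the whole group, not a running frame.
    val group = Window.partitionBy("id", "value1")

    df
      // Largest value2 strictly below value1 (null if there is none).
      .withColumn("below",
        max(when(col("value2") < col("value1"), col("value2"))).over(group))
      // Smallest value2 strictly above value1 (null if there is none).
      .withColumn("above",
        min(when(col("value2") > col("value1"), col("value2"))).over(group))
      .filter(col("value2") === col("below") || col("value2") === col("above"))
      .drop("below", "above")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("nearest-by-window").getOrCreate()
    import spark.implicits._

    // Sample input from the question.
    val df = Seq(
      (1, 3, 1), (1, 3, 2), (1, 3, 7),
      (2, 4, 2), (2, 4, 3), (2, 4, 8),
      (3, 5, 3), (3, 5, 6), (3, 5, 7), (3, 5, 8)
    ).toDF("id", "value1", "value2")

    nearestAround(df).orderBy("id", "value2").show()
    spark.stop()
  }
}
```

Rows where value2 equals value1 are dropped by this sketch, and duplicated value2 entries at either boundary are all returned; whether that is the desired behavior is not stated in the question.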


 