How to iterate over a dense-rank Dataset column to create an array of another column in Scala?

My input looks like this:

val windowSpec1 = Window.partitionBy($"member_id",$"plan_id",$"err_cd").orderBy($"member_id")
val windowSpec2 = Window.partitionBy($"member_id",$"plan_id").orderBy($"err_cd")
val enrollmentData = inputData.select($"member_id", $"plan_id", $"err_cd")
     .withColumn("rk", row_number().over(windowSpec1))
     .withColumn("error_index", dense_rank().over(windowSpec2))

|member_id|           plan_id|err_cd| rk|error_index|
|    M0002|      12345MH22220| EH044|  1|          1|
|    M0002|      12345MH22220| EP049|  1|          2|
|    M0003|      12345MH33330| EP051|  1|          1|
|    M0003|      12345MH33330| EP053|  1|          2|
|    M0003|      12345MH33330| EP054|  1|          3|
|    M0003|      12345MH44440| EP054|  1|          1|

Required output:

The error_codes column in my output Dataset is a Seq of strings. I need to build an array from the err_cd values; I can change the Seq type if it is not suitable.

|member_id|           plan_id|error_codes      |
|    M0002|      12345MH22220|EH044,EP049      |
|    M0003|      12345MH33330|EP051,EP053,EP054|
|    M0003|      12345MH44440|EP054            |

Please let me know if you have any suggestions.

The functions collect_list, array_distinct and array_sort should help: collect_list gathers the values of each group into an array, array_distinct removes duplicates, and array_sort orders the elements.

See the example below:

import org.apache.spark.sql.functions.{array_distinct, array_sort, collect_list}

case class WithArray(name: String, descr: String, values: Seq[Int])

case class RawData(name: String, descr: String, value: Int)

// BaseSpec is the answerer's own test base class (not shown here); it is
// assumed to provide the SparkSession `spark` and the ScalaTest matchers.
class ArrayTrials extends BaseSpec {
  describe("let's play with arrays") {
    it("here we go") {
      import spark.implicits._
      val data = Seq(
        RawData("one", "the first one", 1),
        RawData("two", "the second one", 1),
        RawData("two", "the second one", 2),
        RawData("three", "the third one", 10),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 30)).toDS
      val expected = Seq(
        WithArray("one", "the first one", Seq(1)),
        WithArray("two", "the second one", Seq(1, 2)),
        WithArray("three", "the third one", Seq(10, 20, 30)))
      // Collect the values of each group, then drop duplicates and sort them.
      val results = data.groupBy($"name", $"descr")
        .agg(array_sort(array_distinct(collect_list($"value"))).as("values"))
        .as[WithArray]
      results.collect should contain theSameElementsAs expected
    }
  }
}
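
Applied to the columns from the question, the same aggregation might look like the sketch below. The object name ErrorCodesSketch, the helper collectErrorCodes and the local SparkSession settings are assumptions made for illustration; only member_id, plan_id, err_cd and error_codes come from the question. With this approach the rk and error_index window columns should not be needed, since the grouping alone decides which codes end up in each array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, array_sort, collect_list}

object ErrorCodesSketch {
  // Hypothetical helper: one row per (member_id, plan_id) with a sorted,
  // de-duplicated array of that group's err_cd values.
  def collectErrorCodes(input: DataFrame): DataFrame =
    input
      .groupBy("member_id", "plan_id")
      .agg(array_sort(array_distinct(collect_list("err_cd"))).as("error_codes"))

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("error-codes-sketch")
      .getOrCreate()
    import spark.implicits._

    // The rows from the question, without the window-function columns.
    val inputData = Seq(
      ("M0002", "12345MH22220", "EH044"),
      ("M0002", "12345MH22220", "EP049"),
      ("M0003", "12345MH33330", "EP051"),
      ("M0003", "12345MH33330", "EP053"),
      ("M0003", "12345MH33330", "EP054"),
      ("M0003", "12345MH44440", "EP054")
    ).toDF("member_id", "plan_id", "err_cd")

    collectErrorCodes(inputData)
      .orderBy("member_id", "plan_id")
      .show(truncate = false)

    spark.stop()
  }
}
```

The resulting error_codes column has type array<string>, which maps to Seq[String] when the result is converted to a typed Dataset. If a single comma-separated string is preferred instead, wrapping the array in concat_ws(",", ...) is one option.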

