
How to Iterate Dataset column of dense rank to create Array of another column in Scala?

My input looks like this:

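import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, row_number}

// Numbers repeated (member_id, plan_id, err_cd) rows; rk is 1 when the row is unique
val windowSpec1 = Window.partitionBy($"member_id", $"plan_id", $"err_cd").orderBy($"member_id")

// Ranks the distinct error codes within each (member_id, plan_id) group
val windowSpec2 = Window.partitionBy($"member_id", $"plan_id").orderBy($"err_cd")

val enrollmentData = inputData.select($"member_id", $"plan_id", $"err_cd")
  .withColumn("rk", row_number().over(windowSpec1))
  .withColumn("error_index", dense_rank().over(windowSpec2))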

+---------+------------------+------+---+-----------+
|member_id|           plan_id|err_cd| rk|error_index|
+---------+------------------+------+---+-----------+
|    M0002|      12345MH22220| EH044|  1|          1|
|    M0002|      12345MH22220| EP049|  1|          2|
|    M0003|      12345MH33330| EP051|  1|          1|
|    M0003|      12345MH33330| EP053|  1|          2|
|    M0003|      12345MH33330| EP054|  1|          3|
|    M0003|      12345MH44440| EP054|  1|          1|
+---------+------------------+------+---+-----------+

Desired output:

The error_codes column in my output Dataset is a sequence of strings. I need to build an array; the Seq type can be changed if it is not suitable.

+---------+------------------+-----------------+
|member_id|           plan_id|error_codes      |
+---------+------------------+-----------------+
|    M0002|      12345MH22220|EH044,EP049      |
|    M0003|      12345MH33330|EP051,EP053,EP054|
|    M0003|      12345MH44440|EP054            |
+---------+------------------+-----------------+

Please let me know if you have any suggestions.

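The functions collect_list, array_distinct, and array_sort should help here: group by the key columns, collect the values into an array, remove duplicates, and sort.

See the example below.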

import org.apache.spark.sql.functions.{array_distinct, array_sort, collect_list}

// Shape of the aggregated result: one row per (name, descr) with its values
case class WithArray(name: String, descr: String, values: Seq[Int])

// Shape of the input: one row per value
case class RawData(name: String, descr: String, value: Int)

class ArrayTrials extends BaseSpec {
  describe("let's play with arrays") {
    it("here we go") {
      import spark.implicits._

      // "three" contains the value 20 twice to exercise de-duplication
      val data = Seq(
        RawData("one", "the first one", 1),
        RawData("two", "the second one", 1),
        RawData("two", "the second one", 2),
        RawData("three", "the third one", 10),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 30)).toDS

      val expected = Seq(
        WithArray("one", "the first one", Seq(1)),
        WithArray("two", "the second one", Seq(1, 2)),
        WithArray("three", "the third one", Seq(10, 20, 30)))

      // Collect the values per group, drop duplicates, then sort ascending
      val results = data.groupBy($"name", $"descr")
        .agg(array_sort(array_distinct(collect_list($"value"))).as("values"))
        .as[WithArray]

      results.collect should contain theSameElementsAs expected
    }
  }
}
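
Applied to the data in the question, the same idea might look like the sketch below. It is not taken from the original answer: the object name ErrorCodesSketch, the helper collectErrorCodes, and the local SparkSession settings are assumptions made for illustration. It uses collect_set in place of array_distinct(collect_list(...)), which also removes duplicates, and it skips the rk and error_index window columns because the aggregation does not need them. array_sort requires Spark 2.4 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_sort, col, collect_set}

// Hypothetical names: ErrorCodesSketch and collectErrorCodes are illustrative only.
object ErrorCodesSketch {

  // One row per (member_id, plan_id) with the distinct error codes, sorted ascending.
  def collectErrorCodes(input: DataFrame): DataFrame =
    input
      .groupBy(col("member_id"), col("plan_id"))
      .agg(array_sort(collect_set(col("err_cd"))).as("error_codes"))

  def main(args: Array[String]): Unit = {
    // Local session for demonstration; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("error-codes-sketch")
      .getOrCreate()
    import spark.implicits._

    // The rows shown in the question, without the window columns.
    val inputData = Seq(
      ("M0002", "12345MH22220", "EH044"),
      ("M0002", "12345MH22220", "EP049"),
      ("M0003", "12345MH33330", "EP051"),
      ("M0003", "12345MH33330", "EP053"),
      ("M0003", "12345MH33330", "EP054"),
      ("M0003", "12345MH44440", "EP054")
    ).toDF("member_id", "plan_id", "err_cd")

    collectErrorCodes(inputData)
      .orderBy(col("member_id"), col("plan_id"))
      .show(truncate = false)

    spark.stop()
  }
}
```

The resulting error_codes column has type array<string>, so it maps to Seq[String] in a case class via .as[...]. If a single comma-separated string is preferred instead, wrapping the array in concat_ws(",", ...) should produce it.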
