How to iterate over a dense-rank column of a Dataset to create an array of another column in Scala?

My input looks like this:

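// Assumes `import spark.implicits._` is in scope for the $"..." column syntax.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, row_number}

val windowSpec1 = Window.partitionBy($"member_id", $"plan_id", $"err_cd").orderBy($"member_id")

val windowSpec2 = Window.partitionBy($"member_id", $"plan_id").orderBy($"err_cd")

val enrollmentData = inputData.select($"member_id", $"plan_id", $"err_cd")
  .withColumn("rk", row_number().over(windowSpec1))
  .withColumn("error_index", dense_rank().over(windowSpec2))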

+---------+------------------+------+---+-----------+
|member_id|           plan_id|err_cd| rk|error_index|
+---------+------------------+------+---+-----------+
|    M0002|      12345MH22220| EH044|  1|          1|
|    M0002|      12345MH22220| EP049|  1|          2|
|    M0003|      12345MH33330| EP051|  1|          1|
|    M0003|      12345MH33330| EP053|  1|          2|
|    M0003|      12345MH33330| EP054|  1|          3|
|    M0003|      12345MH44440| EP054|  1|          1|
+---------+------------------+------+---+-----------+

Required output:

The error_codes column in my output dataset is a Seq of strings. I need to build it as an array; I can change the Seq type if it is not suitable.

+---------+------------------+-----------------+
|member_id|           plan_id|error_codes      |
+---------+------------------+-----------------+
|    M0002|      12345MH22220|EH044,EP049      |
|    M0003|      12345MH33330|EP051,EP053,EP054|
|    M0003|      12345MH44440|EP054            |
+---------+------------------+-----------------+

Please let me know if you have any suggestions.

The functions collect_list, array_distinct and array_sort should help: group the rows, collect the values of each group into a list, then de-duplicate and sort that list.

See the example below:

import org.apache.spark.sql.functions.{array_distinct, array_sort, collect_list}

case class WithArray(name: String, descr: String, values: Seq[Int])

case class RawData(name: String, descr: String, value: Int)

// BaseSpec is the answerer's own test base class (not shown here); it is assumed
// to provide `spark` plus the ScalaTest describe/it syntax and matchers.
class ArrayTrials extends BaseSpec {
  describe("let's play with arrays") {
    it("here we go") {
      import spark.implicits._

      val data = Seq(
        RawData("one", "the first one", 1),
        RawData("two", "the second one", 1),
        RawData("two", "the second one", 2),
        RawData("three", "the third one", 10),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 20),
        RawData("three", "the third one", 30)).toDS

      val expected = Seq(
        WithArray("one", "the first one", Seq(1)),
        WithArray("two", "the second one", Seq(1, 2)),
        WithArray("three", "the third one", Seq(10, 20, 30)))

      // Collect all values per group, drop duplicates, then sort ascending.
      val results = data.groupBy($"name", $"descr")
        .agg(array_sort(array_distinct(collect_list($"value"))).as("values"))
        .as[WithArray]

      results.collect should contain theSameElementsAs expected
    }
  }
}
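
Applied to the data in the question, the same idea might look like the sketch below. It assumes Spark 2.4 or later (for array_distinct and array_sort) and a local SparkSession. The names MemberErrors, ErrorCodesExample and collectErrorCodes are hypothetical and introduced only for illustration. With this approach the rk and error_index window columns are not needed, because array_sort already orders the codes within each group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, array_sort, col, collect_list, concat_ws}

// Hypothetical result type; field names mirror the question's columns.
case class MemberErrors(member_id: String, plan_id: String, error_codes: Seq[String])

object ErrorCodesExample {

  // Groups by member and plan, then gathers the distinct error codes in sorted order.
  def collectErrorCodes(input: DataFrame): DataFrame =
    input
      .groupBy(col("member_id"), col("plan_id"))
      .agg(array_sort(array_distinct(collect_list(col("err_cd")))).as("error_codes"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("error-codes").getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's input table.
    val inputData = Seq(
      ("M0002", "12345MH22220", "EH044"),
      ("M0002", "12345MH22220", "EP049"),
      ("M0003", "12345MH33330", "EP051"),
      ("M0003", "12345MH33330", "EP053"),
      ("M0003", "12345MH33330", "EP054"),
      ("M0003", "12345MH44440", "EP054")
    ).toDF("member_id", "plan_id", "err_cd")

    val grouped = collectErrorCodes(inputData)

    // An array<string> column maps onto Seq[String] in a typed Dataset.
    val typed = grouped.as[MemberErrors]
    typed.show(truncate = false)

    // Optional: join the array into one comma-separated string per row.
    grouped.withColumn("error_codes", concat_ws(",", col("error_codes"))).show(truncate = false)

    spark.stop()
  }
}
```

A Seq[String] field is fine for the typed Dataset, since Spark stores the column as array<string> either way. The concat_ws step is only needed if the output must be a single delimited string, as drawn in the required-output table.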
