Finding Percent that are Different from Selected Value Within Specific Groupings for Each Column (Spark Dataframe)

Hi all, I have an interesting and difficult problem.

Imagine a Spark DataFrame like so:

A  B  C  D  E
1  q  2  3  4
1  t  5  3  5
1  r  1  2  5
2  r  3  1  3
2  t  8  1  3
2  q  1  2  3
3  t  1  1  2
3  r  2  1  2
3  r  3  1  1

Now I have quite a complex problem.

First, I want to group by column A. Then, within each group, I want to find the row where B equals r that has the maximum value of C (the argmax row). Next, I want to group the remaining rows, those where B is not equal to r, by A and B. Finally, for each of the subsequent columns (D and E), I want to compare every value in these groups to the corresponding value from the argmax row selected previously, and report the percentage that match and the counts.

Thus, the output would be:

A  B  TotalCount  Percent-D-Match  Count-D-Match  Percent-E-Match  Count-E-Match
1  q  1           0                0              0                0
1  t  1           0                0              1                1
2  q  1           0                0              1                1
2  t  1           1                1              1                1
3  t  1           1                1              0                0

I imagine this will be a complex UDAF, but I'm unsure how to even approach it. Thanks.

Based on what I understood from your question, you can use the following logic.
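
The snippets below assume a local SparkSession named spark (the app name is arbitrary) and the usual imports that provide $, max, when, and count. A minimal sketch to reproduce the question's table:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("percent-match").master("local[*]").getOrCreate()
import spark.implicits._

// sample data copied from the question's table
val df = Seq(
  (1, "q", 2, 3, 4),
  (1, "t", 5, 3, 5),
  (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3),
  (2, "t", 8, 1, 3),
  (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2),
  (3, "r", 2, 1, 2),
  (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")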

The first step is to compute two temporary DataFrames, maxR and maxNotR:

val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))
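
For the sample data, these two intermediate DataFrames should come out as follows (the maximum C among the r rows and among the non-r rows of each A group; row order may differ):

+---+----+     +---+-------+
|A  |maxR|     |A  |maxNotR|
+---+----+     +---+-------+
|1  |1   |     |1  |5      |
|2  |3   |     |2  |8      |
|3  |3   |     |3  |1      |
+---+----+     +---+-------+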

The next step is to join them with the original DataFrame:

val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")
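
After the two left joins, every row carries the maxima of its A group; for example, the A=1 slice of joinedDF should look like this:

+---+---+---+---+---+----+-------+
|A  |B  |C  |D  |E  |maxR|maxNotR|
+---+---+---+---+---+----+-------+
|1  |q  |2  |3  |4  |1   |5      |
|1  |t  |5  |3  |5  |1   |5      |
|1  |r  |1  |2  |5  |1   |5      |
+---+---+---+---+---+----+-------+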

Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:

val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))
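
With the sample data, each remaining (A, B) pair occurs exactly once, so every group collapses to a single row and TotalCount is 1 throughout; dff should look like this (row order may differ):

+---+---+---+---+----+-------+----------+
|A  |B  |D  |E  |maxR|maxNotR|TotalCount|
+---+---+---+---+----+-------+----------+
|1  |q  |3  |4  |1   |5      |1         |
|1  |t  |3  |5  |1   |5      |1         |
|2  |q  |2  |3  |3   |8      |1         |
|2  |t  |1  |3  |3   |8      |1         |
|3  |t  |1  |2  |3   |1      |1         |
+---+---+---+---+----+-------+----------+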

The final step is to compute the expected output by comparing the columns:

  dff.select($"A",
      $"B",
      $"TotalCount",
      when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
      (when($"D" === $"maxR", 1).otherwise(0)+when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
      when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
      (when($"E" === $"maxR", 1).otherwise(0)+when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
    )

This would give you the final DataFrame:

+---+---+----------+---------------+-------------+---------------+-------------+
|A  |B  |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1  |q  |1         |0              |0            |0              |0            |
|2  |t  |1         |0              |0            |1              |1            |
|3  |t  |1         |1              |1            |0              |0            |
|2  |q  |1         |0              |0            |1              |1            |
|1  |t  |1         |0              |0            |1              |1            |
+---+---+----------+---------------+-------------+---------------+-------------+
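
One caveat: because TotalCount is 1 for every group here, the Percent columns are effectively 0/1 flags. If a group could contain several rows, one possible reading of the question's percentage requirement, which goes beyond the original answer, is to aggregate the match flags and divide by the group size. A sketch under that assumption (withFlags and result are names introduced here for illustration):

// hypothetical extension: per-group match percentage when groups have several rows
val withFlags = joinedDF.filter($"B" =!= "r")
  .withColumn("dMatch", when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0))
  .withColumn("eMatch", when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0))

// divide the number of matching rows by the group size to get a real percentage
val result = withFlags.groupBy("A", "B").agg(
  count("B").as("TotalCount"),
  (sum("dMatch") / count("B")).as("Percent-D-Match"),
  sum("dMatch").as("Count-D-Match"),
  (sum("eMatch") / count("B")).as("Percent-E-Match"),
  sum("eMatch").as("Count-E-Match")
)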
