
Finding Percent that are Different from Selected Value Within Specific Groupings for Each Column (Spark Dataframe)

Hi all, I have an interesting and difficult problem.

Imagine a Spark dataframe like so:

A  B  C  D  E
1  q  2  3  4
1  t  5  3  5
1  r  1  2  5
2  r  3  1  3
2  t  8  1  3
2  q  1  2  3
3  t  1  1  2
3  r  2  1  2
3  r  3  1  1

Now for the complicated part.

First, I want to group by column A. Then I want to find the argmax of column C among the rows where column B equals r, i.e. the row with the largest C. Next, I want to group again by each value of B that is not equal to r. Finally, for each subsequent column (D and E), I want to compare every value in that group against the value in the 'maximal' row selected previously, and find the count and the percentage that match.

Thus, the output will be:

A  B  TotalCount  Percent-D-Match  Count-D-Match  Percent-E-Match  Count-E-Match
1  q  1           0                0              0                0
1  t  1           0                0              1                1
2  q  1           0                0              1                1
2  t  1           1                1              1                1
3  t  1           1                1              0                0

I imagine this will be a complex UDAF, but I'm unsure how to even approach it. Thanks.

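From what I understood of your question, you can use the following logic.

The first step is to compute two temporary DataFrames, maxR and maxNotR, holding the maximum of column C per group of A for the rows where B is r and for the rows where B is not r:

import org.apache.spark.sql.functions._  // max, count, when; the $"..." syntax also needs spark.implicits._
val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))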

The next step is to join them back to the original DataFrame on A:

val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")

Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:

val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))

The final step is to calculate the expected output by comparing the columns:

dff.select(
  $"A",
  $"B",
  $"TotalCount",
  when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
  (when($"D" === $"maxR", 1).otherwise(0) + when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
  when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
  (when($"E" === $"maxR", 1).otherwise(0) + when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
)

This gives you the final DataFrame:

+---+---+----------+---------------+-------------+---------------+-------------+
|A  |B  |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1  |q  |1         |0              |0            |0              |0            |
|2  |t  |1         |0              |0            |1              |1            |
|3  |t  |1         |1              |1            |0              |0            |
|2  |q  |1         |0              |0            |1              |1            |
|1  |t  |1         |0              |0            |1              |1            |
+---+---+----------+---------------+-------------+---------------+-------------+
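
Note that the row for A = 2, B = t above differs from the table in the question, because D and E are compared against the maximum of C rather than against the D and E values of the argmax row. If the intent is the latter, one possible alternative is sketched below. It is only a sketch under some assumptions: a local SparkSession, the sample data from the question, and hypothetical names (`byA`, `ref`, `refD`, `refE`, `rn`) that do not appear in the original post. It uses a window function (`row_number`) instead of a UDAF to pick the reference row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("match-percent").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq(
  (1, "q", 2, 3, 4), (1, "t", 5, 3, 5), (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3), (2, "t", 8, 1, 3), (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2), (3, "r", 2, 1, 2), (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")

// For each A, keep D and E from the B == "r" row with the largest C (the argmax row).
// If several r rows tie on C, row_number picks one of them arbitrarily.
val byA = Window.partitionBy("A").orderBy($"C".desc)
val ref = df.filter($"B" === "r")
  .withColumn("rn", row_number().over(byA))
  .filter($"rn" === 1)
  .select($"A", $"D".as("refD"), $"E".as("refE"))

// Compare every non-r row with the reference row of its A group, then aggregate per (A, B).
// The inner join drops any A group that has no r row at all.
val result = df.filter($"B" =!= "r")
  .join(ref, Seq("A"))
  .groupBy("A", "B")
  .agg(
    count(lit(1)).as("TotalCount"),
    avg(when($"D" === $"refD", 1).otherwise(0)).as("Percent-D-Match"),
    sum(when($"D" === $"refD", 1).otherwise(0)).as("Count-D-Match"),
    avg(when($"E" === $"refE", 1).otherwise(0)).as("Percent-E-Match"),
    sum(when($"E" === $"refE", 1).otherwise(0)).as("Count-E-Match")
  )
  .orderBy("A", "B")

result.show(false)
```

On the sample data this should give the same five rows as the table in the question, with the percent columns expressed as fractions (0.0 or 1.0) rather than integers. Multiply the `avg` expressions by 100 if an actual percentage is wanted.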
