Finding Percent that are Different from Selected Value Within Specific Groupings for Each Column (Spark Dataframe)

Question

Hi all, I have an interesting and difficult problem.

Imagine a Spark dataframe like so:

A  B  C  D  E
1  q  2  3  4
1  t  5  3  5
1  r  1  2  5
2  r  3  1  3
2  t  8  1  3
2  q  1  2  3
3  t  1  1  2
3  r  2  1  2
3  r  3  1  1

Now I have a quite complex problem.

First, I want to group by column A. Then, within each group, I want to find the argmax for column C among the rows where column B is equal to r, that is, the row with the largest C. Next, I want to group again by each value of B not equal to r. Finally, for each subsequent column (D and E), I want to compare the values in those groups to the values of the 'maximal' row selected previously, and find the count and the percentage that match.

Thus, the output will be:

A  B  TotalCount  Percent-D-Match  Count-D-Match  Percent-E-Match  Count-E-Match
1  q  1           0                0              0                0
1  t  1           0                0              1                1
2  q  1           0                0              1                1
2  t  1           1                1              1                1
3  t  1           1                1              0                0

I imagine this will need a complex UDAF, but I'm unsure how to even approach it. Thanks.

Answer 1

From what I understood of your question, you can use the following logic.
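The snippets in this answer assume a DataFrame named df that holds the sample data from the question. A minimal sketch for building it locally is shown here; the local master setting and the application name are assumptions for experimentation only, and in spark-shell a session named spark already exists.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation (assumed setup, not part of the question).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("percent-match-sample")
  .getOrCreate()
import spark.implicits._ // enables toDF and the $"col" syntax

// Sample data copied from the question.
val df = Seq(
  (1, "q", 2, 3, 4),
  (1, "t", 5, 3, 5),
  (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3),
  (2, "t", 8, 1, 3),
  (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2),
  (3, "r", 2, 1, 2),
  (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")

df.show(false)
```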

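The first step is to calculate two temporary DataFrames, maxR and maxNotR, holding the maximum of column C per group of A for the rows where B is r and where B is not r, respectively: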

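import org.apache.spark.sql.functions._ // max, count, when
import spark.implicits._ // enables $"col"; assumes a SparkSession named spark, as in spark-shell
val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))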

The next step is to join them with the original DataFrame on column A:

val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")

Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:

val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))

The final step is to calculate the output columns by comparing D and E against maxR and maxNotR:

  dff.select($"A",
      $"B",
      $"TotalCount",
      when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
      (when($"D" === $"maxR", 1).otherwise(0)+when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
      when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
      (when($"E" === $"maxR", 1).otherwise(0)+when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
    )

This gives the final DataFrame:

+---+---+----------+---------------+-------------+---------------+-------------+
|A  |B  |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1  |q  |1         |0              |0            |0              |0            |
|2  |t  |1         |0              |0            |1              |1            |
|3  |t  |1         |1              |1            |0              |0            |
|2  |q  |1         |0              |0            |1              |1            |
|1  |t  |1         |0              |0            |1              |1            |
+---+---+----------+---------------+-------------+---------------+-------------+
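Note that this result differs from the expected table in the question for the row A = 2, B = t, because D and E are compared with the maximum value of C rather than with the D and E values of the r row where C is maximal. If the latter is what is wanted, the following is a minimal sketch under that assumption. The names byA, ref, refD, refE, rn and result are made up for illustration, ties on C are broken arbitrarily, and the percentages are expressed as fractions between 0 and 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for experimentation (assumed setup); spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("argmax-match").getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val df = Seq(
  (1, "q", 2, 3, 4), (1, "t", 5, 3, 5), (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3), (2, "t", 8, 1, 3), (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2), (3, "r", 2, 1, 2), (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")

// Reference row per group of A: the "r" row with the largest C.
// row_number picks exactly one row; which one wins on a tie in C is arbitrary.
val byA = Window.partitionBy("A").orderBy($"C".desc)
val ref = df.filter($"B" === "r")
  .withColumn("rn", row_number().over(byA))
  .filter($"rn" === 1)
  .select($"A", $"D".as("refD"), $"E".as("refE"))

// Compare every non-"r" row with the reference row of its group, then aggregate per (A, B).
// The inner join drops groups of A that have no "r" row at all.
val result = df.filter($"B" =!= "r")
  .join(ref, Seq("A"))
  .groupBy("A", "B")
  .agg(
    count(lit(1)).as("TotalCount"),
    avg(when($"D" === $"refD", 1).otherwise(0)).as("Percent-D-Match"),
    sum(when($"D" === $"refD", 1).otherwise(0)).as("Count-D-Match"),
    avg(when($"E" === $"refE", 1).otherwise(0)).as("Percent-E-Match"),
    sum(when($"E" === $"refE", 1).otherwise(0)).as("Count-E-Match")
  )
  .orderBy("A", "B")

result.show(false)
```

With the sample data this should reproduce the expected table from the question. Grouping only by A and B, rather than also by D and E, keeps TotalCount as the number of rows per (A, B) pair, so the percentage is the share of matching rows within that pair.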

1 answers

solution1
0 2017-09-22 01:37:19
