Hi all, I have an interesting and difficult problem.
Imagine a Spark dataframe like so:
A B C D E
1 q 2 3 4
1 t 5 3 5
1 r 1 2 5
2 r 3 1 3
2 t 8 1 3
2 q 1 2 3
3 t 1 1 2
3 r 2 1 2
3 r 3 1 1
Here is what I need to compute.
First I want to group by column A. Then, within each group, I want to find the argmax of column C among the rows where column B is equal to r. Then I want to group again by each value of B not equal to r. Finally, for each subsequent column (D and E), I want to compare the values in each of those groups to the value in the 'maximal' row selected previously, and find the count and the percentage that match.
Thus, the output will be:
A B TotalCount Percent-D-Match Count-D-Match Percent-E-Match Count-E-Match
1 q 1 0 0 0 0
1 t 1 0 0 1 1
2 q 1 0 0 1 1
2 t 1 1 1 1 1
3 t 1 1 1 0 0
I imagine this will need a complex UDAF, but I'm unsure how to even approach it. Thanks.
According to what I understood from your question, you can use the following logic.
The first step is to compute two temporary DataFrames, maxR and maxNotR:
val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))
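The snippets in this answer assume a DataFrame named df holding the question's data, plus the usual Spark imports. A minimal setup sketch, assuming a local SparkSession (the app name and master setting are placeholders, not something from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupby-compare-sketch")
  .getOrCreate()
import spark.implicits._ // enables the $"col" syntax and toDF

// The sample data from the question.
val df = Seq(
  (1, "q", 2, 3, 4),
  (1, "t", 5, 3, 5),
  (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3),
  (2, "t", 8, 1, 3),
  (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2),
  (3, "r", 2, 1, 2),
  (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")
```

With this in place, maxR holds the largest C among the r rows of each A, and maxNotR the largest C among the remaining rows.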
The next step is to join them with the original DataFrame:
val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")
Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:
val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))
The final step is to calculate the expected output by comparing the columns:
dff.select(
  $"A",
  $"B",
  $"TotalCount",
  when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
  (when($"D" === $"maxR", 1).otherwise(0) + when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
  when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
  (when($"E" === $"maxR", 1).otherwise(0) + when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
)
This gives the final DataFrame:
+---+---+----------+---------------+-------------+---------------+-------------+
|A  |B  |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1  |q  |1         |0              |0            |0              |0            |
|2  |t  |1         |0              |0            |1              |1            |
|3  |t  |1         |1              |1            |0              |0            |
|2  |q  |1         |0              |0            |1              |1            |
|1  |t  |1         |0              |0            |1              |1            |
+---+---+----------+---------------+-------------+---------------+-------------+
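Note that this table differs from the question's expected output in the (2, t) row, where the question expects a D match. That is because the code above compares D and E against the maximum values of C, whereas the question seems to ask for a comparison against the D and E values of the r row that has the largest C. A possible alternative under that second reading is sketched below. It assumes that ties on C may be broken by D and then E, and that groups of A without any r row can be dropped; the names ref, refD and refE are made up for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("argmax-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "q", 2, 3, 4), (1, "t", 5, 3, 5), (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3), (2, "t", 8, 1, 3), (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2), (3, "r", 2, 1, 2), (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")

// Reference row per A: the B == "r" row with the largest C.
// max over a struct compares fields left to right, so C decides first
// and ties on C fall through to D, then E.
val ref = df.filter($"B" === "r")
  .groupBy("A")
  .agg(max(struct($"C", $"D", $"E")).as("m"))
  .select($"A", $"m.D".as("refD"), $"m.E".as("refE"))

// Compare every non-r row with its group's reference row, then aggregate per (A, B).
// The inner join drops any A that has no r row.
val result = df.filter($"B" =!= "r")
  .join(ref, Seq("A"))
  .groupBy("A", "B")
  .agg(
    count(lit(1)).as("TotalCount"),
    avg(when($"D" === $"refD", 1).otherwise(0)).as("Percent-D-Match"),
    sum(when($"D" === $"refD", 1).otherwise(0)).as("Count-D-Match"),
    avg(when($"E" === $"refE", 1).otherwise(0)).as("Percent-E-Match"),
    sum(when($"E" === $"refE", 1).otherwise(0)).as("Count-E-Match")
  )
  .orderBy("A", "B")

result.show(false)
```

On the sample data this should reproduce the rows in the question, with the percentages expressed as fractions between 0.0 and 1.0. Because it groups by A and B only, TotalCount counts all rows per (A, B) rather than per distinct (D, E) combination, and no UDAF is needed.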