Finding Percent that are Different from Selected Value Within Specific Groupings for Each Column (Spark Dataframe)
Hi all, I have an interesting and difficult problem.
Imagine a Spark dataframe like so:
A B C D E
1 q 2 3 4
1 t 5 3 5
1 r 1 2 5
2 r 3 1 3
2 t 8 1 3
2 q 1 2 3
3 t 1 1 2
3 r 2 1 2
3 r 3 1 1
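To reproduce it, something like the following should work. This is only a sketch: the local SparkSession settings and the names `spark` and `df` are assumptions, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("percent-match-example")
  .getOrCreate()
import spark.implicits._

// The nine sample rows from the table above.
val df = Seq(
  (1, "q", 2, 3, 4),
  (1, "t", 5, 3, 5),
  (1, "r", 1, 2, 5),
  (2, "r", 3, 1, 3),
  (2, "t", 8, 1, 3),
  (2, "q", 1, 2, 3),
  (3, "t", 1, 1, 2),
  (3, "r", 2, 1, 2),
  (3, "r", 3, 1, 1)
).toDF("A", "B", "C", "D", "E")
```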
Now I have a quite complex problem.
First, I want to group by column A. Then I want to find the argmax of column C among the rows where column B equals r. Next, I want to group again by each value of B that is not r, and, for each subsequent column (D and E), compare the values in that group to the value in the 'maximal' row selected earlier, reporting the count and the percentage that match.
Thus, the output will be:
A B TotalCount Percent-D-Match Count-D-Match Percent-E-Match Count-E-Match
1 q 1 0 0 0 0
1 t 1 0 0 1 1
2 q 1 0 0 1 1
2 t 1 1 1 1 1
3 t 1 1 1 0 0
I imagine this will be a complex UDAF, but I'm unsure how to even approach it. Thanks.
According to what I understood from your question, you can use the following logic.
The first step is to calculate two temporary dataframes, maxR and maxNotR:
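import org.apache.spark.sql.functions._ // max, count, when
import spark.implicits._ // $"..." column syntax; assumes a SparkSession named spark
val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))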
The next step is to join them with the original dataframe:
val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")
Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:
val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))
The final step is to calculate the expected output by comparing the columns:
dff.select(
  $"A",
  $"B",
  $"TotalCount",
  when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
  (when($"D" === $"maxR", 1).otherwise(0) + when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
  when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
  (when($"E" === $"maxR", 1).otherwise(0) + when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
)
This gives the final dataframe:
+---+---+----------+---------------+-------------+---------------+-------------+
|A |B |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1 |q |1 |0 |0 |0 |0 |
|2 |t |1 |0 |0 |1 |1 |
|3 |t |1 |1 |1 |0 |0 |
|2 |q |1 |0 |0 |1 |1 |
|1 |t |1 |0 |0 |1 |1 |
+---+---+----------+---------------+-------------+---------------+-------------+
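Note that the row for A = 2, B = t differs from the expected output in the question, because D and E are compared against maximum values of C rather than against the D and E values of the selected r row. One other possible reading of the question is sketched below, assuming the reference row is the B = r row with the largest C in each A group (ties broken arbitrarily). The function name `matchStats` and the helper columns `rn`, `refD` and `refE` are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper: for each (A, B != "r") group, count and fraction of rows
// whose D / E equal the D / E of the reference row (B == "r", largest C in A).
def matchStats(df: DataFrame): DataFrame = {
  val byCDesc = Window.partitionBy("A").orderBy(col("C").desc)

  // One reference row per A; row_number picks an arbitrary row if C ties.
  val ref = df.filter(col("B") === "r")
    .withColumn("rn", row_number().over(byCDesc))
    .filter(col("rn") === 1)
    .select(col("A"), col("D").as("refD"), col("E").as("refE"))

  // 1 when the value matches the reference, 0 otherwise (also 0 if no r row exists).
  val dMatch = when(col("D") === col("refD"), 1).otherwise(0)
  val eMatch = when(col("E") === col("refE"), 1).otherwise(0)

  df.filter(col("B") =!= "r")
    .join(ref, Seq("A"), "left")
    .groupBy("A", "B")
    .agg(
      count(lit(1)).as("TotalCount"),
      avg(dMatch).as("Percent-D-Match"),
      sum(dMatch).as("Count-D-Match"),
      avg(eMatch).as("Percent-E-Match"),
      sum(eMatch).as("Count-E-Match"))
    .orderBy("A", "B")
}
```

On the sample data, `matchStats(df).show(false)` should give the five rows of the expected output from the question, with the percentages expressed as fractions (0.0 or 1.0).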