Finding the percentage that differs from a selected value within specific groups, for each column (Spark DataFrame)

Question

Hi all, I have an interesting and difficult problem.

Imagine a Spark DataFrame like so:

A  B  C  D  E
1  q  2  3  4
1  t  5  3  5
1  r  1  2  5
2  r  3  1  3
2  t  8  1  3
2  q  1  2  3
3  t  1  1  2
3  r  2  1  2
3  r  3  1  1

Now I have a quite complex problem.

First, I want to group by column A. Then I want to find the argmax for column C where column B is equal to r, i.e. the row with the largest C among the rows with B = r. Then I want to group again by each B not equal to r. Finally, for each subsequent column (D and E), I want to compare the values in each of those groups to the value from the 'maximal' row selected previously, and find the percentage that match and the counts.

Thus, the output will be:

A  B  TotalCount  Percent-D-Match  Count-D-Match  Percent-E-Match  Count-E-Match
1  q  1           0                0              0                0
1  t  1           0                0              1                1
2  q  1           0                0              1                1
2  t  1           1                1              1                1
3  t  1           1                1              0                0

I imagine this will be a complex UDAF, but I'm unsure how to even approach this. Thanks.

Answer 1

According to what I understood from your question, you can use the following logic.

The first step is to calculate two temporary DataFrames, maxR and maxNotR:

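import org.apache.spark.sql.functions._   // max, count, when
import spark.implicits._                  // $"..." syntax; assumes a SparkSession named spark
val maxR = df.filter($"B" === "r").groupBy("A").agg(max("C").as("maxR"))           // largest C per A among rows with B = r
val maxNotR = df.filter($"B" =!= "r").groupBy("A").agg(max("C").as("maxNotR"))    // largest C per A among the other rows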

The next step is to join them with the original DataFrame:

val joinedDF = df.join(maxR, Seq("A"), "left").join(maxNotR, Seq("A"), "left")

Since you don't need the rows with r in column B, you can filter them out and generate the TotalCount column:

val dff = joinedDF.filter($"B" =!= "r").groupBy("A", "B", "D", "E", "maxR", "maxNotR").agg(count("B").as("TotalCount"))

The final step is to calculate the expected output by comparing the columns:

  dff.select($"A",
      $"B",
      $"TotalCount",
      when($"D" === $"maxR" || $"D" === $"maxNotR", 1).otherwise(0).as("Percent-D-Match"),
      (when($"D" === $"maxR", 1).otherwise(0)+when($"D" === $"maxNotR", 1).otherwise(0)).as("Count-D-Match"),
      when($"E" === $"maxR" || $"E" === $"maxNotR", 1).otherwise(0).as("Percent-E-Match"),
      (when($"E" === $"maxR", 1).otherwise(0)+when($"E" === $"maxNotR", 1).otherwise(0)).as("Count-E-Match")
    )

This gives the final DataFrame:

+---+---+----------+---------------+-------------+---------------+-------------+
|A  |B  |TotalCount|Percent-D-Match|Count-D-Match|Percent-E-Match|Count-E-Match|
+---+---+----------+---------------+-------------+---------------+-------------+
|1  |q  |1         |0              |0            |0              |0            |
|2  |t  |1         |0              |0            |1              |1            |
|3  |t  |1         |1              |1            |0              |0            |
|2  |q  |1         |0              |0            |1              |1            |
|1  |t  |1         |0              |0            |1              |1            |
+---+---+----------+---------------+-------------+---------------+-------------+
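
Note that the table above differs from the question's expected output in row (2, t): the answer compares D and E against the maximum of C, whereas the question appears to ask for a comparison against the D and E values of the row where C is maximal among the B = r rows. A minimal sketch under that second reading is given below; it is not the answerer's method. The names `ArgmaxMatchSketch`, `matchStats`, `refD`, `refE` and `rn` are hypothetical, ties in C are broken arbitrarily by `row_number`, and the percentages are expressed as fractions between 0 and 1.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ArgmaxMatchSketch {
  // For each A, take D and E from the B == "r" row with the largest C,
  // then score the non-"r" rows of every (A, B) group against those values.
  def matchStats(df: DataFrame): DataFrame = {
    val byCDesc = Window.partitionBy("A").orderBy(col("C").desc)

    // Reference row per A (assumption: any one row wins when C is tied).
    val ref = df.filter(col("B") === "r")
      .withColumn("rn", row_number().over(byCDesc))
      .filter(col("rn") === 1)
      .select(col("A"), col("D").as("refD"), col("E").as("refE"))

    // 1 when the value equals the reference value, 0 otherwise
    // (also 0 when a group has no "r" row and the reference is null).
    def hit(c: String, r: String) = when(col(c) === col(r), 1).otherwise(0)

    df.filter(col("B") =!= "r")
      .join(ref, Seq("A"), "left")
      .groupBy("A", "B")
      .agg(
        count(lit(1)).as("TotalCount"),
        avg(hit("D", "refD")).as("Percent-D-Match"),
        sum(hit("D", "refD")).as("Count-D-Match"),
        avg(hit("E", "refE")).as("Percent-E-Match"),
        sum(hit("E", "refE")).as("Count-E-Match"))
      .orderBy("A", "B")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("argmax-match-sketch").getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df = Seq(
      (1, "q", 2, 3, 4), (1, "t", 5, 3, 5), (1, "r", 1, 2, 5),
      (2, "r", 3, 1, 3), (2, "t", 8, 1, 3), (2, "q", 1, 2, 3),
      (3, "t", 1, 1, 2), (3, "r", 2, 1, 2), (3, "r", 3, 1, 1)
    ).toDF("A", "B", "C", "D", "E")

    matchStats(df).show(false)
    spark.stop()
  }
}
```

Grouping only by A and B (rather than also by D and E) lets TotalCount cover the whole group, so the averages behave as real percentages once a group holds more than one row. On the sample data every group has a single row, and the counts should agree with the table in the question.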

Answer 1
0 2017-09-22 01:37:19