How to aggregate the columns dynamically in spark scala?

I have recently started working in spark-scala. I have a requirement where I need to find the sum of several columns inside a case statement. I have written the corresponding spark-sql code, but I am unable to implement the same logic dynamically in spark-scala. Below is what I am trying to achieve -

SQL code -

Select  col_A,
        round(case when sum(amt_M)   <> 0.0 then sum(amt_M) 
                   when sum(amt_N)   <> 0.0 then sum(amt_N)
                   when sum(amt_P)   <> 0.0 then sum(amt_P) 
              end,1) as pct 
from table_T1
group by col_A

The use case is to take certain columns from a variable so that the case-statement logic above can be built dynamically. That said, there are 3 columns for now, but this number may grow later.

Below is the code I have been trying in spark-scala -

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection._

val df = spark.table("database.table_T1")

val cols = "amt_M,amt_N,amt_P"

val aggCols = cols.split(",").toSeq

val sums = aggCols.map(colName => when(round(sum(colName).cast(DoubleType),1) =!= 0.0,sum(colName).cast(DoubleType).alias("sum_"+colName)))

val df2 = df.groupBy(col("col_A")).agg(sums.head, sums.tail:_*)

However, this does not give the expected result. Please help me resolve this.

Input data

+--------+--------------------+---------------------+----------------------+
|col_A   |amt_M               |amt_N                |amt_P                 |
+--------+--------------------+---------------------+----------------------+
|5C-SVS-1|0.0                 |0.04064912622009295  |1.6256888829356116E-4 |
|5C-SVS-1|0.0                 |0.026542159153759487 |8.574900251977566E-4  |
|5C-SVS-1|0.0                 |5.703894148377958E-5 |1.0745888408402782E-7 |
|5C-SVS-1|0.0                 |0.0                  |4.514561031069833E-4  |
|5C-SVS-1|0.0                 |0.011794053124022862 |0.0020388259536434656 |
|5C-SVS-1|0.0                 |7.55793849084569E-4  |0.0017105736019335327 |
|5C-SVS-1|0.0                 |0.019303776946698548 |2.240625765755109E-5  |
|5C-SVS-1|0.0                 |-8.028117213883126E-6|-2.1979360825171534E-6|
|5C-SVS-1|0.001940948839163001|0.029163686986129422 |0.09505621692309557   |
|5C-SVS-1|0.0                 |2.515835289984397E-7 |1.1486227577926157E-8 |
|5C-SVS-1|0.0                 |0.007844299114837874 |9.974187712854785E-4  |
|5C-SVS-1|0.0                 |5.033123682586349E-4 |1.3644443189731007E-4 |
|5C-SVS-1|0.0                 |0.026331681277001386 |6.022434166108063E-4  |
|5C-SVS-1|0.0                 |8.098023638080503E-6 |1.0                   |
|5C-SVS-1|0.0                 |0.03655893437209876  |0.003113370686486882  |
|5C-SVS-1|0.0                 |0.01409363925733864  |6.239415097038338E-4  |
|5C-SVS-1|0.0                 |0.02171856350557304  |0.0                   |
|5C-SVS-1|0.008435341548288601|0.03347191686227869  |0.35221710556006247   |
|5C-SVS-1|0.0                 |-2.547132732700875E-6|-0.13073525789233997  |
|5C-SVS-1|0.006057441518729214|0.024036273783621134 |0.21447606070652467   |
+--------+--------------------+---------------------+----------------------+

Expected output

+--------+---+
|   col_A|pct|
+--------+---+
|5C-SVS-1|1.0|
+--------+---+

Thanks

You can first groupBy your Dataframe on col_A, compute the sums, and then use a map operation to select the sum you want to carry forward. Something like this:

import org.apache.spark.sql.functions.{col, round, sum}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Creating the necessary schema to control the types read in when reading in our CSV
val schema = new StructType()
    .add("col_A", StringType)
    .add("amt_M", DoubleType)
    .add("amt_N", DoubleType)
    .add("amt_P", DoubleType)

// Reading in the Dataframe using our premade schema. I put the data in a CSV
// file with ; as delimiters.
val df = spark.read
    .option("header", "true")
    .option("sep",";")
    .schema(schema)
    .csv("./someData.csv")

df.show
+--------+--------------------+--------------------+--------------------+
|   col_A|               amt_M|               amt_N|               amt_P|
+--------+--------------------+--------------------+--------------------+
|5C-SVS-1|                 0.0| 0.04064912622009295|1.625688882935611...|
|5C-SVS-1|                 0.0|0.026542159153759487|8.574900251977566E-4|
|5C-SVS-1|                 0.0|5.703894148377958E-5|1.074588840840278...|
|5C-SVS-1|                 0.0|                 0.0|4.514561031069833E-4|
|5C-SVS-1|                 0.0|0.011794053124022862|0.002038825953643...|
|5C-SVS-1|                 0.0| 7.55793849084569E-4|0.001710573601933...|
|5C-SVS-1|                 0.0|0.019303776946698548|2.240625765755109E-5|
|5C-SVS-1|                 0.0|-8.02811721388312...|-2.19793608251715...|
|5C-SVS-1|0.001940948839163001|0.029163686986129422| 0.09505621692309557|
|5C-SVS-1|                 0.0|2.515835289984397E-7|1.148622757792615...|
|5C-SVS-1|                 0.0|0.007844299114837874|9.974187712854785E-4|
|5C-SVS-1|                 0.0|5.033123682586349E-4|1.364444318973100...|
|5C-SVS-1|                 0.0|0.026331681277001386|6.022434166108063E-4|
|5C-SVS-1|                 0.0|8.098023638080503E-6|                 1.0|
|5C-SVS-1|                 0.0| 0.03655893437209876|0.003113370686486882|
|5C-SVS-1|                 0.0| 0.01409363925733864|6.239415097038338E-4|
|5C-SVS-1|                 0.0| 0.02171856350557304|                 0.0|
|5C-SVS-1|0.008435341548288601| 0.03347191686227869| 0.35221710556006247|
|5C-SVS-1|                 0.0|-2.54713273270087...|-0.13073525789233997|
|5C-SVS-1|0.006057441518729214|0.024036273783621134| 0.21447606070652467|
+--------+--------------------+--------------------+--------------------+

// Aggregating our data for each distinct value in col_A, summing all the amt columns
val aggregated_df = df.groupBy(col("col_A"))
    .agg(
        round(sum(col("amt_M")).as("amt_M_sum"), 1),
        round(sum(col("amt_N")).as("amt_N_sum"), 1),
        round(sum(col("amt_P")).as("amt_P_sum"), 1)
)

aggregated_df.show
+--------+---------------------------------+---------------------------------+---------------------------------+
|   col_A|round(sum(amt_M) AS amt_M_sum, 1)|round(sum(amt_N) AS amt_N_sum, 1)|round(sum(amt_P) AS amt_P_sum, 1)|
+--------+---------------------------------+---------------------------------+---------------------------------+
|5C-SVS-1|                              0.0|                              0.3|                              1.5|
+--------+---------------------------------+---------------------------------+---------------------------------+
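
Since the question wants the column list to come from a variable, the same aggregation could also be generated from a Seq of names instead of writing each column out by hand. This is only a sketch, not part of the original answer; amtCols, sumExprs and aggregated_df2 are illustrative names:

import org.apache.spark.sql.functions.{col, round, sum}

// Sketch: build the three rounded sums from a dynamic list of column names
val amtCols = Seq("amt_M", "amt_N", "amt_P")
val sumExprs = amtCols.map(c => round(sum(col(c)), 1).as(s"${c}_sum"))

val aggregated_df2 = df.groupBy(col("col_A")).agg(sumExprs.head, sumExprs.tail: _*)

The resulting columns would then be named amt_M_sum, amt_N_sum and amt_P_sum, but the positional pattern match used below works the same way.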


// Selecting our wanted values. We make use of Scala pattern matching here to
// easily deconstruct our data and make something readable
val output = aggregated_df.map(
    row => row match {
        case Row(col_A: String, sum_amt_M: Double, sum_amt_N: Double, sum_amt_P: Double) => {
            if (sum_amt_M != 0.0)
                (col_A, sum_amt_M)
            else if (sum_amt_N != 0.0)
                (col_A, sum_amt_N)
            else
                (col_A, sum_amt_P)
        }
    }
).toDF("col_A", "pct")

output.show
+--------+---+
|   col_A|pct|
+--------+---+
|5C-SVS-1|0.3|
+--------+---+

Note: what do you do if all of the sums == 0? That is up to you to decide: here I used the value of sum_amt_P as the catch-all else case. From there, you only need to adjust the logic inside the map function to get whatever behaviour you want.
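
For example, one possible adjustment, shown here only as a sketch assuming a null pct is acceptable when every sum is zero (outputWithFallback is an illustrative name, and spark.implicits._ is assumed to be in scope as in spark-shell):

import org.apache.spark.sql.Row

// Sketch: pick the first non-zero sum; None (null in the resulting DataFrame) if all three sums are 0.0
val outputWithFallback = aggregated_df.map {
    case Row(col_A: String, sum_amt_M: Double, sum_amt_N: Double, sum_amt_P: Double) =>
        (col_A, Seq(sum_amt_M, sum_amt_N, sum_amt_P).find(_ != 0.0))
}.toDF("col_A", "pct")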

Hope this helps!

I resolved this requirement by implementing the approach below -

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

def getSumCols(columnList: List[String]): Column = {

    // Storing the conditional sum for the 1st column
    var conditionColumn: Column = when(sum(col(columnList(0)).cast(DoubleType)) =!= 0.0, sum(col(columnList(0)).cast(DoubleType)))

    // Iterating from the 2nd element to the end, appending each column to the condition built in the 1st step
    for(c <- 1 to columnList.length - 1){
        conditionColumn = conditionColumn.when( sum(col(columnList(c)).cast(DoubleType)) =!= 0.0, sum(col(columnList(c)).cast(DoubleType)) )
    }
    round(conditionColumn, 1)
}

Now this is called during the aggregation as shown below -

val cols = "amt_M,amt_N,amt_P"

val colList = cols.split(",").toList

val conditionColumn: Column = getSumCols(colList)

val df1 = df.groupBy("col_A").agg(conditionColumn.alias("pct"))
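
As a side note, the same chained when expression could also be built without a var, for example with a foldLeft. This is just an alternative sketch under the same assumptions; getSumColsFold is a hypothetical name:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, round, sum, when}
import org.apache.spark.sql.types.DoubleType

// Alternative sketch: fold the remaining columns onto the when clause built from the first column
def getSumColsFold(columnList: List[String]): Column = {
    def sumOf(name: String): Column = sum(col(name).cast(DoubleType))

    val first = when(sumOf(columnList.head) =!= 0.0, sumOf(columnList.head))
    val chained = columnList.tail.foldLeft(first) { (acc, name) =>
        acc.when(sumOf(name) =!= 0.0, sumOf(name))
    }
    round(chained, 1)
}

It can be passed to the same groupBy("col_A").agg(...) call in place of getSumCols.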
