Spark withColumn and aggregate function dropping other columns in the dataset
I have the following dataframe, which I have grouped by id, txnId, and date:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|
|      145|    0000004514|  30094|    11272020|        0| debit|  1000|
+---------+--------------+-------+------------+---------+------+------+
So after grouping, the output is:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|
|      153|    0000004512|  30097|    11272020|        0|credit|   500|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|
|      145|    0000004514|  30094|    11272020|        0| debit|  1000|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|
+---------+--------------+-------+------------+---------+------+------+
I need to add two more columns to the dataframe, holding the total amount of the credit or debit type within each group, so the output should look like:
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
|       id|         txnId|account|        date|      idl|  type|amount|totalcredit|totaldebit|
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|          0|      2000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|        700|         0|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|          0|      2000|
|      153|    0000004512|  30097|    11272020|        0|credit|   500|        700|         0|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|          0|      4000|
|      145|    0000004514|  30094|    11272020|        0|credit|  1000|       1000|         0|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|       2000|         0|
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
I have written the following code to add the new column:
Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
    .groupBy("type")
    .agg(sum("amount"))
    .withColumnRenamed("sum(amount)", "totalcredit");
but it drops the other columns from the dataset. How can I keep the other columns in the dataset?
You want conditional aggregation over a Window partitioned by id:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;
WindowSpec w = Window.partitionBy("id");

Dataset<Row> df3 = df2.withColumn(
    "totalcredit",
    when(
        col("type").equalTo("credit"),
        sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
    ).otherwise(0)
).withColumn(
    "totaldebit",
    when(
        col("type").equalTo("debit"),
        sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
    ).otherwise(0)
);

df3.show();
//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account|    date|idl|  type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513|  30095|11272020|  0| debit|  4000|          0|      5000|
//|145| 4514|  30094|11272020|  0| debit|  1000|          0|      5000|
//|135| 4512|  30096|11272020|  0|credit|  2000|       2000|         0|
//|153| 4512|  30095|11272020| 30| debit|  1000|          0|      2000|
//|153| 4512|  30096|11272020|  0|credit|   200|        700|         0|
//|153| 4512|  30097|11272020|  0| debit|  1000|          0|      2000|
//|153| 4512|  30097|11272020|  0|credit|   500|        700|         0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
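The per-partition conditional sum that `sum(when(...)).over(w)` computes can be sanity-checked outside Spark. A minimal plain-Java sketch (no Spark dependency; the `Txn` record and field names are illustrative, not part of the original code) that reproduces the per-`id` credit/debit totals from the sample data:

```java
import java.util.*;
import java.util.stream.*;

public class ConditionalSums {
    // Illustrative record standing in for one row of the dataframe
    record Txn(int id, String type, int amount) {}

    public static void main(String[] args) {
        List<Txn> rows = List.of(
            new Txn(153, "debit", 1000),
            new Txn(153, "credit", 200),
            new Txn(153, "debit", 1000),
            new Txn(153, "credit", 500),
            new Txn(145, "debit", 4000),
            new Txn(145, "debit", 1000),
            new Txn(135, "credit", 2000)
        );

        // Equivalent of sum(when(type == "credit", amount)).over(partitionBy("id")):
        // filter rows by type, then sum amounts within each id partition
        Map<Integer, Integer> totalCredit = rows.stream()
            .filter(t -> t.type().equals("credit"))
            .collect(Collectors.groupingBy(Txn::id, Collectors.summingInt(Txn::amount)));
        Map<Integer, Integer> totalDebit = rows.stream()
            .filter(t -> t.type().equals("debit"))
            .collect(Collectors.groupingBy(Txn::id, Collectors.summingInt(Txn::amount)));

        System.out.println(totalCredit.getOrDefault(153, 0)); // 200 + 500 = 700
        System.out.println(totalDebit.getOrDefault(145, 0));  // 4000 + 1000 = 5000
    }
}
```

The values match the `totalcredit`/`totaldebit` columns shown above; unlike the `groupBy` in the question, the window version attaches these sums to every row instead of collapsing each group to a single row.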