Spark withColumn and aggregate function drop the other columns in the dataset

Question

I have the following DataFrame, which I have grouped by id, txnId, and date:

+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|   0000004512 |  30095|    11272020|       30| debit|  1000|
|      153|   0000004512 |  30096|    11272020|        0|credit|   200|
|      145|   0000004513 |  30095|    11272020|        0| debit|  4000|
|      135|   0000004512 |  30096|    11272020|        0|credit|  2000|
|      153|   0000004512 |  30097|    11272020|        0| debit|  1000|
|      145|   0000004514 |  30094|    11272020|        0| debit|  1000|
+---------+--------------+-------+------------+---------+------+------+

After grouping, the output is:

+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|
|      153|    0000004512|  30097|    11272020|        0|credit|   500|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|
|      145|    0000004514|  30094|    11272020|        0| debit|  1000|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|
+---------+--------------+-------+------------+---------+------+------+

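I need to add two more columns to the DataFrame, totalcredit and totaldebit, each holding the total amount of the credit or debit type for that group. The output should look like this: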

+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
|       id|         txnId|account|       date|      idl|  type|amount|totalcredit|totaldebit|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
|      153|    0000004512|  30095|   11272020|       30| debit|  1000|          0|      2000|
|      153|    0000004512|  30096|   11272020|        0|credit|   200|        700|         0|
|      153|    0000004512|  30097|   11272020|        0| debit|  1000|          0|      2000|
|      153|    0000004512|  30097|   11272020|        0|credit|   500|        700|         0|
|      145|    0000004513|  30095|   11272020|        0| debit|  4000|          0|      4000|
|      145|    0000004514|  30094|   11272020|        0|credit|  1000|       1000|         0|
|      135|    0000004512|  30096|   11272020|        0|credit|  2000|       2000|         0|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+

I have written the following code to add the new column:

Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
    .groupBy("type")
    .agg(sum("amount"))
    .withColumnRenamed("sum(amount)", "totalcredit");

But this drops the other columns from the dataset. How can I keep the other columns?

Answer 1

You want a conditional sum aggregation over a Window partitioned by id:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.*;

// Window covering every row that shares the same id.
WindowSpec w = Window.partitionBy("id");

Dataset<Row> df3 = df2.withColumn(
    "totalcredit",
    // On credit rows, show the partition's credit total; on all other rows, 0.
    when(
        col("type").equalTo("credit"),
        sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
    ).otherwise(0)
).withColumn(
    "totaldebit",
    // Same pattern for debit rows.
    when(
        col("type").equalTo("debit"),
        sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
    ).otherwise(0)
);

df3.show();

//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account|    date|idl|  type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513|  30095|11272020|  0| debit|  4000|          0|      5000|
//|145| 4514|  30094|11272020|  0| debit|  1000|          0|      5000|
//|135| 4512|  30096|11272020|  0|credit|  2000|       2000|         0|
//|153| 4512|  30095|11272020| 30| debit|  1000|          0|      2000|
//|153| 4512|  30096|11272020|  0|credit|   200|        700|         0|
//|153| 4512|  30097|11272020|  0| debit|  1000|          0|      2000|
//|153| 4512|  30097|11272020|  0|credit|   500|        700|         0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
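
Note that in the output above, both id 145 rows get totaldebit 5000, because the window is partitioned by id only, whereas the expected output in the question shows 4000 for txnId 0000004513. If the totals should be per id, txnId, and date, as the grouping in the question suggests, the window would presumably need `Window.partitionBy("id", "txnId", "date")` instead. To illustrate what the partition key changes, here is a minimal plain-Java sketch (no Spark involved) of the same "sum per partition, then attach to every row" logic. The names `WindowSumSketch`, `Txn`, `Totals`, `withTotals`, and `sampleRows` are hypothetical and exist only for this illustration; the sample rows are copied from the grouped table in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class WindowSumSketch {

    /** One input row, reduced to the columns that affect the totals (hypothetical helper). */
    static final class Txn {
        final int id;
        final String txnId;
        final String date;
        final String type;
        final long amount;

        Txn(int id, String txnId, String date, String type, long amount) {
            this.id = id;
            this.txnId = txnId;
            this.date = date;
            this.type = type;
            this.amount = amount;
        }
    }

    /** An input row plus the two derived columns (hypothetical helper). */
    static final class Totals {
        final Txn row;
        final long totalCredit;
        final long totalDebit;

        Totals(Txn row, long totalCredit, long totalDebit) {
            this.row = row;
            this.totalCredit = totalCredit;
            this.totalDebit = totalDebit;
        }
    }

    /** Keeps every input row and attaches the per-partition total for that row's type. */
    static List<Totals> withTotals(List<Txn> rows, Function<Txn, String> partitionKey) {
        // Pass 1: sum amounts per (partition, type).
        Map<String, Long> sums = new HashMap<>();
        for (Txn t : rows) {
            sums.merge(partitionKey.apply(t) + "/" + t.type, t.amount, Long::sum);
        }
        // Pass 2: emit one output row per input row; the column for the other type is 0.
        List<Totals> out = new ArrayList<>();
        for (Txn t : rows) {
            long total = sums.get(partitionKey.apply(t) + "/" + t.type);
            out.add(new Totals(t,
                    t.type.equals("credit") ? total : 0,
                    t.type.equals("debit") ? total : 0));
        }
        return out;
    }

    /** The seven rows of the grouped table in the question (account and idl omitted). */
    static List<Txn> sampleRows() {
        return Arrays.asList(
                new Txn(153, "0000004512", "11272020", "debit", 1000),
                new Txn(153, "0000004512", "11272020", "credit", 200),
                new Txn(153, "0000004512", "11272020", "debit", 1000),
                new Txn(153, "0000004512", "11272020", "credit", 500),
                new Txn(145, "0000004513", "11272020", "debit", 4000),
                new Txn(145, "0000004514", "11272020", "debit", 1000),
                new Txn(135, "0000004512", "11272020", "credit", 2000));
    }

    public static void main(String[] args) {
        // Partition by id only, then by (id, txnId, date), and print both results.
        List<Totals> byId = withTotals(sampleRows(), t -> String.valueOf(t.id));
        List<Totals> byTxn = withTotals(sampleRows(), t -> t.id + "|" + t.txnId + "|" + t.date);
        for (int i = 0; i < byId.size(); i++) {
            Txn t = byId.get(i).row;
            System.out.println(t.id + " " + t.txnId + " " + t.type + " " + t.amount
                    + " byId(credit=" + byId.get(i).totalCredit + ", debit=" + byId.get(i).totalDebit + ")"
                    + " byTxn(credit=" + byTxn.get(i).totalCredit + ", debit=" + byTxn.get(i).totalDebit + ")");
        }
    }
}
```

With the key set to id alone, the two id 145 rows share a debit total of 5000; with the key set to id, txnId, and date, they get 4000 and 1000 respectively. The id 153 rows come out the same either way, since they all share one txnId and date.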
