Spark with column and aggregate function dropping other columns in the dataset
I have the following dataframe, which I have grouped by id, txnId and date:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512 | 30095| 11272020| 30| debit| 1000|
| 153| 0000004512 | 30096| 11272020| 0|credit| 200|
| 145| 0000004513 | 30095| 11272020| 0| debit| 4000|
| 135| 0000004512 | 30096| 11272020| 0|credit| 2000|
| 153| 0000004512 | 30097| 11272020| 0| debit| 1000|
| 145| 0000004514 | 30094| 11272020| 0| debit| 1000|
+---------+--------------+-------+------------+---------+------+------+
After grouping, the output is:
+---------+--------------+-------+------------+---------+------+------+
| id| txnId|account| date| idl| type|amount|
+---------+--------------+-------+------------+---------+------+------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000|
| 145| 0000004514| 30094| 11272020| 0| debit| 1000|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000|
+---------+--------------+-------+------------+---------+------+------+
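I need to add two more columns to the dataframe, totalcredit and totaldebit, holding the total amount of the credit or debit type within the row's group. The output should look like this: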
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| id| txnId|account| date| idl| type|amount|totalcredit|totaldebit|
+---------+--------------+-------+-----------+---------+------+------+-----------+----------+
| 153| 0000004512| 30095| 11272020| 30| debit| 1000| 0| 2000|
| 153| 0000004512| 30096| 11272020| 0|credit| 200| 700| 0|
| 153| 0000004512| 30097| 11272020| 0| debit| 1000| 0| 2000|
| 153| 0000004512| 30097| 11272020| 0|credit| 500| 700| 0|
| 145| 0000004513| 30095| 11272020| 0| debit| 4000| 0| 4000|
| 145| 0000004514| 30094| 11272020| 0|credit| 1000| 1000| 0|
| 135| 0000004512| 30096| 11272020| 0|credit| 2000| 2000| 0|
+---------+--------------+-------+-----------+---------+------+------+-----------+-----+
I have written the following code to add the new column:
Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
    .groupBy("type")
    .agg(sum("amount"))
    .withColumnRenamed("sum(amount)", "totalcredit");
But it drops the other columns from the dataset. How can I keep the other columns?
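You want a conditional sum aggregation over a Window partitioned by id. Unlike groupBy followed by agg, a window function keeps every input row and all of its columns: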
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

// All rows sharing the same id form one window partition.
WindowSpec w = Window.partitionBy("id");

Dataset<Row> df3 = df2.withColumn(
    "totalcredit",
    when(
        col("type").equalTo("credit"),
        // Inner when() yields null for non-credit rows, and sum() ignores nulls.
        sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
    ).otherwise(0)
).withColumn(
    "totaldebit",
    when(
        col("type").equalTo("debit"),
        // Same pattern, restricted to debit rows.
        sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
    ).otherwise(0)
);

df3.show();
//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account| date|idl| type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513| 30095|11272020| 0| debit| 4000| 0| 5000|
//|145| 4514| 30094|11272020| 0| debit| 1000| 0| 5000|
//|135| 4512| 30096|11272020| 0|credit| 2000| 2000| 0|
//|153| 4512| 30095|11272020| 30| debit| 1000| 0| 2000|
//|153| 4512| 30096|11272020| 0|credit| 200| 700| 0|
//|153| 4512| 30097|11272020| 0| debit| 1000| 0| 2000|
//|153| 4512| 30097|11272020| 0|credit| 500| 700| 0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
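Note that partitioning by id alone gives 5000 for both id 145 rows, whereas the question groups by id, txnId and date and expects separate totals per txnId. Presumably the window should then be built with all three columns, i.e. Window.partitionBy("id", "txnId", "date"). The plain-Java sketch below (no Spark dependency) mimics what that window sum computes, to show why every row survives: the totals are computed per group first and then looked up for each row. The names WindowSumSketch, Txn, totals and sampleRows are hypothetical and exist only for this illustration. The rows are taken from the grouped table in the question, where the 145 / 0000004514 row is a debit (the expected-output table in the question lists it as a credit).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Spark.
class WindowSumSketch {

    /** One row, reduced to the columns that matter for the totals. */
    static final class Txn {
        final String id, txnId, date, type;
        final long amount;

        Txn(String id, String txnId, String date, String type, long amount) {
            this.id = id;
            this.txnId = txnId;
            this.date = date;
            this.type = type;
            this.amount = amount;
        }
    }

    /** Partition key (id, txnId, date) plus the type being summed. */
    static String key(Txn t) {
        return t.id + "|" + t.txnId + "|" + t.date + "|" + t.type;
    }

    /** For each row, in input order, returns {totalcredit, totaldebit}. */
    static List<long[]> totals(List<Txn> rows) {
        // Pass 1: sum the amounts per partition and type.
        Map<String, Long> sums = new HashMap<>();
        for (Txn t : rows) {
            sums.merge(key(t), t.amount, Long::sum);
        }
        // Pass 2: keep every row and attach the total of its own partition and type.
        List<long[]> out = new ArrayList<>();
        for (Txn t : rows) {
            long total = sums.get(key(t));
            if (t.type.equals("credit")) {
                out.add(new long[] {total, 0});
            } else if (t.type.equals("debit")) {
                out.add(new long[] {0, total});
            } else {
                out.add(new long[] {0, 0});
            }
        }
        return out;
    }

    /** The seven rows of the grouped table in the question. */
    static List<Txn> sampleRows() {
        return Arrays.asList(
            new Txn("153", "0000004512", "11272020", "debit", 1000),
            new Txn("153", "0000004512", "11272020", "credit", 200),
            new Txn("153", "0000004512", "11272020", "debit", 1000),
            new Txn("153", "0000004512", "11272020", "credit", 500),
            new Txn("145", "0000004513", "11272020", "debit", 4000),
            new Txn("145", "0000004514", "11272020", "debit", 1000),
            new Txn("135", "0000004512", "11272020", "credit", 2000));
    }

    public static void main(String[] args) {
        List<Txn> rows = sampleRows();
        List<long[]> result = totals(rows);
        for (int i = 0; i < rows.size(); i++) {
            Txn t = rows.get(i);
            System.out.println(t.id + " " + t.txnId + " " + t.type + " " + t.amount
                + " totalcredit=" + result.get(i)[0]
                + " totaldebit=" + result.get(i)[1]);
        }
    }
}
```

With this partitioning the two id 145 rows get 4000 and 1000 as their debit totals instead of the shared 5000 shown above, which is closer to the output the question asks for.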