Spark withColumn and aggregate function dropping other columns in the dataset
I have the following dataframe, which I have grouped by id, txnId, and date:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|
|      145|    0000004514|  30094|    11272020|        0| debit|  1000|
+---------+--------------+-------+------------+---------+------+------+
So after grouping, the output is:
+---------+--------------+-------+------------+---------+------+------+
|       id|         txnId|account|        date|      idl|  type|amount|
+---------+--------------+-------+------------+---------+------+------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|
|      153|    0000004512|  30097|    11272020|        0|credit|   500|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|
|      145|    0000004514|  30094|    11272020|        0| debit|  1000|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|
+---------+--------------+-------+------------+---------+------+------+
I need to add two more columns to the dataframe, holding the total amount of the credit or debit type within each group, so the output should look like:
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
|       id|         txnId|account|        date|      idl|  type|amount|totalcredit|totaldebit|
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
|      153|    0000004512|  30095|    11272020|       30| debit|  1000|          0|      2000|
|      153|    0000004512|  30096|    11272020|        0|credit|   200|        700|         0|
|      153|    0000004512|  30097|    11272020|        0| debit|  1000|          0|      2000|
|      153|    0000004512|  30097|    11272020|        0|credit|   500|        700|         0|
|      145|    0000004513|  30095|    11272020|        0| debit|  4000|          0|      4000|
|      145|    0000004514|  30094|    11272020|        0|credit|  1000|       1000|         0|
|      135|    0000004512|  30096|    11272020|        0|credit|  2000|       2000|         0|
+---------+--------------+-------+------------+---------+------+------+-----------+----------+
I have written the following code to add the new column:
Dataset<Row> df3 = df2.where(df2.col("type").equalTo("credit"))
    .groupBy("type")
    .agg(sum("amount"))
    .withColumnRenamed("sum(amount)", "totalcredit");
but it drops the other columns from the dataset. How can I keep the other columns in the dataset?
You want conditional aggregation over a Window partitioned by id:
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;
WindowSpec w = Window.partitionBy("id");

Dataset<Row> df3 = df2.withColumn(
    "totalcredit",
    when(
        col("type").equalTo("credit"),
        sum(when(col("type").equalTo("credit"), col("amount"))).over(w)
    ).otherwise(0)
).withColumn(
    "totaldebit",
    when(
        col("type").equalTo("debit"),
        sum(when(col("type").equalTo("debit"), col("amount"))).over(w)
    ).otherwise(0)
);

df3.show();
//+---+-----+-------+--------+---+------+------+-----------+----------+
//| id|txnId|account|    date|idl|  type|amount|totalcredit|totaldebit|
//+---+-----+-------+--------+---+------+------+-----------+----------+
//|145| 4513|  30095|11272020|  0| debit|  4000|          0|      5000|
//|145| 4514|  30094|11272020|  0| debit|  1000|          0|      5000|
//|135| 4512|  30096|11272020|  0|credit|  2000|       2000|         0|
//|153| 4512|  30095|11272020| 30| debit|  1000|          0|      2000|
//|153| 4512|  30096|11272020|  0|credit|   200|        700|         0|
//|153| 4512|  30097|11272020|  0| debit|  1000|          0|      2000|
//|153| 4512|  30097|11272020|  0|credit|   500|        700|         0|
//+---+-----+-------+--------+---+------+------+-----------+----------+
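The per-partition conditional sum that `sum(when(...)).over(w)` computes can be sanity-checked outside Spark. A minimal plain-Java sketch (no Spark dependency; the `Txn` record and field names are illustrative, not part of the original code) that reproduces the per-`id` credit/debit totals from the sample data:

```java
import java.util.*;
import java.util.stream.*;

public class ConditionalSums {
    // Illustrative record standing in for one row of the dataframe
    record Txn(int id, String type, int amount) {}

    public static void main(String[] args) {
        List<Txn> rows = List.of(
            new Txn(153, "debit", 1000),
            new Txn(153, "credit", 200),
            new Txn(153, "debit", 1000),
            new Txn(153, "credit", 500),
            new Txn(145, "debit", 4000),
            new Txn(145, "debit", 1000),
            new Txn(135, "credit", 2000)
        );

        // Equivalent of sum(when(type == "credit", amount)).over(partitionBy("id")):
        // filter rows by type, then sum amounts within each id partition
        Map<Integer, Integer> totalCredit = rows.stream()
            .filter(t -> t.type().equals("credit"))
            .collect(Collectors.groupingBy(Txn::id, Collectors.summingInt(Txn::amount)));
        Map<Integer, Integer> totalDebit = rows.stream()
            .filter(t -> t.type().equals("debit"))
            .collect(Collectors.groupingBy(Txn::id, Collectors.summingInt(Txn::amount)));

        System.out.println(totalCredit.getOrDefault(153, 0)); // 200 + 500 = 700
        System.out.println(totalDebit.getOrDefault(145, 0));  // 4000 + 1000 = 5000
    }
}
```

The values match the `totalcredit`/`totaldebit` columns shown above; unlike the `groupBy` in the question, the window version attaches these sums to every row instead of collapsing each group to a single row.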