Map customer and account data to a case class using Spark/Scala
So I have a case class CustomerData and a case class AccountData, as follows:
case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)
case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)
I need to join these two so they form the following case class:
case class CustomerAccountOutput(
  customerId: String,
  forename: String,
  surname: String,
  // Accounts for this customer
  accounts: Seq[AccountData],
  // Statistics of the accounts
  numberAccounts: Int,
  totalBalance: Long,
  averageBalance: Double
)
I need to ensure that if a null appears in accountId or balance, the number of accounts is 0 and the total balance and average balance are both null. Replacing the nulls with 0 is also acceptable.
The final result should look like this:
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|customerId|forename |surname |accounts |numberAccounts|totalBalance|averageBalance |
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|IND0113 |Leonard |Ball |[[IND0113,ACC0577,531]] |1 |531 |531.0 |
|IND0277 |Victoria |Hodges |[[IND0277,null,null]] |0 |null |null |
|IND0055 |Ella |Taylor |[[IND0055,ACC0156,137], [IND0055,ACC0117,148]] |2 |285 |142.5 |
|IND0129   |Christopher|Young   |[[IND0129,null,null]]                                                |0             |null        |null             |
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
I have already joined the two case classes; here is the code:
val customerDS = customerDF.as[CustomerData]
val accountDS = accountDF.withColumn("balance",'balance.cast("long")).as[AccountData]
//END GIVEN CODE
val customerAccountsDS = customerDF.join(accountDF, customerDF("customerId") === accountDF("customerId"), "leftouter").drop(accountDF.col("customerId"))
How do I get the above result?
You should be able to do this by using the concat_ws and collect_list functions in Spark.
// Creating sample data (assumes a Spark session whose implicits are in scope)
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $ column syntax

case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)
case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)

val customercolumns = Seq("customerId", "forename", "surname")
val acccolumns = Seq("customerId", "accountId", "balance")
val custdata = Seq(("IND0113", "Leonard", "Ball"), ("IND0277", "Victoria", "Hodges"), ("IND0055", "Ella", "Taylor"), ("IND0129", "Christopher", "Young")).toDF(customercolumns: _*).as[CustomerData]
val acctdata = Seq(("IND0113", "ACC0577", 531L), ("IND0055", "ACC0156", 137L), ("IND0055", "ACC0117", 148L)).toDF(acccolumns: _*).as[AccountData]

// Left-outer join keeps customers that have no accounts; drop the duplicate join key
val customerAccountsDS = custdata.join(acctdata, custdata("customerId") === acctdata("customerId"), "leftouter").drop(acctdata.col("customerId"))

// concat_ws skips nulls, so a customer with no accounts ends up with just their id
val result = customerAccountsDS.withColumn("accounts", concat_ws(",", $"customerId", $"accountId", $"balance"))
val finalresult = result.groupBy("customerId", "forename", "surname").agg(collect_list($"accounts"))
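
This gives you comma-separated strings, though, not the Seq[AccountData] plus the statistics the question asks for. Below is a sketch (untested against your data, reusing the customerAccountsDS from the join above) of one way to get the exact expected output: count ignores nulls, so a customer whose left join produced a null accountId gets numberAccounts = 0, and sum/avg over an all-null balance group return null, matching the rows in the expected table.

import org.apache.spark.sql.functions._

// Keep each account as a struct so the accounts column matches Seq[AccountData],
// and compute the statistics in the same aggregation pass.
val aggregated = customerAccountsDS
  .groupBy($"customerId", $"forename", $"surname")
  .agg(
    // For customers with no accounts this yields [[custId,null,null]], as in the expected output
    collect_list(struct($"customerId", $"accountId", $"balance")).as("accounts"),
    count($"accountId").cast("int").as("numberAccounts"), // count skips nulls, so this is 0
    sum($"balance").as("totalBalance"),                   // null when every balance is null
    avg($"balance").as("averageBalance")                  // likewise null
  )
aggregated.show(false)

One caveat if you then want a typed Dataset via .as[CustomerAccountOutput]: the null balance inside accounts cannot be decoded into a Scala Long, so you would either change balance to Option[Long] in AccountData, or coalesce the nulls to 0 first, which the question says is also acceptable.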