
map customer and account data to a case class using spark/scala

So I have a case class for customer data and a case class for customer account data, as follows:

case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)

case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)

I need to join these two to form the following case class:

case class CustomerAccountOutput(
  customerId: String,
  forename: String,
  surname: String,
  //Accounts for this customer
  accounts: Seq[AccountData],
  //Statistics of the accounts
  numberAccounts: Int,
  totalBalance: Long,
  averageBalance: Double
)

I need to show that if null appears in accountId or balance, then numberAccounts should be 0, totalBalance null, and averageBalance null (a sketch of such an aggregation follows the expected output below). Replacing the null with 0 is also accepted.

The final result should be something like this:

+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|customerId|forename   |surname |accounts                                                             |numberAccounts|totalBalance|averageBalance   |
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
|IND0113   |Leonard    |Ball    |[[IND0113,ACC0577,531]]                                              |1             |531         |531.0            |
|IND0277   |Victoria   |Hodges  |[[IND0277,null,null]]                                                |0             |null        |null             |
|IND0055   |Ella       |Taylor  |[[IND0055,ACC0156,137], [IND0055,ACC0117,148]]                       |2             |285         |142.5            |
|IND0129   |Christopher|Young   |[[IND0129,null,null]]                                                |0             |null        |null             |
+----------+-----------+--------+---------------------------------------------------------------------+--------------+------------+-----------------+
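For reference, this null handling falls out of plain aggregate functions: count skips nulls, while sum and avg return null over an all-null group. A minimal sketch, assuming the left-joined customer/account DataFrame is named joined (a hypothetical name):

import org.apache.spark.sql.functions._

// `joined` stands for the customer/account left outer join (see below).
// count() skips nulls, so customers with no accounts get numberAccounts = 0;
// sum() and avg() return null over an all-null group, as in the table above.
val stats = joined
  .groupBy("customerId", "forename", "surname")
  .agg(
    count($"accountId").cast("int").as("numberAccounts"),
    sum($"balance").as("totalBalance"),
    avg($"balance").as("averageBalance")
  )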

I have already got the two joined, and here is the code:

val customerDS = customerDF.as[CustomerData]
val accountDS = accountDF.withColumn("balance", 'balance.cast("long")).as[AccountData]
//END GIVEN CODE

val customerAccountsDS = customerDF
  .join(accountDF, customerDF("customerId") === accountDF("customerId"), "leftouter")
  .drop(accountDF.col("customerId"))

How do I go about getting the above result?

You should be able to do it by using the concat_ws and collect_list functions in Spark.

//Creating sample data
case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)

case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and $ outside the spark-shell

val customercolumns = Seq("customerId", "forename", "surname")
val acccolumns = Seq("customerId", "accountId", "balance")
val custdata = Seq(("IND0113", "Leonard", "Ball"), ("IND0277", "Victoria", "Hodges"), ("IND0055", "Ella", "Taylor"), ("IND0129", "Christopher", "Young")).toDF(customercolumns: _*).as[CustomerData]
val acctdata = Seq(("IND0113", "ACC0577", 531), ("IND0055", "ACC0156", 137), ("IND0055", "ACC0117", 148)).toDF(acccolumns: _*).as[AccountData]

// Left outer join keeps customers without accounts; drop the duplicate key column
val customerAccountsDS = custdata
  .join(acctdata, custdata("customerId") === acctdata("customerId"), "leftouter")
  .drop(acctdata.col("customerId"))

// Render each joined row as "customerId,accountId,balance", then gather per customer
val result = customerAccountsDS.withColumn("accounts", concat_ws(",", $"customerId", $"accountId", $"balance"))
val finalresult = result.groupBy("customerId", "forename", "surname").agg(collect_list($"accounts"))
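The collected list above holds comma-joined strings; to get the accounts column as structs plus the statistics in the shape of CustomerAccountOutput, the same grouped aggregation can be extended. A sketch (not part of the original answer) that fills the all-null sum/avg with 0, which the question says is acceptable:

import org.apache.spark.sql.functions._

// collect_list keeps the null-filled structs produced by the left outer join,
// count() skips nulls (so unmatched customers get 0), and coalesce turns the
// all-null sum/avg into 0, since the output case class fields are not nullable.
val outputDF = customerAccountsDS
  .groupBy("customerId", "forename", "surname")
  .agg(
    collect_list(struct($"customerId", $"accountId", $"balance")).as("accounts"),
    count($"accountId").cast("int").as("numberAccounts"),
    coalesce(sum($"balance"), lit(0L)).as("totalBalance"),
    coalesce(avg($"balance"), lit(0.0)).as("averageBalance")
  )

Turning this into a Dataset[CustomerAccountOutput] via .as[CustomerAccountOutput] additionally requires dealing with the null balance inside accounts (for example filling it with 0 as well, or declaring it as Option[Long]), because AccountData.balance is a Long and cannot hold null.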

You can see the output as below (the screenshot from the original answer is not reproduced here):
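For the sample data above, finalresult.show(false) should print something like this (row order and in-list element order are not guaranteed; note that concat_ws skips null arguments, so customers without accounts collapse to just their customerId):

+----------+-----------+-------+------------------------------------------+
|customerId|forename   |surname|collect_list(accounts)                    |
+----------+-----------+-------+------------------------------------------+
|IND0113   |Leonard    |Ball   |[IND0113,ACC0577,531]                     |
|IND0277   |Victoria   |Hodges |[IND0277]                                 |
|IND0055   |Ella       |Taylor |[IND0055,ACC0156,137, IND0055,ACC0117,148]|
|IND0129   |Christopher|Young  |[IND0129]                                 |
+----------+-----------+-------+------------------------------------------+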
