
Join two spark Dataframe using the nested column and update one of the columns

I am working on a requirement where I receive a small table in the form of a CSV file, as follows:

root
 |-- ACCT_NO: string (nullable = true)
 |-- SUBID: integer (nullable = true)
 |-- MCODE: string (nullable = true)
 |-- NewClosedDate: timestamp (nullable = true)

We also have a very big external Hive table in Avro format, stored in HDFS, with the following schema:

root
 |-- accountlinks: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- account: struct (nullable = true)
 |    |    |    |-- acctno: string (nullable = true)
 |    |    |    |-- subid: string (nullable = true)
 |    |    |    |-- mcode: string (nullable = true)
 |    |    |    |-- openeddate: string (nullable = true)
 |    |    |    |-- closeddate: string (nullable = true)

Now, the requirement is to look up the external Hive table based on the three columns from the CSV file: ACCT_NO, SUBID and MCODE. If a match is found, update accountlinks.account.closeddate with NewClosedDate from the CSV file.

I have already written the following code to explode the required columns and join them with the small table, but I am not really sure how to update the closeddate field (currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot easily use withColumn to populate it. In addition, the schema and column names cannot be changed, as these files are linked to an external Hive table.

 val df = spark.sql("select * from db.table where archive='201711'")

 // explode the nested fields into flat columns so they can be joined with the CSV
 val ExtractedColumn = df
   .coalesce(150)
   .withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
   .withColumn("SUBID", explode($"accountlinks.account.subid"))
   .withColumn("MCODE", explode($"accountlinks.account.mcode"))

val ReferenceData = spark.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load("file.csv")

val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO","SUBID","MCODE") , "left")
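
To illustrate the point above: a plain withColumn with a dotted path does not reach into the struct. The dotted name is treated as a brand-new top-level column, so the structs inside the accountlinks array stay untouched (the date literal below is only a made-up example):

 import org.apache.spark.sql.functions.lit

 // This adds a flat column literally named "accountlinks.account.closeddate";
 // it does NOT update the nested field inside the array of structs.
 val attempt = df.withColumn("accountlinks.account.closeddate", lit("2017-11-30"))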

All you need is to explode the accountlinks array and then join the two dataframes like this:

val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF  = explodedDF.join(ReferenceData, joinCondition, "left")

Now you can update the account struct column as shown below, and use collect_list to get back the array structure:

val FinalData = joinDF
  .withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
           $"account.openeddate", $"NewClosedDate".alias("closeddate")))
  .groupBy().agg(collect_list($"account").alias("accountlinks"))

The idea is to create a new struct with all the fields from account, except closeddate, which you take from the NewClosedDate column.

If the struct contains many fields, you can use a for-comprehension (or a map over the struct's field names) to pick up all the fields except the closed date, instead of typing them all out, as sketched below.
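
A minimal sketch of that approach, assuming the joined dataframe is joinDF and the struct column is named account as in the snippets above (updatedDF is just an illustrative name):

 import org.apache.spark.sql.functions.struct
 import org.apache.spark.sql.types.StructType

 // Field names of the account struct, read from the joined dataframe's schema
 val accountFields = joinDF.schema("account").dataType.asInstanceOf[StructType].fieldNames

 // Keep every field as-is, except closeddate, which is taken from NewClosedDate
 val rebuiltAccount = struct(
   accountFields.map {
     case "closeddate" => $"NewClosedDate".alias("closeddate")
     case name         => $"account".getField(name).alias(name)
   }: _*
 )

 val updatedDF = joinDF.withColumn("account", rebuiltAccount)

The collect_list step to rebuild the accountlinks array is then the same as above.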
