Convert a list of DataFrames into a single DataFrame, joined on a specific column, in Scala
I am trying to convert a list of DataFrames into a single DataFrame, as shown below, where dfList is a List[sql.DataFrame]:
dfList = List([ID: bigint, A: string], [ID: bigint, B: string], [ID: bigint, C: string], [ID: bigint, D: string])
dfList = List(
+------+---+  +------+------+  +------+-------+  +------+---+
|    ID|  A|  |    ID|     B|  |    ID|      C|  |    ID|  D|
+------+---+  +------+------+  +------+-------+  +------+---+
|  9574|  F|  |  9574|005912|  |  9574|2016022|  |  9574| VD|
|  9576|  F|  |  9576|005912|  |  9576|2016022|  |  9576| VD|
|  9578|  F|  |  9578|005912|  |  9578|2016022|  |  9578| VD|
|  9580|  F|  |  9580|005912|  |  9580|2016022|  |  9580| VD|
|  9582|  F|  |  9582|005912|  |  9582|2016022|  |  9582| VD|
+------+---+, +------+------+, +------+-------+, +------+---+ )
Expected output:
+------+---+------+-------+---+
|    ID|  A|     B|      C|  D|
+------+---+------+-------+---+
|  9574|  F|005912|2016022| 00|
|  9576|  F|005912|2016022| 01|
|  9578|  F|005912|2016022| 20|
|  9580|  F|005912|2016022| 19|
|  9582|  F|005912|2016022| 89|
+------+---+------+-------+---+
You will need a combination of foldLeft and join:
scala> val dfList = ('a' to 'd').map(col => (1 to 5).zip(col.toInt to col.toInt + 4).toDF("ID", col.toString)).toList
dfList: List[org.apache.spark.sql.DataFrame] = List([ID: int, a: int], [ID: int, b: int], [ID: int, c: int], [ID: int, d: int])
This gives me the following DataFrames:
+---+---+ +---+---+ +---+---+ +---+---+
| ID| a| | ID| b| | ID| c| | ID| d|
+---+---+ +---+---+ +---+---+ +---+---+
| 1| 97| | 1| 98| | 1| 99| | 1|100|
| 2| 98| | 2| 99| | 2|100| | 2|101|
| 3| 99| | 3|100| | 3|101| | 3|102|
| 4|100| | 4|101| | 4|102| | 4|103|
| 5|101| | 5|102| | 5|103| | 5|104|
+---+---+ +---+---+ +---+---+ +---+---+
scala> val joinedDF = dfList.tail.foldLeft(dfList.head)((accDF, newDF) => accDF.join(newDF, Seq("ID")))
joinedDF: org.apache.spark.sql.DataFrame = [ID: int, a: int ... 3 more fields]
scala> joinedDF.show
+---+---+---+---+---+
| ID| a| b| c| d|
+---+---+---+---+---+
| 1| 97| 98| 99|100|
| 2| 98| 99|100|101|
| 3| 99|100|101|102|
| 4|100|101|102|103|
| 5|101|102|103|104|
+---+---+---+---+---+
In Scala, fold is a method that reduces a collection to a single element. In this case, we start with the head of the list (dfList.head) and then join each element in the tail of the list (dfList.tail) onto it to obtain the final DataFrame. accDF is the accumulated DataFrame, passed along from iteration to iteration, and newDF is the next (new) DataFrame to be joined in.
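The same foldLeft pattern can be sketched on a plain Scala collection, with no Spark involved, where the running accumulator plays the role of accDF and each element plays the role of newDF:

```scala
// Minimal foldLeft sketch: the accumulator (acc) carries the result
// from iteration to iteration, just as accDF carries the joined
// DataFrame; x is the next element, like newDF.
val total = List(1, 2, 3, 4).foldLeft(0)((acc, x) => acc + x)
// total: Int = 10
```

The only difference in the Spark version is that the combining function is a join on "ID" instead of addition.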
@evan058 provided a working solution, but I would add that for a parallelized operation, reduce may be the better choice:
val joinedDF = dfList.reduce((accDF, nextDF) => accDF.join(nextDF, Seq("ID")))
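One caveat not covered by either answer: both dfList.head and reduce throw on an empty list. If the list might be empty, reduceOption is a safer variant; a sketch, assuming the same dfList as above:

```scala
import org.apache.spark.sql.DataFrame

// reduceOption returns None for an empty list instead of throwing,
// forcing the caller to handle the empty case explicitly.
val joinedOpt: Option[DataFrame] =
  dfList.reduceOption((accDF, nextDF) => accDF.join(nextDF, Seq("ID")))
```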