I'm trying to merge two DataFrames into one that has a new column containing the values of the other DataFrame as an array. Does anyone know how this can be implemented in Scala?
//Schema 1
|-- PRIM_KEY: decimal(20,0) (nullable = true)
|-- SOME_DECIMAL: decimal(20,0) (nullable = true)
|-- SOME_INTEGER: integer (nullable = true)
//Schema 2
|-- PRIM_KEY: decimal(20,0) (nullable = true)
|-- COLUMN1: string (nullable = false)
|-- COLUMN2: string (nullable = false)
//Resulting schema
|-- PRIM_KEY: decimal(20,0) (nullable = true)
|-- SOME_DECIMAL: decimal(20,0) (nullable = true)
|-- SOME_INTEGER: integer (nullable = true)
|-- an_array: array (nullable = true)
| |-- element: String (containsNull = false)
One approach would be to create a UDF that combines two lists into one, perform a groupBy on the joined DataFrames, and apply the UDF in the aggregation, as in the following:
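import org.apache.spark.sql.functions.{col, collect_list, udf}
import spark.implicits._ // needed for toDF; `spark` is the SparkSession (predefined in spark-shell)

val df1 = Seq(
  (1, 100.1, 10),
  (2, 200.2, 20)
).toDF("pk", "col1", "col2")

val df2 = Seq(
  (1, "a1", "b1"),
  (1, "c1", "d1"),
  (2, "a2", "b2")
).toDF("pk", "str_col1", "str_col2")

// UDF that concatenates two lists of strings into one
val combineLists = udf(
  (a: Seq[String], b: Seq[String]) => a ++ b
)

// Join on pk, group by df1's columns, and collect df2's string columns into a single array
val df3 = df1.join(df2, Seq("pk")).
  groupBy(df1("pk"), df1("col1"), df1("col2")).agg(
    combineLists(collect_list(df2("str_col1")), collect_list(df2("str_col2"))).alias("arr_col")
  ).
  select(df1("pk"), df1("col1"), df1("col2"), col("arr_col"))

df3.show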
+---+-----+----+----------------+
| pk| col1|col2|         arr_col|
+---+-----+----+----------------+
|  1|100.1|  10|[c1, a1, d1, b1]|
|  2|200.2|  20|        [a2, b2]|
+---+-----+----+----------------+
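The grouping step above can also be sketched without a UDF. This is a minimal sketch, assuming Spark 2.4 or later (where the built-in flatten function is available) and a local SparkSession created only for illustration. It packs the two string columns of each row into an array, collects those arrays per group, and flattens them. As with collect_list in general, the order of elements across rows is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, collect_list, flatten}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("array-merge-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 100.1, 10), (2, 200.2, 20)).toDF("pk", "col1", "col2")
val df2 = Seq((1, "a1", "b1"), (1, "c1", "d1"), (2, "a2", "b2")).toDF("pk", "str_col1", "str_col2")

val df3 = df1.join(df2, Seq("pk"))
  .groupBy("pk", "col1", "col2")
  .agg(
    // array(...) builds one array per row, collect_list gathers them per group
    // (array<array<string>>), and flatten merges them into a single array<string>
    flatten(collect_list(array(col("str_col1"), col("str_col2")))).alias("arr_col")
  )

df3.printSchema()
```

Because the join is an inner join, rows of df1 without a match in df2 are dropped, exactly as in the UDF version.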
The result you are seeking:
|-- PRIM_KEY: decimal(20,0) (nullable = true)
|-- SOME_DECIMAL: decimal(20,0) (nullable = true)
|-- SOME_INTEGER: integer (nullable = true)
|-- an_array: array (nullable = true)
| |-- element: String (containsNull = false)
A note first:
array (nullable = true) in that output is a collection type (Spark's ArrayType), not a primitive data type. It is always declared together with an element type, which here is string.
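One way is to concatenate the string columns of the second dataset with concat_ws and then join the result to the first dataset on the key. A column taken from one DataFrame cannot be attached to a different DataFrame with withColumn, so a join is needed. The example assumes both datasets carry the PRIM_KEY column from the question.
E.g.:
val tmpDf = test2Df.select(col("PRIM_KEY"), concat_ws(",", col("NAME"), col("CLASS")).as("ARRAY_COLUMN"))
val mergedDf = test1Df.join(tmpDf, Seq("PRIM_KEY"))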
I don't understand your use case for an array-typed column, but you can instead use the concatenated result and convert it into an array.
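Converting the concatenated string into an array can be sketched with the built-in split function. This is only an illustration: the sample rows are made up, the column names NAME and CLASS follow the example above, and the local SparkSession is an assumption. Note that split treats its second argument as a regular expression, and that any value which itself contains a comma would be split into extra elements.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, split}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the second dataset
val test2Df = Seq(
  (1, "alice", "math"),
  (2, "bob", "physics")
).toDF("PRIM_KEY", "NAME", "CLASS")

val withArray = test2Df
  // Concatenate the string columns with a comma separator
  .withColumn("JOINED", concat_ws(",", col("NAME"), col("CLASS")))
  // Split the concatenated string back into an array<string> column
  .withColumn("ARRAY_COLUMN", split(col("JOINED"), ","))

withArray.printSchema()
```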
Hope this helps. I know I'm a bit late to answer, but I'll be happy if it still helps you.