
Multiple comma-delimited values to separate rows in Spark Java

I have the dataset below. Column_1 is comma-separated, and Column_2 and Column_3 are colon-separated. All are string columns. Every comma-separated value from Column_1 should become a separate row, with the equivalent value from Column_2 or Column_3 populated alongside it. For any given row, either Column_2 or Column_3 is populated; both are never populated at the same time.

If the number of values in Column_1 doesn't match the number of equivalent values in Column_2 or Column_3, then we have to populate null (see Column_1 rows I,J and K,L).

Column_1 Column_2 Column_3
A,B,C,D  NULL     N1:N2:N3:N4
E,F      N5:N6    NULL
G        NULL     N7
H        NULL     NULL
I,J      NULL     N8
K,L      N9       NULL

I have to convert the delimited values into rows as below.

Column_1 Column_2
A        N1
B        N2
C        N3
D        N4
E        N5
F        N6
G        N7
H        NULL
I        N8
J        NULL
K        N9
L        NULL

Is there a way to achieve this with the Java Spark API without using UDFs?

Here is a Scala solution; it should be similar in Java. You can combine Column_2 and Column_3 using coalesce, split the result on the appropriate delimiter, use arrays_zip to transpose, and explode the zipped arrays into rows. When both columns are NULL (row H), split returns NULL, so the outer coalesce substitutes an empty array and arrays_zip pads the missing positions with null.

import org.apache.spark.sql.functions._

df.select(
    explode(
        arrays_zip(
            split(col("Column_1"), ","),
            coalesce(
                split(coalesce(col("Column_2"), col("Column_3")), ":"),
                array()  // both inputs NULL: empty array, arrays_zip pads with null
            )
        )
    ).alias("result")
).select(
    "result.*"
).toDF(
    "Column_1", "Column_2"
).show

+--------+--------+
|Column_1|Column_2|
+--------+--------+
|       A|      N1|
|       B|      N2|
|       C|      N3|
|       D|      N4|
|       E|      N5|
|       F|      N6|
|       G|      N7|
|       H|    null|
|       I|      N8|
|       J|    null|
|       K|      N9|
|       L|    null|
+--------+--------+
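
For completeness, here is a minimal Java sketch of the same approach. It is a sketch, not the answerer's code: it assumes Spark 2.4+ (for arrays_zip), is meant to run inside a main method with a hypothetical local SparkSession, and rebuilds the sample data inline so it is self-contained.

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// Recreate the sample data from the question.
StructType schema = new StructType()
    .add("Column_1", DataTypes.StringType)
    .add("Column_2", DataTypes.StringType)
    .add("Column_3", DataTypes.StringType);

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
    RowFactory.create("A,B,C,D", null, "N1:N2:N3:N4"),
    RowFactory.create("E,F", "N5:N6", null),
    RowFactory.create("G", null, "N7"),
    RowFactory.create("H", null, null),
    RowFactory.create("I,J", null, "N8"),
    RowFactory.create("K,L", "N9", null)
), schema);

// Same logic as the Scala version: zip the two split arrays, explode, flatten.
df.select(
        explode(
            arrays_zip(
                split(col("Column_1"), ","),
                coalesce(
                    split(coalesce(col("Column_2"), col("Column_3")), ":"),
                    array())))  // both NULL: empty array, arrays_zip pads with null
        .alias("result"))
    .select("result.*")
    .toDF("Column_1", "Column_2")
    .show();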

Here's another way: using the transform function, you can iterate over the elements of Column_1 and create a map that you explode afterwards:

import org.apache.spark.sql.functions._

df.withColumn(
    "mappings",
    split(coalesce(col("Column_2"), col("Column_3")), ":")
).selectExpr(
    // pair each Column_1 element with the mapping value at the same index
    "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings"
).selectExpr(
    // unpack each single-entry map into key/value columns
    "explode(mappings) as (Column_1, Column_2)"
).show()

//+--------+--------+
//|Column_1|Column_2|
//+--------+--------+
//|       A|      N1|
//|       B|      N2|
//|       C|      N3|
//|       D|      N4|
//|       E|      N5|
//|       F|      N6|
//|       G|      N7|
//|       H|    null|
//|       I|      N8|
//|       J|    null|
//|       K|      N9|
//|       L|    null|
//+--------+--------+
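
The Java translation of this variant is mostly mechanical, because the transform/map logic lives in SQL expression strings that selectExpr accepts unchanged. A sketch under the same assumptions, reusing the df built in the previous Java snippet (transform in SQL also needs Spark 2.4+):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// `df` is the same Dataset<Row> built in the previous Java sketch.
Dataset<Row> exploded = df.withColumn(
        "mappings",
        split(coalesce(col("Column_2"), col("Column_3")), ":"))
    .selectExpr(
        // pair each Column_1 element with the mapping value at the same index
        "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings")
    .selectExpr(
        // unpack each single-entry map into key/value columns
        "explode(mappings) as (Column_1, Column_2)");

exploded.show();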
