
Multiple comma-delimited values to separate rows in Spark Java

I have the dataset below. Column_1 is comma-separated, while Column_2 and Column_3 are colon-separated. All are string columns. Every comma-separated value in Column_1 should become a separate row, with the corresponding value from Column_2 or Column_3 populated alongside it. Only one of Column_2 or Column_3 is populated per row; they are never both populated.

If the number of values in Column_1 doesn't match the number of corresponding values in Column_2 or Column_3, then we have to populate null (Column_1: I,J and K,L).

Column_1 Column_2 Column_3
A,B,C,D  NULL     N1:N2:N3:N4
E,F      N5:N6    NULL
G        NULL     N7
H        NULL     NULL
I,J      NULL     N8
K,L      N9       NULL
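For reference, a minimal sketch (assuming a local SparkSession; the variable names spark and df are illustrative) that builds this sample dataset in Java:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// all three columns are strings, per the question
StructType schema = new StructType()
    .add("Column_1", DataTypes.StringType)
    .add("Column_2", DataTypes.StringType)
    .add("Column_3", DataTypes.StringType);

Dataset<Row> df = spark.createDataFrame(Arrays.asList(
    RowFactory.create("A,B,C,D", null, "N1:N2:N3:N4"),
    RowFactory.create("E,F", "N5:N6", null),
    RowFactory.create("G", null, "N7"),
    RowFactory.create("H", null, null),
    RowFactory.create("I,J", null, "N8"),
    RowFactory.create("K,L", "N9", null)
), schema);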

I have to convert the delimited values into rows as below.

Column_1 Column_2
A        N1
B        N2
C        N3
D        N4
E        N5
F        N6
G        N7
H        NULL
I        N8
J        NULL
K        N9
L        NULL

Is there a way to achieve this with the Java Spark API without using UDFs?

Scala solution; it should be similar in Java. You can combine Column_2 and Column_3 using coalesce, split them on the appropriate delimiter, zip the two arrays together with arrays_zip, and explode the result into rows.

import org.apache.spark.sql.functions._

df.select(
    explode(
        arrays_zip(
            // split the comma-separated keys into an array
            split(col("Column_1"), ","),
            // take whichever of Column_2/Column_3 is non-null, split on ":",
            // and fall back to an empty array when both are null (row "H")
            coalesce(
                split(coalesce(col("Column_2"), col("Column_3")), ":"),
                array()
            )
        )
    ).alias("result")
).select(
    "result.*"
).toDF(
    "Column_1", "Column_2"
).show

+--------+--------+
|Column_1|Column_2|
+--------+--------+
|       A|      N1|
|       B|      N2|
|       C|      N3|
|       D|      N4|
|       E|      N5|
|       F|      N6|
|       G|      N7|
|       H|    null|
|       I|      N8|
|       J|    null|
|       K|      N9|
|       L|    null|
+--------+--------+
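Since the question asks for Java, here is a hedged sketch of the same approach with the Java API (assuming Spark 2.4+ for arrays_zip, and that df is the input Dataset<Row>):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> result = df.select(
    explode(
        arrays_zip(
            split(col("Column_1"), ","),
            coalesce(
                split(coalesce(col("Column_2"), col("Column_3")), ":"),
                array()  // empty array keeps rows where both columns are null
            )
        )
    ).alias("result")
).select("result.*")
 .toDF("Column_1", "Column_2");

result.show();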

Here's another way: using the transform higher-order function, you can iterate over the elements of Column_1 with their index, build a map pairing each element with the value at the same position, and explode that map afterwards:

import org.apache.spark.sql.functions._

df.withColumn(
    // whichever of Column_2/Column_3 is non-null, split into an array
    "mappings",
    split(coalesce(col("Column_2"), col("Column_3")), ":")
).selectExpr(
    // pair each element of Column_1 with the value at the same index;
    // a missing index yields null, which handles the I,J and K,L cases
    "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings"
).selectExpr(
    "explode(mappings) as (Column_1, Column_2)"
).show()

//+--------+--------+
//|Column_1|Column_2|
//+--------+--------+
//|       A|      N1|
//|       B|      N2|
//|       C|      N3|
//|       D|      N4|
//|       E|      N5|
//|       F|      N6|
//|       G|      N7|
//|       H|    null|
//|       I|      N8|
//|       J|    null|
//|       K|      N9|
//|       L|    null|
//+--------+--------+
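This second approach ports to Java almost verbatim, because selectExpr takes plain SQL strings. A hedged sketch, again assuming Spark 2.4+ (required for transform) and an input Dataset<Row> named df:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> out = df
    .withColumn("mappings", split(coalesce(col("Column_2"), col("Column_3")), ":"))
    // identical SQL expressions to the Scala version above
    .selectExpr("explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings")
    .selectExpr("explode(mappings) as (Column_1, Column_2)");

out.show();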
