Multiple comma delimited values to separate rows in Spark Java
I have the below dataset. Column_1 is comma-separated, while Column_2 and Column_3 are colon-separated. All are string columns. Every comma-separated value from Column_1 should become a separate row in Column_1, and the equivalent values from Column_2 or Column_3 should be populated alongside it. Either Column_2 or Column_3 will be populated; both will never be populated at the same time.
If the number of values in Column_1 doesn't match the number of equivalent values in Column_2 or Column_3, then we have to populate null (see the Column_1 rows I,J and K,L).
Column_1 Column_2 Column_3
A,B,C,D NULL N1:N2:N3:N4
E,F N5:N6 NULL
G NULL N7
H NULL NULL
I,J NULL N8
K,L N9 NULL
I have to convert the delimited values into rows as below.
Column_1 Column_2
A N1
B N2
C N3
D N4
E N5
F N6
G N7
H NULL
I N8
J NULL
K N9
L NULL
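Stripped of Spark, the per-row transformation being asked for is a split of Column_1 followed by an index-aligned pairing with the colon-split values, padding with null where the counts differ. A minimal plain-Java sketch of that pairing logic (the `ZipPad` class and `zipPad` method are hypothetical names for illustration, not part of any Spark API):

```java
import java.util.*;

public class ZipPad {
    // Pair each value from the comma-split column with the colon-split
    // value at the same index, padding with null when lengths differ.
    static List<String[]> zipPad(String col1, String col2or3) {
        String[] keys = col1.split(",");
        String[] vals = (col2or3 == null) ? new String[0] : col2or3.split(":");
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            rows.add(new String[]{keys[i], i < vals.length ? vals[i] : null});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Row "I,J" with Column_3 = "N8": J has no partner, so it gets null.
        System.out.println(Arrays.deepToString(zipPad("I,J", "N8").toArray()));
        // [[I, N8], [J, null]]
    }
}
```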
Is there a way to achieve this in the Java Spark API without using UDFs?
Scala solution... should be similar in Java. You can combine columns 2 and 3 using coalesce, split them with the appropriate delimiter, use arrays_zip to transpose, and explode the results into rows.
df.select(
  explode(
    arrays_zip(
      split(col("Column_1"), ","),
      coalesce(
        split(coalesce(col("Column_2"), col("Column_3")), ":"),
        array()
      )
    )
  ).alias("result")
).select(
  "result.*"
).toDF(
  "Column_1", "Column_2"
).show
+--------+--------+
|Column_1|Column_2|
+--------+--------+
| A| N1|
| B| N2|
| C| N3|
| D| N4|
| E| N5|
| F| N6|
| G| N7|
| H| null|
| I| N8|
| J| null|
| K| N9|
| L| null|
+--------+--------+
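One subtlety in the chain above is the outer coalesce(..., array()): when both Column_2 and Column_3 are NULL (row H), the inner split yields null, arrays_zip with a null argument returns null, and explode on null produces no rows, so H would vanish. Falling back to an empty array instead lets arrays_zip pad the missing side with nulls. A plain-Java model of that behavior, mirroring Spark's documented arrays_zip semantics (the class and method names here are hypothetical, for illustration only):

```java
import java.util.*;

public class ArraysZipDemo {
    // Models Spark's arrays_zip: null if any input is null; otherwise
    // zips to the longest length, padding the shorter array with nulls.
    static List<String[]> arraysZip(String[] a, String[] b) {
        if (a == null || b == null) return null;
        int n = Math.max(a.length, b.length);
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new String[]{
                i < a.length ? a[i] : null,
                i < b.length ? b[i] : null
            });
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"H"};   // Column_1 split on ","
        // Without the coalesce(..., array()) fallback the second side is
        // null, the zip is null, and explode would drop row H entirely:
        System.out.println(arraysZip(keys, null));  // null
        // With the fallback to an empty array, H survives paired with null:
        System.out.println(Arrays.deepToString(
            arraysZip(keys, new String[0]).toArray()));  // [[H, null]]
    }
}
```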
Here's another way: using the transform function you can iterate over the elements of Column_1 and create a map that you explode later:
df.withColumn(
  "mappings",
  split(coalesce(col("Column_2"), col("Column_3")), ":")
).selectExpr(
  "explode(transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))) as mappings"
).selectExpr(
  "explode(mappings) as (Column_1, Column_2)"
).show()
//+--------+--------+
//|Column_1|Column_2|
//+--------+--------+
//| A| N1|
//| B| N2|
//| C| N3|
//| D| N4|
//| E| N5|
//| F| N6|
//| G| N7|
//| H| null|
//| I| N8|
//| J| null|
//| K| N9|
//| L| null|
//+--------+--------+
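The second approach builds, for each element of Column_1, a one-entry map pairing it with the value at the same index (in Spark, an out-of-range mappings[i] yields null rather than an error), then explodes those maps into (key, value) rows. A plain-Java trace of that pipeline (the `MapExplodeDemo` class and `explodeMaps` method are hypothetical names for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class MapExplodeDemo {
    // Step 1: transform(split(Column_1, ','), (x, i) -> map(x, mappings[i]))
    // Step 2: explode each one-entry map into a (Column_1, Column_2) row.
    static List<String[]> explodeMaps(String col1, String[] mappings) {
        String[] keys = col1.split(",");
        return IntStream.range(0, keys.length)
            .mapToObj(i -> {
                // Spark's mappings[i] is null when i is out of range
                String v = (mappings != null && i < mappings.length)
                        ? mappings[i] : null;
                return Collections.singletonMap(keys[i], v); // map(x, mappings[i])
            })
            .flatMap(m -> m.entrySet().stream())             // explode(mappings)
            .map(e -> new String[]{e.getKey(), e.getValue()})
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Row "K,L" with Column_2 = "N9": L has no partner, so it gets null.
        System.out.println(Arrays.deepToString(
            explodeMaps("K,L", new String[]{"N9"}).toArray()));
        // [[K, N9], [L, null]]
    }
}
```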