
How to concatenate null columns in a Spark DataFrame in Java?

I am working with Spark in Java and I want to create a column that is a concatenation of all the other column values, separated by commas. I have tried a few ways to do this but couldn't find a solution.

For example:

col1 | col2 | col3 | result
 1   | john | 2.3  | 1,john,2.3

The problem I'm facing: if a column value is null, then simply nothing should appear in its place, but so far I have not achieved this.

For example:

col1 | col2 | col3 | result
 1   | null | 2.3  | 1,,2.3  -> this is what I'm trying to achieve
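For concreteness, here is a minimal Java setup that produces this input (a sketch; the SparkSession name spark and the local master are assumptions, not part of the original question):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

// schema matching the example: col1 int, col2 string, col3 double
StructType schema = new StructType()
        .add("col1", DataTypes.IntegerType)
        .add("col2", DataTypes.StringType)
        .add("col3", DataTypes.DoubleType);

List<Row> rows = Arrays.asList(
        RowFactory.create(1, "john", 2.3),
        RowFactory.create(1, null, 2.3));

Dataset<Row> df = spark.createDataFrame(rows, schema);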

What I have tried:

1.) The concat function of Spark SQL: if any column value is null, the entire result of concat becomes null, which is not what I want.

2.) The concat_ws function simply ignores null values, so it can't be used either: in my second example the result would be 1,2.3 instead of 1,,2.3. (A sketch reproducing cases 1 and 2 follows this list.)

3.) coalesce requires matching data types, so something like coalesce(col_name,'') inside concat throws a data-type mismatch error when col_name is a numeric column.

4.) A CASE WHEN statement also requires the same data type in the THEN and ELSE branches, so CASE WHEN column1 IS NULL THEN '' ELSE column1 END throws an error if column1 is numeric, because '' is an empty string.
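The first two failure modes are easy to reproduce on the sample df built above (a minimal sketch; expr is org.apache.spark.sql.functions.expr):

import static org.apache.spark.sql.functions.expr;

// 1) concat: a single null operand makes the whole result null
df.select(expr("concat(col1, ',', col2, ',', col3)").as("result")).show();
// row with col2 = null  ->  result = null

// 2) concat_ws: nulls are dropped entirely, so a separator goes missing
df.select(expr("concat_ws(',', col1, col2, col3)").as("result")).show();
// row with col2 = null  ->  result = 1,2.3 (not 1,,2.3)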

One way to achieve this is by writing a map function. But is there a way to do it in Spark SQL?

You can cast all your fields to String so that you can run NVL on them and set them to empty string '' if they're null.

To generate the expression for all the columns in your dataframe automatically, you can use the Stream API's map function:

df.show();
//+----+----+----+
//|col1|col2|col3|
//+----+----+----+
//|   1|John| 2.3|
//|   2|null| 2.3|
//+----+----+----+


import java.util.Arrays;
import java.util.List;
import static java.util.stream.Collectors.joining;
import static org.apache.spark.sql.functions.expr;

// df.columns() returns String[] in Java, so wrap it in a List first
List<String> columns = Arrays.asList(df.columns());
// ["col1", "col2", "col3"]

String nvlExpr = columns.stream()
        .map(c -> "nvl(cast(" + c + " as string),'')")
        .collect(joining(",", "concat_ws(',',", ")"));

// concat_ws(',',nvl(cast(col1 as string),''),nvl(cast(col2 as string),''),nvl(cast(col3 as string),''))

df.withColumn("result", expr(nvlExpr)).show();

//+----+----+----+----------+
//|col1|col2|col3|    result|
//+----+----+----+----------+
//|   1|John| 2.3|1,John,2.3|
//|   2|null| 2.3|    2,,2.3|
//+----+----+----+----------+

This way you can also concat, handling the null case with a CASE WHEN in a subquery:

scala> spark.sql("select *, concat(cast(col1 as string),',',col2,',',cast(col3 as string)) as result from (select col1, col3, case when col2 is null then '' else col2 end as col2 from final)").show()
//+----+----+----+--------+
//|col1|col3|col2|  result|
//+----+----+----+--------+
//|   1| 2.3|  JK|1,JK,2.3|
//|   1| 2.3|    |  1,,2.3|
//+----+----+----+--------+
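For completeness, a Java version of the same query (a sketch; it assumes the data has been registered as a temp view named final, as in the Scala snippet above):

// register the sample df under the view name used in the query
df.createOrReplaceTempView("final");

Dataset<Row> result = spark.sql(
    "select *, concat(cast(col1 as string), ',', col2, ',', cast(col3 as string)) as result "
  + "from (select col1, col3, case when col2 is null then '' else col2 end as col2 from final)");

result.show();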
