
Split String Column on the Dataset<Row> with comma and get new Dataset<Row>

I am working with Spark SQL (Spark 2.0) and using the Java API to read a CSV file.

The CSV file contains a double-quoted column whose value is itself comma-separated. Ex: "Express Air,Delivery Truck"

Code for reading CSV and returning Dataset:

Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(filename);

Result:

+-----+--------------+--------------------------+
|Year |       State  |                Ship Mode |...
+-----+--------------+--------------------------+
|2012 |New York      |Express Air,Delivery Truck|...
|2013 |Nevada        |Delivery Truck            |...
|2013 |North Carolina|Regular Air,Delivery Truck|...
+-----+--------------+--------------------------+

But I want to split the Ship Mode column into Mode1 and Mode2 columns and return the result as a Dataset.

+-----+--------------+--------------+---------------+
|Year |       State  |     Mode1    |         Mode2 |...
+-----+--------------+--------------+---------------+
|2012 |New York      |Express Air   |Delivery Truck |...
|2013 |Nevada        |Delivery Truck|null           |...
|2013 |North Carolina|Regular Air   |Delivery Truck |...
+-----+--------------+--------------+---------------+

Is there any way I can do this using Java Spark?

I tried a MapFunction, but its call() method does not return a Row. The Ship Mode value is dynamic, i.e. a row in the CSV may contain one ship mode or two.

Thanks.

You can use selectExpr, a variant of select that accepts SQL expressions, like this:

df.selectExpr("Year", "State", "split(`Ship Mode`, ',')[0] as Mode1", "split(`Ship Mode`, ',')[1] as Mode2");

The result is a Dataset of Row.
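
To show why a row with a single ship mode ends up with null in Mode2, here is a minimal plain-Java sketch of the per-row logic. It does not use Spark; the class ShipModeSplit and the helper modeAt are hypothetical names introduced only for illustration. It assumes that an index past the end of the split array yields null, which matches the expected output in the question.

```java
import java.util.Arrays;
import java.util.List;

public class ShipModeSplit {
    // Hypothetical helper: returns the field at the given index of a
    // comma-separated value, or null when the value is null or has
    // no field at that index.
    public static String modeAt(String shipMode, int index) {
        if (shipMode == null) {
            return null;
        }
        // Limit -1 keeps trailing empty fields instead of dropping them.
        String[] parts = shipMode.split(",", -1);
        return (index >= 0 && index < parts.length) ? parts[index] : null;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("Express Air,Delivery Truck", "Delivery Truck");
        for (String row : rows) {
            // Prints the Mode1 and Mode2 values for each sample row.
            System.out.println(modeAt(row, 0) + " | " + modeAt(row, 1));
        }
    }
}
```

For the two sample values, the second row prints "Delivery Truck | null", which corresponds to the Nevada row in the desired output.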

Alternatively, we could:

  • define a User Defined Function (UDF) so that the split operation is done only once
  • use a select expression to map the split column into two new columns

For example, in Scala:

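import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark

// Split once per row; lift returns None for a missing index, which becomes null.
val splitter = udf((str: String) => {
  val splitted = str.split(",").lift
  Array(splitted(0), splitted(1))
})

// Column names follow the question's CSV header ("Year", "State", "Ship Mode").
val dfShipMode = df.select($"Year", $"State", splitter($"Ship Mode") as "modes")
                   .select($"Year", $"State", $"modes"(0) as "Mode1", $"modes"(1) as "Mode2")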

