I have a DataFrame with a column of ISO-8601 duration strings like PT2H. I want to create a new column minutes_int, which in plain Scala can be done with:
import java.time.Duration
Duration.parse("PT2H").toMinutes()
How can I apply this to the entire column? I get an error when I do:
jsonDF.withColumn("minutes_int", Duration.parse(col("duration_str")).toMinutes())
Error -
error: type mismatch;
found : org.apache.spark.sql.Column
required: CharSequence
How can I fix this?
You can use a User Defined Function (UDF) to do this. Note, though, that UDFs are a black box to Spark's optimiser, so for large datasets you may benefit from writing an equivalent using only Spark's built-in column functions.
import java.time.Duration
import org.apache.spark.sql.functions.udf
def durationToMinutes(duration: String): Long = Duration.parse(duration).toMinutes
val durationToMinutesUDF = udf(durationToMinutes _)
And then to use it...
jsonDF.withColumn("minutes_int", durationToMinutesUDF(col("duration_str")))
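The underlying conversion can be sanity-checked without Spark at all, since java.time.Duration does the ISO-8601 parsing on its own:

```scala
import java.time.Duration

// Same function as wrapped by the UDF above; java.time.Duration
// understands ISO-8601 duration strings like PT2H or PT1H30M.
def durationToMinutes(duration: String): Long =
  Duration.parse(duration).toMinutes

println(durationToMinutes("PT2H"))    // 120
println(durationToMinutes("PT1H30M")) // 90
```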
Note that you can also register the UDF so you can use it in SQL, i.e.
spark.udf.register("duration_to_minutes", durationToMinutesUDF)
jsonDF.createOrReplaceTempView("json_df")
spark.sql("select duration_to_minutes(duration_str) from json_df")
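If you want to avoid the UDF entirely, here is a possible Spark-only sketch using built-in column functions. It is an assumption on my part that your durations only contain hour and minute components (PT2H, PT45M, PT1H30M, etc.); days or seconds would need extra terms. The sample data standing in for jsonDF is hypothetical.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, regexp_extract}

val spark = SparkSession.builder.master("local[1]").appName("durations").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for jsonDF.
val jsonDF = Seq("PT2H", "PT1H30M", "PT45M").toDF("duration_str")

// Pull one numeric component out of the string (e.g. the "2" in "PT2H").
// regexp_extract returns "" when there is no match, which casts to null,
// hence the coalesce back to 0.
def component(c: Column, pattern: String): Column =
  coalesce(regexp_extract(c, pattern, 1).cast("long"), lit(0L))

val withMinutes = jsonDF.withColumn(
  "minutes_int",
  component(col("duration_str"), "(\\d+)H") * 60 +
    component(col("duration_str"), "(\\d+)M"))

withMinutes.show() // PT2H -> 120, PT1H30M -> 90, PT45M -> 45
```

Because everything here is a Column expression, Catalyst can see through it and optimise it, unlike the UDF.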