
Spark Scala - Duration to Mins on Spark Dataframe column

I have a DataFrame with a column that holds a duration as an ISO-8601 string, e.g. PT2H. I want to create a new column minutes_int. For a single value this can be done in plain Scala using -

import java.time.Duration
Duration.parse("PT2H").toMinutes()

How can I do this on the entire column? I get an error when I do -

jsonDF.withColumn("minutes_int", Duration.parse(col("duration_str")).toMinutes())

Error -

error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: CharSequence

How can I fix this?

You can use a User Defined Function (UDF) to do this, though note that UDFs are opaque to Catalyst and don't get optimised, so you may benefit from writing a version that uses only Spark's built-in functions (see the sketch at the end of this answer).

import java.time.Duration
import org.apache.spark.sql.functions.{col, udf}

// Parse an ISO-8601 duration string and return the whole number of minutes.
def durationToMinutes(duration: String): Long = Duration.parse(duration).toMinutes

// Wrap the plain Scala function as a Spark UDF.
val durationToMinutesUDF = udf(durationToMinutes _)

And then to use it...

jsonDF.withColumn("minutes_int", durationToMinutesUDF(col("duration_str")))

Note that you can also register the function so it can be used from SQL, i.e.

spark.udf.register("duration_to_minutes", durationToMinutesUDF)
jsonDF.createOrReplaceTempView("json_df")
spark.sql("select duration_to_minutes(duration_str) from json_df")
