
Spark DataFrame String type column to Timestamp/Date type column conversion

I have a dataframe with two string columns, c1dt and c2tm, whose formats are yyyymmdd and yyyymmddTHHmmss.SSSz respectively. Now I want to convert these columns into a date type column and a timestamp type column. I tried the following, but it doesn't work: the column values show as null.

val newdf = df.withColumn("c1dt", unix_timestamp($"c1dt", "yyyymmdd").cast("timestamp").cast("date")).withColumn("c2tm", unix_timestamp($"c2tm", "yyyymmddTHHmmss.SSSz").cast("timestamp"))

When I call newdf.show, both columns' values show as null. If I print the original dataframe df, I see the date and timestamp values.

Since your timestamp format is not the default one, your best bet is probably to create a UDF.
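As an aside, a likely reason for the nulls in the original attempt: in a SimpleDateFormat pattern, an unquoted T is not a valid pattern letter, so constructing the format already fails; Spark's unix_timestamp swallows the failure and returns null (and lowercase mm means minute-of-hour, not month). A quick check outside Spark, with the error message hardcoded for clarity:

```scala
import java.text.SimpleDateFormat

// An unquoted 'T' is an illegal SimpleDateFormat pattern character,
// so even constructing the formatter throws.
val bad = try {
  new SimpleDateFormat("yyyymmddTHHmmss.SSSz")
  "ok"
} catch {
  case _: IllegalArgumentException => "Illegal pattern character 'T'"
}
println(bad) // Illegal pattern character 'T'
```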

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{udf, unix_timestamp}

def stringToTs(s: String): Timestamp = {
  // Quote the literal 'T' and use uppercase MM (month-of-year), not mm (minutes)
  val format = new SimpleDateFormat("yyyyMMdd'T'HHmmss.SSSz")
  new Timestamp(format.parse(s).getTime)
}
val stringToTS = udf(stringToTs _)

// c1dt is a plain yyyyMMdd date, so the timestamp UDF's pattern would not
// match it; unix_timestamp with its own pattern handles it directly.
val newdf = df
  .withColumn("c1dt", unix_timestamp($"c1dt", "yyyyMMdd").cast("timestamp").cast("date"))
  .withColumn("c2tm", stringToTS($"c2tm"))
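A quick way to sanity-check the corrected pattern outside Spark is to parse a sample string with SimpleDateFormat directly and reformat it in UTC (the sample value below is made up for illustration):

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Pattern with the literal 'T' quoted and uppercase MM for month-of-year
val inFmt = new SimpleDateFormat("yyyyMMdd'T'HHmmss.SSSz")
val parsed = inFmt.parse("20180101T123045.123GMT")

// Reformat in UTC to confirm every field was read correctly
val outFmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
outFmt.setTimeZone(TimeZone.getTimeZone("GMT"))
println(outFmt.format(parsed)) // 2018-01-01 12:30:45.123
```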

If your data is coming from a CSV, you can specify the timestamp format before you load the data, which will be faster overall:

spark.read
      .format("csv")
      .option("inferSchema", "true") // Automatically infer data types
      .option("timestampFormat", "yyyyMMdd'T'HHmmss.SSSz")
      .load("path")
