Multiple formats in Date Time column in Spark
I am using Spark 3.0.1.
I have the following CSV data:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see, the second-to-last column contains timestamps whose hour field may be one or two digits, depending on the hour of the day (this is sample data; not every record has a zero time part).
That is the problem. I tried to solve it as follows:
Read the column as a String, then convert it to Timestamp type using column functions.
val schema = StructType(
  List(
    StructField("_corrupt_record", StringType)
    , StructField("card_id", LongType)
    , StructField("member_id", LongType)
    , StructField("amount", IntegerType)
    , StructField("postcode", IntegerType)
    , StructField("pos_id", LongType)
    , StructField("transaction_dt", StringType)
    , StructField("status", StringType)
  )
)
// Format the timestamp column: try each candidate format in order
// and keep the first successful parse.
def format_time_column(timeStampCol: Column,
                       formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "dd-MM-yyyy H:mm:ss",
                                                  "dd-MM-yyyy HH:m:ss", "dd-MM-yyyy H:m:ss")): Column = {
  // to_timestamp yields null when the string does not match the format,
  // so coalesce returns the first format that parses.
  coalesce(
    formats.map(f => to_timestamp(timeStampCol, f)): _*
  )
}
val cardTransaction = spark.read
  .format("csv")
  .option("header", false)
  .schema(schema)
  .option("path", cardTransactionFilePath)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load
  .withColumn("transaction_dt", format_time_column(col("transaction_dt")))

cardTransaction.cache()
cardTransaction.show(5)
This code fails with a parsing error on the transaction_dt column.
How can this be solved?
In Spark 3.0, pattern strings are defined in Datetime Patterns for Formatting and Parsing, which is implemented via DateTimeFormatter under the hood.
In Spark 2.4 and below, java.text.SimpleDateFormat was used for timestamp/date string conversion, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY:
sparkConf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
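For completeness, a minimal sketch of applying the fix; the builder settings are illustrative, and only the timeParserPolicy setting comes from the answer above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("card-transactions")
  // Restore the Spark 2.4 SimpleDateFormat-based parsing behavior.
  .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
  .getOrCreate()

// The policy is a runtime SQL conf, so it can also be set on an existing session:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

With LEGACY in effect, each to_timestamp(column, format) returns null when the format does not match, so the coalesce in format_time_column picks the first format that parses, and the mixed one- and two-digit hours are handled.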