简体   繁体   中英

Convert String to date SparkSQL

I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below

1)    yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 
2)    yyyyMMddHHmmss(20150611      ) ->yyyy-MM-dd(2015-06-11)

First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.

df.printSchema
root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: long (nullable = true)
 |-- IN_DATE: string (nullable = true)


df.show
Input  
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE        |
+-----+-------+---------+---------+-------------------+-----------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610120256  |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611        |
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611012356  |
+-----+-------+---------+--+------+-------------------+-----------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| null                 |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11 00:00:00  |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+


Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11           |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+

I'd

  • Choose more precise data type - here TimestampType .
  • coalesce with different formats.
import org.apache.spark.sql.functions._

val df = Seq("20150610120256", "20150611").toDF("IN_DATE")

df.withColumn("IN_DATE", coalesce(
  to_timestamp($"IN_DATE", "yyyyMMddHHmmss"), 
  to_timestamp($"IN_DATE", "yyyyMMdd"))).show


+-------------------+
|            IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+

There are several options to achieve a date parser.

  1. Use the built in spark sql function TODATE() . Here's an example of that implementation.
  2. Create a user defined function where you can do different date parsing based on the input format you like, and return the string. Read more about UDF's here .

Try This query

df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)))

2015-06-11 format is spark.sql.types.DateType and 2015-06-10 12:02:56 is spark.sql.types.TimestampType

You can't have two dataType on the same column. A schema should have only one dataType for each columns .

I would suggest you to create two new columns and have the format you desire in them as

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
  .withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)) 

this will give you dataframe as

+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:00.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+

You can see that the dataType is different

root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: string (nullable = true)
 |-- IN_DATE: string (nullable = true)
 |-- IN_DATE_DateOnly: date (nullable = true)
 |-- IN_DATE_DateAndTime: timestamp (nullable = true)

I hope the answer is helpful

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM