Convert String to date SparkSQL

Question

I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below

1)    yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 
2)    yyyyMMddHHmmss(20150611      ) ->yyyy-MM-dd(2015-06-11)

First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.

df.printSchema
root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: long (nullable = true)
 |-- IN_DATE: string (nullable = true)


df.show
Input  
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE        |
+-----+-------+---------+---------+-------------------+-----------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610120256  |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611        |
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611012356  |
+-----+-------+---------+--+------+-------------------+-----------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| null                 |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+

df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
        to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
        .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))

Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11 00:00:00  |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+


Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
+-----+-------+---------+---------+-------------------+----------------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
|   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11           |
|   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
+-----+-------+---------+--+------+-------------------+----------------------+

Answer 1

I'd

Choose more precise data type - here TimestampType .
coalesce with different formats.

import org.apache.spark.sql.functions._

val df = Seq("20150610120256", "20150611").toDF("IN_DATE")

df.withColumn("IN_DATE", coalesce(
  to_timestamp($"IN_DATE", "yyyyMMddHHmmss"), 
  to_timestamp($"IN_DATE", "yyyyMMdd"))).show


+-------------------+
|            IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+

Answer 2

There are several options to achieve a date parser.

Use the built in spark sql function TODATE() . Here's an example of that implementation.
Create a user defined function where you can do different date parsing based on the input format you like, and return the string. Read more about UDF's here .

Answer 3

Try This query

df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)))

Answer 4

2015-06-11 format is spark.sql.types.DateType and 2015-06-10 12:02:56 is spark.sql.types.TimestampType

You can't have two dataType on the same column. A schema should have only one dataType for each columns .

I would suggest you to create two new columns and have the format you desire in them as

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
  .withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))

this will give you dataframe as

+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:00.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+

You can see that the dataType is different

root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: string (nullable = true)
 |-- IN_DATE: string (nullable = true)
 |-- IN_DATE_DateOnly: date (nullable = true)
 |-- IN_DATE_DateAndTime: timestamp (nullable = true)

I hope the answer is helpful

Convert String to date SparkSQL

Question

4 answers

solution1
2 2017-09-02 08:53:33

solution2
1 2017-09-02 07:00:05

solution3
0 2017-09-02 05:48:40

solution4
0 2017-09-02 08:17:42

Convert String to date SparkSQL

Question

4 answers

solution1 2 2017-09-02 08:53:33

solution2 1 2017-09-02 07:00:05

solution3 0 2017-09-02 05:48:40

solution4 0 2017-09-02 08:17:42

solution1
2 2017-09-02 08:53:33

solution2
1 2017-09-02 07:00:05

solution3
0 2017-09-02 05:48:40

solution4
0 2017-09-02 08:17:42