I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below
1) yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56)
2) yyyyMMddHHmmss(20150611 ) ->yyyy-MM-dd(2015-06-11)
First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: string (nullable = true)
df.show
Input
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+-----------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610120256 |
| F | 000544| 2017002| LD | 95032015062763261| 20150611 |
| F | 000544| 2017002| AK | 95037854336743246| 20150611012356 |
+-----+-------+---------+--+------+-------------------+-----------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| null |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+--+------+-------------------+----------------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+--+------+-------------------+----------------------+
Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+--+------+-------------------+----------------------+
I'd
TimestampType
. coalesce
with different formats. import org.apache.spark.sql.functions._
val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
df.withColumn("IN_DATE", coalesce(
to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
There are several options to achieve a date parser.
TODATE()
. Here's an example of that implementation. Try This query
df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)))
2015-06-11
format is spark.sql.types.DateType
and 2015-06-10 12:02:56
is spark.sql.types.TimestampType
You can't have two dataType on the same column. A schema should have only one dataType for each columns .
I would suggest you to create two new columns and have the format you desire in them as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))
this will give you dataframe
as
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE |SQ_CODE|RE_TYPE|VERY_ID |IN_DATE |IN_DATE_DateOnly|IN_DATE_DateAndTime |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F |000544|2017002|OP |95032015062763298|20150610120256|null |2015-06-10 12:02:00.0|
|F |000544|2017002|LD |95032015062763261|20150611 |2015-06-11 |null |
|F |000544|2017002|AK |95037854336743246|20150611012356|null |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the dataType is different
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
I hope the answer is helpful
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.