I have a dataframe with column as Date along with few other columns.
I wanted to validate Date column value and check if the format is of "dd/MM/yyyy". If Date column holds any other format than should mark it as bad record. So I am using option option("dateFormat", "dd/MM/yyyy") to accept date in mentioned format and it accepts the date properly in format "dd/MM/yyyy", but if I pass invalid format (YYYY/mm/dd) still record is not marking as invalid and passed date is converting to garbage
Input File:
colData1,2020/05/07,colData2,colData3
colData4,2020/05/07,colData5,colData6
colData7,2020/05/07,colData8,colData9
df = spark.read.format(
"com.databricks.spark.csv").schema(customSchema).option(
"escape", '\"').option(
"quote", '\"').option(
"header", "false").option(
"dateFormat", "dd/MM/yyyy").option(
"columnNameOfCorruptRecord","badRecords").csv(
rdd)
df.show()
DataFrame O/P:-
+--------+----------+--------+--------+----------+-----+
| OMIC| SMIC| VCLP| VName|badRecords|rowId|
+--------+----------+--------+--------+----------+-----+
|colData1|0012-11-09|colData2|colData3| null| 0|
|colData4|0012-11-09|colData5|colData6| null| 1|
|colData7|0012-11-09|colData8|colData9| null| 2|
+--------+----------+--------+--------+----------+-----+
Please suggest
You can't achieve it with dateFormat
option, it accepts 2020/05/07
but changes format to invalid 0012-11-09
. You can try to change SMIC
column type to StringType
in your schema and then convert it to date with correct format using function to_date
.
from pyspark.sql import functions as f
df2 = df.withColumn("badRecord", f.when(f.to_date(f.col("SMIC"),"dd/MM/yyyy").isNotNull, False).otherwise(True))
output:
+--------+----------+--------+--------+---------+
| OMIC| SMIC| VCLP| VName|badRecord|
+--------+----------+--------+--------+---------+
|colData1|2020/05/07|colData2|colData3| true|
|colData4|2020/05/07|colData5|colData6| true|
|colData7|2020/05/07|colData8|colData9| true|
|colData7|07/05/2020|colData8|colData9| false|
+--------+----------+--------+--------+---------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.