I am using Spark 1.6 with Scala.
I have created a DataFrame which looks like the one below.
DATA
SKU, MAKE, MODEL, GROUP, SUBCLS, IDENT
IM, AN4032X, ADH3M032, RM, 1011, 0
IM, A3M4936, MP3M4936, RM, 1011, 0
IM, AK116BC, 3M4936P, 05, ABC, 0
IM, A-116-B, 16ECAPS, RM, 1011, 0
I am doing data validation and want to capture, in a new DataFrame, any record that violates the rules.
Rules:
Column GROUP must be alphabetic characters
Column SUBCLS must be numeric
Column IDENT must be 0
The new DataFrame will look like this:
AUDIT TABLE
SKU, MAKE, AUDIT_SKU, AUDIT_MAKE, AUDIT_MODEL, AUDIT_GRP, AUDIT_SUBCLS, AUDIT_IDENT
IM, A-K12216BC, N, N, N, Y, Y, N
Y represents a rule violation and N represents a pass.
I have validated the rules using isNull or a regex; for example, checking column GROUP using
regex: df.where($"GROUP".rlike("^[A-Za-z]+$")).show
Could someone please help me do this in Spark SQL? Is it possible to create a DataFrame for the above scenario?
Thanks
You can test each column with rlike inside when/otherwise:

scala> val audited = df.
         withColumn("Group1", when($"GROUP".rlike("^[A-Za-z]+$"), "N").otherwise("Y")).
         withColumn("SUBCLS1", when($"SUBCLS".rlike("^[0-9]+$"), "N").otherwise("Y")).
         withColumn("IDENT1", when($"IDENT" === "0", "N").otherwise("Y"))
scala> audited.show()
+---+-------+--------+-----+------+-----+------+-------+------+
|SKU| MAKE| MODEL|GROUP|SUBCLS|IDENT|Group1|SUBCLS1|IDENT1|
+---+-------+--------+-----+------+-----+------+-------+------+
| IM|AN4032X|ADH3M032| RM| 1011| 0| N| N| N|
| IM|A3M4936|MP3M4936| 1RM| 1011| 0| Y| N| N|
| IM|AK116BC| 3M4936P| 05| ABC| 0| Y| Y| N|
| IM|A-116-B| 16ECAPS| RM1| 1011| 0| Y| N| N|
+---+-------+--------+-----+------+-----+------+-------+------+
I wrote each check into a new *1 column for clarity only; you can overwrite the original columns instead. Let me know if you need any further help.
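Since you asked about Spark SQL specifically, here is a minimal sketch of the same checks written as a SQL query. It assumes the Spark 1.6 API (sqlContext, registerTempTable); "skus" is an arbitrary temp-table name chosen here, and GROUP is backtick-quoted because it is a reserved word in SQL:

```scala
// Sketch only: the same validation expressed in Spark SQL (Spark 1.6 API).
// Assumes `df` and `sqlContext` from the question; "skus" is a hypothetical
// temp-table name.
df.registerTempTable("skus")

val audit = sqlContext.sql("""
  SELECT SKU, MAKE,
         CASE WHEN `GROUP` RLIKE '^[A-Za-z]+$' THEN 'N' ELSE 'Y' END AS AUDIT_GRP,
         CASE WHEN SUBCLS RLIKE '^[0-9]+$'    THEN 'N' ELSE 'Y' END AS AUDIT_SUBCLS,
         CASE WHEN IDENT = '0'                THEN 'N' ELSE 'Y' END AS AUDIT_IDENT
  FROM skus
""")

// Keep only the records that violate at least one rule
audit.where("AUDIT_GRP = 'Y' OR AUDIT_SUBCLS = 'Y' OR AUDIT_IDENT = 'Y'").show()
```

The RLIKE patterns match the whole value ($-anchored), so a mixed value like "RM1" fails the alphabetic check and "12AB" fails the numeric check.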