
Extract a substring in Scala and store it in a DataFrame

I am trying to split a string in Scala and store the result in a DataFrame to use with Apache Spark. The strings I have are the following:

fromTo: NT=xxx_bt_bsns_m,OD=ntis,OS=wnd,SX=xs,SZ=ddp,
fromTo: NT=xds_bt2_bswns_m,OD=nis,OS=wnd,SX=xs,SZ=ddp,
fromTo: NT=xxa_bt1_b1ns_m,OD=nts,OS=nd,SX=xs,SZ=ddp

I just want to get the following substrings:

xxx_bt_bsns_m
xds_bt2_bswns_m
xxa_bt1_b1ns_m

and then store them in a DataFrame that shows something like:

+--------------------+
|       Name         |
+--------------------+
|   xxx_bt_bsns_m    |
|  xds_bt2_bswns_m   |
|   xxa_bt1_b1ns_m   |
+--------------------+

So should I extract every substring that starts with NT= and ends with a ",", perhaps using a regex pattern, and then store the results in a DataFrame?

I am just starting with Scala, which is why I am unsure about this.

Thanks in advance!

You can do this using a UDF:

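import org.apache.spark.sql.functions.udf

val rgx = "^fromTo: NT=([a-zA-Z0-9_]+),(.*)".r

// Return the captured NT value; None (a null cell) for lines that do not match, instead of a MatchError
val udfToExtract = udf { str: String => str match { case rgx(gr1, _) => Some(gr1); case _ => None } }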

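Applied to the text column, for example with df.withColumn("textNew", udfToExtract($"text")).show(false), it gives: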

+-----------------------------------------------------+-------------+
|text                                                 |textNew      |
+-----------------------------------------------------+-------------+
|fromTo: NT=xxx_bt_bsns_m,OD=ntis,OS=wnd,SX=xs,SZ=ddp,|xxx_bt_bsns_m|
+-----------------------------------------------------+-------------+
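To check how the pattern behaves on all three sample lines without starting Spark, here is a minimal plain-Scala sketch. The object name NtRegexCheck and the helper extractName are made up for illustration; only the regex comes from the answer. A pattern match with no fallback case throws scala.MatchError on a line that does not fit, so the sketch returns an Option instead.

```scala
object NtRegexCheck {
  // Same pattern as in the answer; group 1 is the NT value.
  val rgx = "^fromTo: NT=([a-zA-Z0-9_]+),(.*)".r

  // Hypothetical helper: Some(name) when the whole line matches, None otherwise.
  def extractName(line: String): Option[String] = line match {
    case rgx(name, _) => Some(name)
    case _            => None
  }

  // The three sample lines from the question.
  val samples: Seq[String] = Seq(
    "fromTo: NT=xxx_bt_bsns_m,OD=ntis,OS=wnd,SX=xs,SZ=ddp,",
    "fromTo: NT=xds_bt2_bswns_m,OD=nis,OS=wnd,SX=xs,SZ=ddp,",
    "fromTo: NT=xxa_bt1_b1ns_m,OD=nts,OS=nd,SX=xs,SZ=ddp"
  )

  def main(args: Array[String]): Unit =
    samples.flatMap(extractName).foreach(println)
}
```

Note that a Scala Regex used as an extractor has to match the entire string, which is why the trailing (.*) group is needed here.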

Or using regexp_extract:

df.select(regexp_extract($"text", "^fromTo: NT=([a-zA-Z0-9_]+),(.*)", 1).as("textNew")).show()

It also gives:

+-------------+
|      textNew|
+-------------+
|xxx_bt_bsns_m|
+-------------+
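Putting it together for the three lines in the question, a self-contained sketch might look like the following. It assumes a local SparkSession and builds the DataFrame from an in-memory Seq; the object name ExtractNtNames, the method namesFrame, and the app name are hypothetical, and the column is called Name only because the question asks for that header.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.regexp_extract

object ExtractNtNames {
  // The three sample lines from the question.
  val samples: Seq[String] = Seq(
    "fromTo: NT=xxx_bt_bsns_m,OD=ntis,OS=wnd,SX=xs,SZ=ddp,",
    "fromTo: NT=xds_bt2_bswns_m,OD=nis,OS=wnd,SX=xs,SZ=ddp,",
    "fromTo: NT=xxa_bt1_b1ns_m,OD=nts,OS=nd,SX=xs,SZ=ddp"
  )

  // Builds a one-column DataFrame "Name" holding capture group 1 of the pattern.
  def namesFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables toDF and the $"col" syntax
    samples.toDF("text")
      .select(regexp_extract($"text", "^fromTo: NT=([a-zA-Z0-9_]+),(.*)", 1).as("Name"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for trying the example out on one machine.
    val spark = SparkSession.builder().master("local[*]").appName("extract-nt-names").getOrCreate()
    namesFrame(spark).show(false) // false = do not truncate long values
    spark.stop()
  }
}
```

Unlike the UDF, regexp_extract returns an empty string for rows where the pattern does not match, so such rows may need to be filtered out afterwards. The built-in function is usually preferable to a UDF because Spark can optimize it.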


 