A regex in SQL or Spark (Scala)

Question

I am new to Spark with Scala and not familiar with regular expressions. I want to write a regex that extracts an ID like this:

abcd_mss5884_mww020_025_b => mss5884
abv_c_e_mss478_mww171_172 => mss478
abv_c_e_mww171_172 => abv_c_e_mww171_172 (no "mss" ID, so return the same input string)

So, given the input string, I need to return the substring that starts at "mss" and stops at the first "_" after it (all other underscores should be ignored).

How can I do this? Should I use a regex, and if so, in SQL or in Scala? Or should I just use a simple substring method?

Answer 1

Simply use the regexp_extract function. Something like this:

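import org.apache.spark.sql.functions.{regexp_extract, when}
import spark.implicits._ // for toDF and $"..."; assumes a SparkSession named spark, as in spark-shell

val df = Seq("abcd_mss5884_mww020_025_b", "abv_c_e_mss478_mww171_172", "abv_c_e_mww171_172").toDF("input")

// Group 2 captures "mss" plus everything up to the next underscore; no match yields "".
df.withColumn("ID", regexp_extract($"input", "^(.*)(mss[^_]+)_(.*)$", 2))
  // Fall back to the original string when nothing matched.
  .withColumn("ID", when($"ID" =!= "", $"ID").otherwise($"input"))
  .show(false) // false disables truncation of values longer than 20 characters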

+-------------------------+------------------+
|input                    |ID                |
+-------------------------+------------------+
|abcd_mss5884_mww020_025_b|mss5884           |
|abv_c_e_mss478_mww171_172|mss478            |
|abv_c_e_mww171_172       |abv_c_e_mww171_172|
+-------------------------+------------------+
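
If you want to check the extraction rule outside of Spark first, here is a minimal plain-Scala sketch of the same idea. The names ExtractId, IdPattern and extractId are made up for illustration and are not part of any library. It also assumes that an ID at the very end of the string (with no trailing "_") should still count, which the pattern above would not match because it requires an underscore after the ID.

```scala
import scala.util.matching.Regex

object ExtractId {
  // Hypothetical helper: "mss" followed by one or more characters that are not "_".
  val IdPattern: Regex = "mss[^_]+".r

  // Returns the first "mss..." token, or the input unchanged when there is none.
  def extractId(input: String): String =
    IdPattern.findFirstIn(input).getOrElse(input)

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "abcd_mss5884_mww020_025_b",
      "abv_c_e_mss478_mww171_172",
      "abv_c_e_mww171_172"
    )
    // Print each input next to the extracted ID.
    samples.foreach(s => println(s"$s => ${extractId(s)}"))
  }
}
```

The fallback is handled by getOrElse here, so no separate "when/otherwise" step is needed. Since the rule is a single pattern match, a regex is probably simpler than chaining indexOf and substring calls.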
