A regex in SQL or Spark (Scala)
I am a new developer in Spark Scala. I am not familiar with regex, but I want to write one that can extract an ID like this:
abcd_mss5884_mww020_025_b => mss5884
abv_c_e_mss478_mww171_172 => mss478
abv_c_e_mww171_172 => otherwise, return THE SAME input string
So, from the input string, I should return the substring that starts at "mss" and stops at the first "_" found after "mss" (the other underscores should be ignored).
How can I do this? Should I use a regex, and if so, in SQL or in Scala? Or should I just use a simple substring method?
Simply use the regexp_extract function. Something like this:
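import org.apache.spark.sql.functions.{regexp_extract, when}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq("abcd_mss5884_mww020_025_b", "abv_c_e_mss478_mww171_172", "abv_c_e_mww171_172").toDF("input")
// Group 2 captures "mss" plus everything up to the next "_"; regexp_extract returns "" when nothing matches
df.withColumn("ID", regexp_extract($"input", "^(.*)(mss[^_]+)_(.*)$", 2))
  .withColumn("ID", when($"ID" =!= "", $"ID").otherwise($"input")) // no match: fall back to the input string
  .show(false) // false disables truncation of values longer than 20 characters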
+-------------------------+------------------+
|input                    |ID                |
+-------------------------+------------------+
|abcd_mss5884_mww020_025_b|mss5884           |
|abv_c_e_mss478_mww171_172|mss478            |
|abv_c_e_mww171_172       |abv_c_e_mww171_172|
+-------------------------+------------------+
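If the strings are not in a DataFrame, the same rule can be sketched in plain Scala without Spark. This is only an illustration: the object name ExtractId and the helper extractId are made up for this example, and the pattern is a simplified one that does not require an underscore after the ID, so it also matches an ID at the very end of the string.

```scala
import scala.util.matching.Regex

object ExtractId {
  // "mss" followed by one or more characters that are not an underscore.
  // Hypothetical helper pattern, simpler than the one passed to regexp_extract above.
  private val IdPattern: Regex = "mss[^_]+".r

  // Returns the first mss-style ID, or the input unchanged when none is found.
  def extractId(input: String): String =
    IdPattern.findFirstIn(input).getOrElse(input)

  def main(args: Array[String]): Unit = {
    val inputs = Seq(
      "abcd_mss5884_mww020_025_b",
      "abv_c_e_mss478_mww171_172",
      "abv_c_e_mww171_172"
    )
    // Print each input next to the value extracted from it.
    inputs.foreach(s => println(s"$s => ${extractId(s)}"))
  }
}
```

findFirstIn returns an Option, so getOrElse(input) covers the "otherwise, return the same input string" case without a separate emptiness check.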