简体   繁体   中英

How to do regEx in Spark SQL

I have to create a data frame in which the rows in one column should be a name I extract from a long URL. Let's say I have the following url:

https://xxx.xxxxxx.com/xxxxx/y...y/?...?/<irrelevant>

Now unfortunately I can't disclose the exact URLs but what I can say is that the letters x contain strings that don't change (ie all URLs in the database contain those patterns and are known), the y...y is an username that is unknown, with unknown length and may change with each URL and the ?...? is the name in which I am interested in (again a string with unknown length). After that there may be multiple strings separated by / which are not useful. How exactly would I do that? Up until now I used to do three different UDFs which use substrings and indexes but I think that's a very cumbersome solution.

I am not very familiar with Regex or with Spark SQL, so even just the regex would be useful.

Thanks

Edit: I think I got the regex down, now I just need to find out how to use it.

https:\/\/xxx\.xxxxxx\.com\/xxxxx\/(?:[^0-9\/]+)\/([a-zA-z]*)

I have a bit modified your regex. Regex:

^https:\/\/www\.example\.com\/user=\/(.*?)\/(.*?)(?:\/.*|$)$

It will capture two groups:

  • 1st group - username
  • 2nd group - some name

You can use regexp_extract spark function for selecting regex capture groups. Eg

import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

val df = Seq(
    ("https://www.example.com/user=/username1/name3/asd"),
    ("https://www.example.com/user=/username2/name2"),
    ("https://www.example.com/user=/username3/name1/asd"),
    ("https://www.example.com/user=")
).toDF("url")

val r = "^https:\\/\\/www\\.example\\.com\\/user=\\/(.*?)\\/(.*?)(?:\\/.*|$)$"

df.select(
    $"url",
    regexp_extract($"url", r, 1).as("username"),
    regexp_extract($"url", r, 2).as("name")
).show(false)

Result:

+-------------------------------------------------+---------+-----+
|url                                              |username |name |
+-------------------------------------------------+---------+-----+
|https://www.example.com/user=/username1/name3/asd|username1|name3|
|https://www.example.com/user=/username2/name2    |username2|name2|
|https://www.example.com/user=/username3/name1/asd|username3|name1|
|https://www.example.com/user=                    |         |     | <- not correct url
+-------------------------------------------------+---------+-----+

PS you can use regex101.com for validating your regular expressions

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM