How to do regEx in Spark SQL

I have to create a data frame in which one column holds a name that I extract from a long URL. Let's say I have the following URL:

https://xxx.xxxxxx.com/xxxxx/y...y/?...?/<irrelevant>

Unfortunately I can't disclose the exact URLs, but the parts marked x are fixed strings that do not change (i.e. all URLs in the database contain those known patterns), y...y is a username of unknown length that may change with each URL, and ?...? is the name I am interested in (again a string of unknown length). After that there may be multiple further strings separated by /, which are not useful. How exactly would I do that? Up until now I have used three different UDFs that rely on substrings and indexes, but I think that's a very cumbersome solution.

I am not very familiar with regex or with Spark SQL, so even just the regex would be useful.

Thanks

Edit: I think I have the regex down; now I just need to find out how to use it.

https:\/\/xxx\.xxxxxx\.com\/xxxxx\/(?:[^0-9\/]+)\/([a-zA-Z]*)

I have modified your regex a bit. Regex:

^https:\/\/www\.example\.com\/user=\/(.*?)\/(.*?)(?:\/.*|$)$

It captures two groups:

  • 1st group - username
  • 2nd group - some name
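
As a quick sanity check outside Spark, the two capture groups can be tried in plain Scala first. This is only a sketch under assumptions: extractNames is a hypothetical helper introduced for illustration, the URLs are the example.com placeholders used in this answer, and scala.util.matching.Regex stands in for Spark here. Unlike Spark's regexp_extract, the pattern match below returns None (rather than an empty string) when the URL does not match.

```scala
import scala.util.matching.Regex

// Same pattern as above, written as a triple-quoted Scala string so that
// backslashes do not need to be doubled.
val pattern: Regex =
  """^https://www\.example\.com/user=/(.*?)/(.*?)(?:/.*|$)$""".r

// Hypothetical helper: returns (username, name) when the whole URL matches
// the pattern, and None otherwise.
def extractNames(url: String): Option[(String, String)] = url match {
  case pattern(username, name) => Some((username, name))
  case _                       => None
}

// Placeholder URLs; the last one has no username or name segment.
val samples = Seq(
  "https://www.example.com/user=/username1/name3/asd",
  "https://www.example.com/user=/username2/name2",
  "https://www.example.com/user="
)

samples.foreach(url => println(s"$url -> ${extractNames(url)}"))
```

The first two sample URLs yield a (username, name) pair, while the last one yields None because there is nothing after user= for the groups to capture.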

You can use Spark's regexp_extract function to select regex capture groups. E.g.:

// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

// Sample URLs; the last one has no username or name segment
val df = Seq(
    ("https://www.example.com/user=/username1/name3/asd"),
    ("https://www.example.com/user=/username2/name2"),
    ("https://www.example.com/user=/username3/name1/asd"),
    ("https://www.example.com/user=")
).toDF("url")

// Group 1 = username, group 2 = name
val r = "^https:\\/\\/www\\.example\\.com\\/user=\\/(.*?)\\/(.*?)(?:\\/.*|$)$"

// Extract each capture group into its own column
df.select(
    $"url",
    regexp_extract($"url", r, 1).as("username"),
    regexp_extract($"url", r, 2).as("name")
).show(false)

Result:

+-------------------------------------------------+---------+-----+
|url                                              |username |name |
+-------------------------------------------------+---------+-----+
|https://www.example.com/user=/username1/name3/asd|username1|name3|
|https://www.example.com/user=/username2/name2    |username2|name2|
|https://www.example.com/user=/username3/name1/asd|username3|name1|
|https://www.example.com/user=                    |         |     | <- URL does not match, so both columns are empty
+-------------------------------------------------+---------+-----+

PS: you can use regex101.com to validate your regular expressions.
