How to run a regular expression in Spark SQL

Question

I need to create a DataFrame in which one column holds a name that I extract from a long URL. Suppose I have the following URL:

https://xxx.xxxxxx.com/xxxxx/y...y/?...?/<irrelevant>

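Unfortunately I can't disclose the exact URL, but what I can say is that the x letters stand for strings that never change (i.e. every URL in the database contains these patterns, and they are known), y...y is a username of unknown length that may change from URL to URL, and ?...? is the name I am interested in (again a string of unknown length). It may be followed by several more strings separated by /, which are of no use. How can I do this? So far I have written three different UDFs that use substring and index lookups, but I think that is a very cumbersome solution.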

I am not very familiar with regex or Spark SQL, so even just the regex would be useful.

Thanks

Edit: I think I have the regex worked out; now I just need to find out how to use it.

https:\/\/xxx\.xxxxxx\.com\/xxxxx\/(?:[^0-9\/]+)\/([a-zA-Z]*)

Answer 1

I made some changes to your regex. The regex:

^https:\/\/www\.example\.com\/user=\/(.*?)\/(.*?)(?:\/.*|$)$

It captures two groups:

Group 1 - the username
Group 2 - the name
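
To check the two groups without starting Spark, the pattern above can be tried in plain Scala first. This is only a sketch: extractNames is a hypothetical helper name, and the example.com URLs are the same placeholders used in this answer.

```scala
import scala.util.matching.Regex

// Same pattern as above; triple quotes avoid doubling the backslashes.
val urlPattern: Regex =
  """^https:\/\/www\.example\.com\/user=\/(.*?)\/(.*?)(?:\/.*|$)$""".r

// Hypothetical helper: Some((username, name)) when the whole URL matches, None otherwise.
def extractNames(url: String): Option[(String, String)] = url match {
  case urlPattern(username, name) => Some((username, name))
  case _                          => None
}

println(extractNames("https://www.example.com/user=/username1/name3/asd")) // prints Some((username1,name3))
println(extractNames("https://www.example.com/user="))                     // prints None
```

The snippet is written in script style, so it is meant for the Scala REPL or spark-shell rather than as a compiled class.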

You can use Spark's regexp_extract function to select a regex capture group. For example:

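// Runs as-is in spark-shell, where a SparkSession named `spark` already exists.
import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

// Sample URLs; the last one has no username or name segment.
val df = Seq(
    "https://www.example.com/user=/username1/name3/asd",
    "https://www.example.com/user=/username2/name2",
    "https://www.example.com/user=/username3/name1/asd",
    "https://www.example.com/user="
).toDF("url")

// Backslashes are doubled because this is an ordinary Scala string literal.
val r = "^https:\\/\\/www\\.example\\.com\\/user=\\/(.*?)\\/(.*?)(?:\\/.*|$)$"

// The third argument of regexp_extract is the index of the capture group to return.
df.select(
    $"url",
    regexp_extract($"url", r, 1).as("username"),
    regexp_extract($"url", r, 2).as("name")
).show(false)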

Result:

+-------------------------------------------------+---------+-----+
|url                                              |username |name |
+-------------------------------------------------+---------+-----+
|https://www.example.com/user=/username1/name3/asd|username1|name3|
|https://www.example.com/user=/username2/name2    |username2|name2|
|https://www.example.com/user=/username3/name1/asd|username3|name1|
|https://www.example.com/user=                    |         |     | <- URL does not match, so both columns are empty
+-------------------------------------------------+---------+-----+
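
As the last row shows, regexp_extract returns an empty string when the URL does not match. If that is not what you want, here is a rough sketch of two ways to deal with it, assuming a local SparkSession (in spark-shell, spark already exists). The names matching, extracted and withNulls are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, when}

// Assumption: a local session for experimenting.
val spark = SparkSession.builder().master("local[*]").appName("regex-sketch").getOrCreate()
import spark.implicits._

val r = "^https:\\/\\/www\\.example\\.com\\/user=\\/(.*?)\\/(.*?)(?:\\/.*|$)$"

val df = Seq(
  "https://www.example.com/user=/username1/name3/asd",
  "https://www.example.com/user="
).toDF("url")

// Option 1: keep only the URLs that match the pattern.
val matching = df.filter(col("url").rlike(r))

// Option 2: keep every row, but store null instead of "" when nothing was extracted.
// when(...) without otherwise(...) yields null for rows where the condition is false.
val extracted = regexp_extract(col("url"), r, 2)
val withNulls = df.withColumn("name", when(extracted =!= "", extracted))

matching.show(false)
withNulls.show(false)
```

Note that option 2 also turns a matching URL with an empty name segment into null, which may or may not be what you need.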

P.S. You can use regex101.com to validate your regex.
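
Since the title asks about Spark SQL specifically, the same extraction can also be sketched as a SQL query. Assumptions here: the view name urls is made up for the example, and the pattern is rewritten with [.] instead of \. because, as far as I know, Spark SQL string literals treat the backslash as an escape character by default, so backslashes in a regex would otherwise have to be doubled.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("regex-sql-sketch").getOrCreate()
import spark.implicits._

// Register sample data under a hypothetical view name.
Seq(
  "https://www.example.com/user=/username1/name3/asd",
  "https://www.example.com/user=/username2/name2"
).toDF("url").createOrReplaceTempView("urls")

// [.] stands in for \. and / is left unescaped, so the pattern contains no backslashes.
val pattern = "^https://www[.]example[.]com/user=/(.*?)/(.*?)(?:/.*|$)$"

val result = spark.sql(
  s"""SELECT url,
     |       regexp_extract(url, '$pattern', 1) AS username,
     |       regexp_extract(url, '$pattern', 2) AS name
     |FROM urls""".stripMargin)

result.show(false)
```

The pattern is spliced into the query text here only because it is a fixed constant; do not build SQL this way from user input.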

Solution 1
1 2021-10-22 20:22:55
