Convert multiple values from column to rows using Regex
I am new to the world of Regex and am trying to extract a pattern from a free-form text column of a table.
There are two things that I am trying to achieve:
1. Extract multiple occurrences of a pattern. The pattern that I am trying to extract is a URL, that is, a string that starts with http or https.
2. After finding multiple occurrences of the URL, I would have to explode them to multiple rows.
Input Table
ip_table (user_id, notes)
(123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on')
(234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)')
Output Table
op_table(user_id, urls)
(123, 'https://123.com/asda/32/1221')
(123, 'http://www.facebook.com/page1')
(234, 'http://www.instagram.com/page3/12321213')
(234, 'https://example.com/1233/qwerty')
Here is what I have so far for the regexp, with no success.
select user_id, regexp_substr(notes, '(https?)://.*[\s]')
Can you please give me some direction on how I could find repeated occurrences of URL patterns? The only thing I would need to check is whether there is something that starts with http or https, capture that pattern, and repeat this for every occurrence in the notes column. Once I find those strings, I would have to explode them to multiple rows with the matching user ids.
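A note on why the attempt above returns too much: `.*` is greedy, so the match should run from the first `http` to the last whitespace character in the note, and without an occurrence argument regexp_substr returns only one match. The sketch below contrasts the original pattern with a bounded character class and uses the position and occurrence arguments (the 3rd and 4th arguments of regexp_substr) to pick out individual URLs. It assumes Vertica syntax; the aliases greedy_match, first_url, second_url and t are illustrative names, not part of the original tables.

```sql
-- Sketch only: runs the original pattern and a bounded alternative on one sample note.
-- Aliases greedy_match, first_url, second_url and t are illustrative.
SELECT
  regexp_substr(notes, '(https?)://.*[\s]')       AS greedy_match  -- expected to span both URLs
, regexp_substr(notes, 'https?://[^\s]+', 1, 1)   AS first_url     -- 1st occurrence, from position 1
, regexp_substr(notes, 'https?://[^\s]+', 1, 2)   AS second_url    -- 2nd occurrence, from position 1
FROM (
  SELECT 'he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on' AS notes
) t;
```

Replacing `.*` with `[^\s]+` stops each match at the first whitespace character, which is what makes the occurrence argument useful for walking through the URLs one at a time.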
Try this:
WITH
input(id,str) AS (
            SELECT 123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on'
  UNION ALL SELECT 234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)'
)
-- create a series of 4 integers in Vertica
-- keep this code-snippet as you'll need it often
-- without Vertica, UNION SELECT 1, 2, 3 and 4
,l(l) AS (
            SELECT TIMESTAMPADD(us, 1 , DATE '2000-01-01' )
  UNION ALL SELECT TIMESTAMPADD(us, 4 , DATE '2000-01-01' )
)
,i(i) AS (
  SELECT
    MICROSECOND(ts)
  FROM l
  TIMESERIES ts AS '1 us' OVER(ORDER BY l)
)
-- end create series of 4 integers: CTE i, column i.
-- verticalise the series of tokens found - one row per URL
-- CROSS JOIN input with series of integers to extract the i-th occurrence
SELECT
  id
, i
, regexp_substr(str,'https?://[^\s):"'']+',1, i ) AS url_found
FROM input CROSS JOIN i
WHERE regexp_substr(str,'https?://[^\s):"'']+',1, i ) IS NOT NULL
ORDER BY 1,2;
-- out id | i | url_found
-- out -----+---+-----------------------------------------
-- out 123 | 1 | https://123.com/asda/32/1221
-- out 123 | 2 | http://www.facebook.com/page1
-- out 234 | 1 | http://www.instagram.com/page3/12321213
-- out 234 | 2 | https://example.com/1233/qwerty
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 12.555 ms. All rows formatted: 12.599 ms
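One possible variation, offered as a sketch rather than a tested alternative: instead of cross joining every row with every integer and filtering out the NULLs afterwards, the join itself can be bounded by the number of matches in each note. This assumes Vertica's REGEXP_COUNT function; the CTE name n is illustrative, and the series is written as a plain UNION ALL so it does not depend on the TIMESERIES clause.

```sql
WITH
input(id,str) AS (
            SELECT 123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on'
  UNION ALL SELECT 234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)'
)
-- illustrative integer series; extend it if a note can hold more than 4 URLs
,n(n) AS (
  SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
)
SELECT
  id
, n
, regexp_substr(str, 'https?://[^\s):"'']+', 1, n) AS url_found
FROM input
-- keep only as many integers per row as there are URL matches in that row
JOIN n ON n.n <= regexp_count(str, 'https?://[^\s):"'']+')
ORDER BY 1, 2;
```

Either way, the length of the integer series is still a hard cap on the number of URLs extracted per note, so it should be sized for the longest note you expect.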