Convert multiple values from column to rows using Regex

I am new to the world of Regex and am trying to extract a pattern from a free-form text column of a table.

There are two things that I am trying to achieve:

1. Extract multiple occurrences of a pattern. The pattern that I am trying to extract is a URL, i.e. a string that starts with http or https.
2. After finding multiple occurrences of the URL, explode them into multiple rows.

Input Table

ip_table (user_id, notes)

(123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on')
(234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)')

Output Table

op_table (user_id, urls)

(123, 'https://123.com/asda/32/1221')
(123, 'http://www.facebook.com/page1')
(234, 'http://www.instagram.com/page3/12321213')
(234, 'https://example.com/1233/qwerty')

Here is what I have so far for the regexp, with no success.

select user_id, regexp_substr(notes, '(https?)://.*[\s]')

Can you please give me some direction on how to find repeated occurrences of a URL pattern? The only thing I need to check is whether something starts with http or https, and then capture every such match in the notes column. Once I find those strings, I would have to explode them into multiple rows with the matching user IDs.

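Try this. Your pattern fails because `.*` is greedy and runs on to the last whitespace character in the note. The query below stops each match at the first whitespace, closing parenthesis, colon or quote instead, and uses the fourth argument of regexp_substr to pick the i-th occurrence, so a cross join with a series of integers yields one row per URL: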

WITH
input(id,str) AS (
          SELECT 123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on'
UNION ALL SELECT 234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)'
)

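-- create a series of 4 integers (1 to 4) in Vertica: TIMESERIES fills the
-- gap between the two timestamps below in 1-microsecond steps
-- keep this code snippet, as you'll need it often
-- outside Vertica, use SELECT 1 UNION ALL SELECT 2 ... up to 4 instead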
,l(l) AS (
          SELECT TIMESTAMPADD(us,  1  , DATE '2000-01-01' )
UNION ALL SELECT TIMESTAMPADD(us,  4  , DATE '2000-01-01' )
)
,i(i) AS (
  SELECT
    MICROSECOND(ts)
  FROM l
  TIMESERIES ts AS '1 us' OVER(ORDER BY l)
)
-- end create series of 4 integers: CTE i, column i.

-- verticalise the series of tokens found - one row per URL
-- CROSS JOIN input with series of integers to extract the i-th occurrence
SELECT
  id
, i
, regexp_substr(str,'https?://[^\s):"'']+',1, i ) AS url_found
FROM input CROSS JOIN i
WHERE regexp_substr(str,'https?://[^\s):"'']+',1, i ) IS NOT NULL
ORDER BY 1,2;
-- out  id  | i |                url_found                
-- out -----+---+-----------------------------------------
-- out  123 | 1 | https://123.com/asda/32/1221
-- out  123 | 2 | http://www.facebook.com/page1
-- out  234 | 1 | http://www.instagram.com/page3/12321213
-- out  234 | 2 | https://example.com/1233/qwerty
-- out (4 rows)
-- out 
-- out Time: First fetch (4 rows): 12.555 ms. All rows formatted: 12.599 ms
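
The comment about working without Vertica can be spelled out as follows. This is only a sketch: it replaces the TIMESERIES-based series with a hand-written list of integers. It assumes that the target database offers a REGEXP_SUBSTR with position and occurrence arguments, that it accepts `\s` inside a character class, and that no note holds more than 4 URLs.

```sql
WITH
input(id, str) AS (
          SELECT 123, 'Here are notes - he owns url https://123.com/asda/32/1221 and http://www.facebook.com/page1 so on'
UNION ALL SELECT 234, 'this one has http://www.instagram.com/page3/12321213 (https://example.com/1233/qwerty)'
)
-- hand-written integer series 1..4 (assumption: at most 4 URLs per note;
-- add more SELECTs if a note can hold more)
,i(i) AS (
          SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
)
-- one row per (id, i) pair for which an i-th URL exists
SELECT
  id
, i
, regexp_substr(str, 'https?://[^\s):"'']+', 1, i) AS url_found
FROM input CROSS JOIN i
WHERE regexp_substr(str, 'https?://[^\s):"'']+', 1, i) IS NOT NULL
ORDER BY 1, 2;
```

The series only caps how many occurrences are tried per row. Pairs beyond the last URL in a note return NULL and are dropped by the WHERE clause, so a series that is too long costs some extra work but does not change the result.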
