How can I use regexp_count with regexp_substr to output multiple matches per string in SQL (Redshift)?

Question

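I have a table containing a column of strings. I want to extract every piece of text that immediately follows a certain substring in each string. For this minimal reproducible example, let's assume the substring is abc. So I want all the terms that come right after abc.

I can do this when there is only one abc per row, but my logic fails when there are several. I can also get the number of occurrences of the substring, but I can't work out how to use that count to retrieve every occurrence.

My approach/attempt:

I created a temp table that holds the number of successful regex matches in my main string: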

CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S+)', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0);

My output:

id   main_txt                         cnt
1    wpfwe abc weiofnew abc wieone    2
2    abc weoin                        1
3    abc weoifn abc we abc w          3

To get my final output, I have a query like this:

SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, cnt, 'e') AS output
FROM match_count;

My actual final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          w

My expected final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    weiofnew
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          weoifn
3    abc weoifn abc we abc w          we
3    abc weoifn abc we abc w          w

So my code only returns the last match (the one where occurrence # = cnt). How can I modify it to include every match?

Answer 1

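One way to solve this is to use a recursive CTE to generate a list of match numbers for each string (so a string with 2 matches produces rows containing 1 and 2), then join those back to the main table and pass them as the occurrence parameter of regexp_substr: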

WITH RECURSIVE match_counts(id, match_count) AS (
  SELECT DISTINCT id, regexp_count(main_txt, 'abc (\\S+)', 1)
  FROM my_data_source
  WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0
),
match_nums(id, match_num, match_count) AS (
  SELECT id, 1, match_count
  FROM match_counts
  UNION ALL
  SELECT id, match_num + 1, match_count
  FROM match_nums
  WHERE match_num < match_count
)
SELECT m.id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, match_num, 'e') AS output
FROM my_data_source m
JOIN match_nums n ON m.id = n.id
ORDER BY m.id, n.match_num

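Unfortunately I don't have access to Redshift to test this, but I have tested it on Oracle (which has similar regex functions) and it works there: Oracle demo on dbfiddle. Note that Oracle doesn't support the e parameter of regexp_substr, so it returns the whole match rather than the group. (Edit: it has been confirmed to work on Redshift as well, thanks to @HaleemurAli.)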
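As a quick way to see what the join feeds into regexp_substr, the sketch below calls the function with occurrence values 1 to 3 on the literal string from row 3 of the question. It assumes the same pattern and e option used above; the column aliases are made up for illustration, and the query has not been run here.

```sql
-- Sketch: the 4th argument (occurrence) selects which match is returned;
-- 'e' returns the capture group instead of the whole match.
SELECT
  regexp_substr('abc weoifn abc we abc w', 'abc (\\S+)', 1, 1, 'e') AS match_1
, regexp_substr('abc weoifn abc we abc w', 'abc (\\S+)', 1, 2, 'e') AS match_2
, regexp_substr('abc weoifn abc we abc w', 'abc (\\S+)', 1, 3, 'e') AS match_3;
```

If the function behaves as in the question's output, the three columns should hold weoifn, we and w. The recursive CTE produces those same occurrence numbers as separate rows, so each match lands on its own output row.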

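Note that if the delimiter abc could legitimately occur at the end of a word, you should add a word boundary to the start of the regex (i.e. \\babc (\\S+)) to prevent it from matching (for example) deabc.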
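A minimal sketch of the word-boundary variant, assuming Redshift's \\b operator and a made-up input string in which abc also ends the word deabc:

```sql
-- Hypothetical input: 'deabc' ends in abc but is not the standalone delimiter.
SELECT
  regexp_substr('deabc first abc second', 'abc (\\S+)',    1, 1, 'e') AS without_boundary
, regexp_substr('deabc first abc second', '\\babc (\\S+)', 1, 1, 'e') AS with_boundary;
```

The first column would be expected to pick up the word after deabc, while the second should skip it and return the word after the standalone abc. The \\b prefix would need to go into every regexp_count and regexp_substr call in the query, so that the counts and the occurrence numbers stay aligned.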

Answer 2

The solutions below do not handle the case where abc occurs consecutively in main_text.

For example:

wpfwe abc abc abc weiofnew abc wieone

Setup

CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES 
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');

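Possible solution by splitting the string into words

Assuming you are searching for every word that follows the word abc in a string, you don't necessarily have to use regex. Unfortunately, regex support in Redshift is not as full-featured as in Postgres or some other databases. For example, you can't extract all substrings matching a regex pattern into an array, or split a string into an array based on a regex pattern.

Steps:

1. Split the text into an array using the delimiter ' '
2. Unnest the array with ordinality
3. Use LAG to find the previous array element, ordered by word index
4. Filter for rows where the previous word is abc

The extra columns idx and prev_word are left in the final output to illustrate how the problem is solved. They can be dropped from the final query without any issue.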

WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx 
  ON TRUE
-- ^^ array unnesting happens via joins

)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';

Output:

 id |           main_text           | idx | text_word | prev_word
----+-------------------------------+-----+-----------+-----------
  1 | wpfwe abc weiofnew abc wieone |   2 | weiofnew  | abc
  1 | wpfwe abc weiofnew abc wieone |   4 | wieone    | abc
  2 | abc weoin                     |   1 | weoin     | abc
  3 | abc weoifn abc we abc w       |   1 | weoifn    | abc
  3 | abc weoifn abc we abc w       |   3 | we        | abc
  3 | abc weoifn abc we abc w       |   5 | w         | abc
(6 rows)

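Notes on unnesting an array with ordinality

Quoting the Redshift documentation on this topic, since it is rather hidden:

Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.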
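To see the AT clause in isolation, here is a small sketch that unnests a literal array, using the same join form as the query above. The CTE name demo, the column names and the input string are made up for illustration, and the query has not been run here.

```sql
-- Sketch: unnest a literal array and expose each element's position via AT.
WITH demo AS (
  SELECT SPLIT_TO_ARRAY('abc weoifn abc we', ' ') AS words  -- hypothetical input
)
SELECT CAST(w AS VARCHAR) AS word  -- SUPER element converted to text
     , idx                         -- array index generated by AT
FROM demo d
JOIN d.words w AT idx
  ON TRUE;
```

Judging by the output above, where weoin in 'abc weoin' has idx 1, the index starts at 0, so the first abc here would be expected at idx 0.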

Alternative, shorter solution by splitting on `abc`

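This problem is easier to solve with the regex functionality available in Redshift once a source row such as

1, wpfwe abc weiofnew abc wieone

has been split into multiple rows on abc: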

1, wpfwe
1, abc weiofnew
1, abc wieone

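This solution first expands the source data by splitting on abc. However, since split_to_array does not accept a regex pattern, we first inject a delimiter ; before each abc, then split on ;.

Any delimiter will work, as long as it is guaranteed not to occur in the column main_text.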

WITH text_array AS (
SELECT
  id
, main_text
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
FROM test_hal_unnest
)
SELECT
  ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S+)', 1, 1, 'e') output
FROM text_array ta
JOIN ta.array st ON TRUE
WHERE st LIKE 'abc%';
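
To inspect the two intermediate values that this query relies on, a sketch like the following can be run against a single literal row. The aliases injected and parts are made up for illustration, and the query has not been run here.

```sql
-- Sketch: show the delimiter injection and the resulting array for one row.
SELECT
  REGEXP_REPLACE('wpfwe abc weiofnew abc wieone', 'abc ', ';abc ') AS injected
, SPLIT_TO_ARRAY(
    REGEXP_REPLACE('wpfwe abc weiofnew abc wieone', 'abc ', ';abc '), ';'
  ) AS parts;
```

The injected string should read 'wpfwe ;abc weiofnew ;abc wieone', so parts would contain one leading element without abc followed by one element per abc. That leading element is what the LIKE 'abc%' filter in the main query discards.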

Answer 1: score 3, posted 2022-03-19 07:16:44
Answer 2: score 2, accepted, posted 2022-03-19 11:39:48
