Repeatedly capture pattern_X, then capture pattern_Y once, then repeat until EOS

Question

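I am reverse-engineering a large ETL pipeline and want to extract the complete data lineage from its stored procedures and views.

I am struggling with the following regular expressions.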

import re

select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"

# this captures every occurrence of the source fields, but not the target
okay_pattern = r"(?i)((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?"

# this captures the target too, but captures only the first input field
wrong_pattern = r"(?i)((((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?AS (`[a-z0-9_]+)`).*?)"

re.findall(okay_pattern, select_clause)
re.findall(wrong_pattern, select_clause)

TL;DR: I want to capture

[aaa, bbb, XXX],
[eee, fff, ..., ooo, YYY],
[ppp, ZZZ]

from a string like this

"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."

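where a, b, e, f, h match one pattern and X, Y, Z match another; the first pattern may occur up to about 20 times in a row, while the second always occurs on its own.

I am also open to solutions using the sqlglot, sql-metadata, or sqlparse libraries; it is just that regular expressions are better documented.

(Perhaps I am code-golfing here and should do this in several steps, starting by splitting the string into individual expressions.)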

Answer 1

You can use this regex, which has 3 capture groups and 1 non-capturing group:

(\w+)\.+(\w+)(?:\.+(\w+))?


Code:

import re

s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....'
# groups 1 and 2 are required; group 3 is optional and is '' when absent
print(re.findall(r'(\w+)\.+(\w+)(?:\.+(\w+))?', s))

Output:

[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]
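The pattern above fits input with exactly two tokens before each closing token. When the number of leading tokens varies, as in the question, one possible alternative is a single pass that collects tokens until a closing token appears. The following is only a minimal sketch, assuming lowercase tokens stand for the first pattern and uppercase tokens for the second; `group_by_target` is a hypothetical helper name, not something provided by `re`.

```python
import re

def group_by_target(text, source_re, target_re):
    """Collect source tokens until a target token appears, then start a new group.

    Hypothetical helper for illustration; source_re and target_re are
    assumed to be regex fragments that never match the same token.
    """
    token_re = re.compile(f"(?P<target>{target_re})|(?P<source>{source_re})")
    groups, current = [], []
    for m in token_re.finditer(text):
        if m.group("target") is not None:
            # a target closes the current group
            groups.append(current + [m.group("target")])
            current = []
        else:
            current.append(m.group("source"))
    # sources left over after the last target are dropped
    return groups

s = "...aaa....bbb...XXX....eee...fff...ggg...ooo...YYY...ppp...ZZZ...."
print(group_by_target(s, r"[a-z]+", r"[A-Z]+"))
```

Each inner list holds the source tokens followed by the target that closed the group, so the group length is not fixed in advance.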

Answer 2

Here are two regexes, one that groups things by the outer pattern and one for the inner pattern:

(.*?)(XXX|YYY|ZZZ)
(aaa|bbb|ccc|ddd|eee|fff|ggg)

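I suggest matching the whole string against the first regex, then applying the second regex to what the (.*?) group captured in each of those matches.

With these two regexes, your matches are grouped first by the outer pattern and then by the inner pattern, and neither regex has to be overly complex.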
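The two-step idea could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested lineage extractor: the input is a shortened copy of the select clause from the question, the inner pattern reuses the core of the question's `okay_pattern`, and the outer pattern assumes every target looks like AS followed by a backtick-quoted name. The names `outer`, `inner`, and `lineage` are made up for this example.

```python
import re

# Shortened copy of the select clause from the question (three expressions).
select_clause = (
    "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,"
    "((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,"
    "(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"
)

# Outer regex: lazily take everything up to the next "AS `target`".
outer = re.compile(r"(?i)(.*?)\bAS\s+`([a-z0-9_]+)`")
# Inner regex: a `table`.`column` pair followed by a delimiter, so that
# function names followed by "(" are skipped (same idea as okay_pattern).
inner = re.compile(r"(?i)(`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,)=<>]")

lineage = {}
for m in outer.finditer(select_clause):
    expression, target = m.group(1), m.group(2)
    # strip the backticks from each source field for readability
    lineage[target] = [src.replace("`", "") for src in inner.findall(expression)]

for target, sources in lineage.items():
    print(target, "<-", sources)
```

A sketch like this would be misled by the text AS followed by a backtick inside a string literal or comment, which is one reason a SQL parser may be the safer choice for a full pipeline.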

Answer 1: score 2, accepted, 2022-08-29 03:28:28
Answer 2: score 0, 2022-08-29 02:22:55
