
Impala / Hive - Extract text from a comma-delimited string where each element doesn't match a pattern

Is there a way in Hive or Impala to extract a string from a delimited string, but only where the string I want doesn't match one or more patterns?

For instance, I have a field with IPs (the number varies depending on the network adapters):

169.254.182.175,192.168.0.1,10.199.44.111

I would like to extract the IP that doesn't start with 169.254. (there could be many of these) and doesn't equal 192.168.0.1.

The IPs can be in any order as well.

I tried using substr with nested CASE expressions, but because the number of IPs in the string is unknown, it didn't work out.

Could this be accomplished with regexp_extract or something similar?

Thanks,

You can use regexp_replace with capturing groups for the patterns you do not want to keep, and reference only the group of interest in the replacement string.

See the example below in Impala (impalad version 3.4.0):

    select
        addr_list,
        /* concat is used only to make the pattern easier to read */
        rtrim(ltrim(regexp_replace(
            addr_list,
            concat(
                /* Group 1: 169.254.*.* addresses, to be excluded */
                '(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
                /* Group 2: 192.168.0.1, to be excluded */
                '(192\\.168\\.0\\.1)', '|',
                /* Group 3: any other IP, the one to keep */
                '(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
            ),
            /* Keep only group 3; matches of groups 1 and 2 become empty strings */
            '\\3'
        ), ','), ',') as ip_whitelist
    from (values
        ('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
        ('10.58.3.142,169.254.2.12'),
        ('192.168.0.1,192.100.0.2,154.16.171.3')
    ) as t

| addr_list | ip_whitelist |
| --- | --- |
| 169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111 |
| 10.58.3.142,169.254.2.12 | 10.58.3.142 |
| 192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3 |
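To check the replace-and-trim idea outside a cluster, here is a minimal Python sketch of the same three-group alternation. It uses Python's `re` module rather than Impala's regex engine, and the function name `ip_whitelist` is a hypothetical helper chosen to match the column alias, so treat it as an illustration of the technique rather than a statement about how Impala implements it.

```python
import re

# Same three-group alternation as the Impala query (Python regex syntax).
PATTERN = re.compile(
    r"(169\.254\.\d{1,3}\.\d{1,3})"            # group 1: 169.254.*.* to drop
    r"|(192\.168\.0\.1)"                       # group 2: the one address to drop
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"   # group 3: any other IP, kept
)


def ip_whitelist(addr_list: str) -> str:
    """Hypothetical helper: replace each match with its group 3, then trim commas."""
    # A group that did not take part in a match expands to an empty
    # string in the replacement (Python 3.5 and later).
    replaced = PATTERN.sub(r"\3", addr_list)
    # Equivalent of rtrim(ltrim(..., ','), ',') in the query.
    return replaced.strip(",")


if __name__ == "__main__":
    for row in (
        "169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2",
        "10.58.3.142,169.254.2.12",
        "192.168.0.1,192.100.0.2,154.16.171.3",
    ):
        print(row, "->", ip_whitelist(row))
```

One thing this makes visible: trimming only cleans the ends of the string. If an excluded address sits between two kept ones, for example `10.1.1.1,169.254.1.1,10.2.2.2`, the result contains two consecutive commas, so an extra replace of repeated commas may be needed for such rows.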

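regexp_extract behaves differently: the same regex with 3 as the return group returns nothing at all for rows 1 and 3. I did not find the reason documented, but the output is consistent with regexp_extract looking only at the first match in the string. In rows 1 and 3 the first match is an excluded address (group 1 or group 2), so group 3 is empty.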

    select
        t.addr_list,
        rtrim(ltrim(regexp_replace(addr_list, r.regex, '\\3'), ','), ',') as ip_whitelist,
        regexp_extract(addr_list, r.regex, 3) as ip_wl_extract
    from (values
        ('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
        ('10.58.3.142,169.254.2.12'),
        ('192.168.0.1,192.100.0.2,154.16.171.3')
    ) as t
    cross join (
        /* Same three-group pattern as above, built once and reused */
        select concat(
            '(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
            '(192\\.168\\.0\\.1)', '|',
            '(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
        ) as regex
    ) as r

| addr_list | ip_whitelist | ip_wl_extract |
| --- | --- | --- |
| 169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111 | |
| 10.58.3.142,169.254.2.12 | 10.58.3.142 | 10.58.3.142 |
| 192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3 | |
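One plausible explanation for the empty ip_wl_extract values is first-match semantics: an extract call reports groups from the first match only, while a replace call visits every match. The Python sketch below reproduces that pattern of results with `re.search`. It assumes Impala's regexp_extract behaves the same way, which the table suggests but this sketch does not prove, and `extract_group3_first_match` is a hypothetical name.

```python
import re

# Same three-group alternation as the Impala query (Python regex syntax).
PATTERN = re.compile(
    r"(169\.254\.\d{1,3}\.\d{1,3})"            # group 1: 169.254.*.* to drop
    r"|(192\.168\.0\.1)"                       # group 2: the one address to drop
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"   # group 3: any other IP, kept
)


def extract_group3_first_match(addr_list: str) -> str:
    """Hypothetical helper: return group 3 of the FIRST match, or '' if it is unset."""
    match = PATTERN.search(addr_list)  # search stops at the first match
    if match is None or match.group(3) is None:
        # The first match was produced by group 1 or 2 (or nothing matched),
        # so group 3 did not participate.
        return ""
    return match.group(3)


if __name__ == "__main__":
    for row in (
        "169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2",
        "10.58.3.142,169.254.2.12",
        "192.168.0.1,192.100.0.2,154.16.171.3",
    ):
        print(row, "->", repr(extract_group3_first_match(row)))
```

Only the second row starts with an address that falls through to group 3, which is why it is the only row where the extract and replace columns agree.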
