Using regex to create a list of dictionaries with positive lookbehind

I am trying to create a list of dictionaries using a regex with a positive lookbehind. I tried two different versions of the code:

Variation 1

import re

string = '146.204.224.152 - lubo233'

for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?P<user_name>(?<= - )[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Variation 2

string = '146.204.224.152 - lubo233'
for item in re.finditer( "(?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)(?<= - )(?P<user_name>[a-z]*[0-9]*)", string ):
    print(item.groupdict())

Desired Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}

Question/Issue

In both cases, I am unable to eliminate the substring " - ".

Using the positive lookbehind (?<= - ) makes my pattern fail to match.

Can anyone help me identify my mistake? Thanks.

I'd suggest you remove the positive lookbehind and just put the joining characters directly in the pattern, between the two parts.

Also, some improvements:

  • \. instead of [.]

  • [0-9]{,3} instead of [0-9]*

  • (?:\.[0-9]{,3}){3} instead of \.[0-9]{,3}\.[0-9]{,3}\.[0-9]{,3}

Add a .* before the " - " to handle any words that may appear between the host and the separator:

import re

rgx = re.compile(r"(?P<host>[0-9]{,3}(?:\.[0-9]{,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)")

vals = ['146.204.224.152 aw0123 abc - lubo233',
        '146.204.224.152 as003443af - lubo233',
        '146.204.224.152 - lubo233']

for val in vals:
    for item in rgx.finditer(val):
        print(item.groupdict())

# Gives
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
{'host': '146.204.224.152', 'user_name': 'lubo233'}
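Since the goal in the question is a list of dictionaries rather than printed output, the same kind of pattern can feed a list comprehension. This is only a minimal sketch: the names lines and records are illustrative, the third input line is made up to show a non-matching case, and {1,3} is swapped in for {,3} as an assumption, so that every octet must contain at least one digit.

```python
import re

# Pattern from the answer above; {1,3} (an assumption) requires at least one digit per octet.
rgx = re.compile(r"(?P<host>[0-9]{1,3}(?:\.[0-9]{1,3}){3}).* - (?P<user_name>[a-z]*[0-9]*)")

lines = ['146.204.224.152 aw0123 abc - lubo233',
         '146.204.224.152 - lubo233',
         'no address on this line']

# One dictionary per match; lines without a match contribute nothing.
records = [m.groupdict() for line in lines for m in rgx.finditer(line)]
print(records)
```

Each element of records is the groupdict() of one match, so the keys are exactly the named groups host and user_name.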

The positive lookbehind does not work because you are trying to match:

  • (?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*), an IP address,
  • immediately followed by a user name pattern, (?P<user_name>(?<= - )[a-z]*[0-9]*), which must itself be preceded by " - " because of the lookbehind (?<= - )

So once the regex engine has consumed the IP address pattern, you are telling it to match a user name pattern preceded by " - ", but what actually precedes that position is the IP address. In other words, once the IP pattern has been matched, the remaining string is:

- lubo233

The pattern that must match immediately at that position, as with re.match, is:

(?P<user_name>(?<= - )[a-z]*[0-9]*) 

which obviously does not match. To illustrate the point, note that this pattern works:

import re

string = '146.204.224.152 - lubo233'
for item in re.finditer(r"((?P<host>[0-9]*[.][0-9]*[.][0-9]*[.][0-9]*)( - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
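The explanation above can also be checked directly by trying the user name pattern at two fixed positions. The sketch below assumes the sample string from the question and uses the pos argument of a compiled pattern's match() method, which anchors the match at that index while still letting the lookbehind inspect the characters before it. The names user, ip_end and name_start are illustrative.

```python
import re

string = '146.204.224.152 - lubo233'
user = re.compile(r"(?<= - )[a-z]*[0-9]*")

# Index right after the IP address, which is where the original pattern tries the user name.
ip_end = string.index(' ')
print(user.match(string, ip_end))        # no match: the three characters before this index are '152'

# Index right after the ' - ' separator, where the lookbehind is satisfied.
name_start = ip_end + len(' - ')
print(user.match(string, name_start).group())
```

The first call returns None and the second returns the user name, which is why the separator has to be consumed by the pattern before the lookbehind can succeed.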

If you need to match an arbitrary number of characters between the two patterns, you could do:

import re

string = '146.204.224.152 adfadfa - lubo233'
# Match all four octets, then let .* absorb any text before the " - " separator.
for item in re.finditer(r"((?P<host>\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3})(.* - ))(?P<user_name>(?<= - )[a-z]*[0-9]*)", string):
    print(item.groupdict())

Output

{'host': '146.204.224.152', 'user_name': 'lubo233'}
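Once the pattern itself consumes the " - " separator, the lookbehind no longer changes the result, because a lookbehind is a zero-width assertion that only checks text already matched. A small sketch to compare both forms; the variable names are illustrative, and the four-octet host pattern is an assumption about the intended input.

```python
import re

string = '146.204.224.152 adfadfa - lubo233'
host = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"

# Same pattern with and without the lookbehind inside the user_name group.
with_lookbehind = re.compile(host + r".* - (?P<user_name>(?<= - )[a-z]*[0-9]*)")
without_lookbehind = re.compile(host + r".* - (?P<user_name>[a-z]*[0-9]*)")

a = with_lookbehind.search(string).groupdict()
b = without_lookbehind.search(string).groupdict()
print(a == b)
```

If both dictionaries are equal, the simpler pattern without the lookbehind is enough for this input.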
