
How to extract words following a list of keywords in Python using regex?

I am trying to extract locations in Python using regex. Right now I am doing this:

def get_location(s):
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    location_pattern = "(?P<location>((?P<place>{keywords}\s[A-Za-z]+)))".format(keywords = keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.MULTILINE | re.UNICODE | re.DOTALL | re.VERBOSE)

    for match in location_regex.finditer(s):
        match_str = match.group(0)
        indices = match.span(0)
        print ("Match", match)
        match_str = match.group(0)
        indices = match.span(0)
        print (match_str)

get_location("Im at building 3")

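I have three questions:

  1. It only gives "at" as output, but it should also give "building".
  2. captures = match.capturesdict() does not let me extract the captures here, although it works in other examples.
  3. When I use location_pattern = 'at|outside\s\w+' it seems to work. Can someone explain what I am doing wrong?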

The main issue here is that you need to place {keywords} inside a non-capturing group: (?:{keywords}). Here is a schematic example: a|b|c\s+\w+ matches either a, or b, or c + <whitespace(s)> + <word chars>. When you put the alternation list into a group, (a|b|c)\s+\w+, it first matches either a, or b, or c, and only then tries to match whitespace and then word chars.
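To make the precedence rule concrete, here is a minimal sketch that compares the ungrouped and grouped forms on the sentence from the question. It uses the standard library re module rather than regex, and the variable names are illustrative only.

```python
import re

text = "Im at building 3"

# Without a group, the alternatives are "at", "outside" and "near\s+[A-Za-z]+".
ungrouped = re.compile(r"at|outside|near\s+[A-Za-z]+", re.IGNORECASE)

# With a non-capturing group, \s+[A-Za-z]+ follows whichever keyword matched.
grouped = re.compile(r"(?:at|outside|near)\s+[A-Za-z]+", re.IGNORECASE)

ungrouped_matches = ungrouped.findall(text)
grouped_matches = grouped.findall(text)

print(ungrouped_matches)  # ['at']
print(grouped_matches)    # ['at building']
```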

See the updated code:

import regex as re  # third-party "regex" module; it provides Match.capturesdict()

def get_location(s):
    STRIP_CHARS = '*'
    s = s.strip(STRIP_CHARS)
    keywords = "at|outside|near"
    # (?:...) groups the alternatives so that \s+[A-Za-z]+ applies to every keyword
    location_pattern = r"(?P<location>((?P<place>(?:{keywords})\s+[A-Za-z]+)))".format(keywords=keywords)
    location_regex = re.compile(location_pattern, re.IGNORECASE | re.UNICODE)

    for match in location_regex.finditer(s):
        match_str = match.group(0)   # full matched text
        indices = match.span(0)      # (start, end) offsets of the match
        print("Match", match)
        print(match_str)
        captures = match.capturesdict()  # every capture of each named group
        print(captures)

get_location("Im at building 3")

Output:

('Match', <regex.Match object; span=(3, 14), match='at building'>)
at building
{'place': ['at building'], 'location': ['at building']}
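If the keywords start out as a Python list rather than a ready-made alternation string, the pattern can be assembled from it. The sketch below is one possible way to do that, not part of the original answer: build_location_regex is a hypothetical helper, and the \b word boundary and the longest-first ordering are added assumptions that stop "at" from matching inside a word such as "that". It uses the standard library re module.

```python
import re

def build_location_regex(keywords):
    # Hypothetical helper (not from the original answer).
    # re.escape guards against keywords that contain regex metacharacters;
    # sorting longest first keeps a short keyword from shadowing a longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # \b keeps "at" from matching inside a longer word such as "that".
    pattern = r"\b(?:{0})\s+(?P<place>[A-Za-z]+)".format(alternatives)
    return re.compile(pattern, re.IGNORECASE)

location_regex = build_location_regex(["at", "outside", "near"])
sentence = "Im at building 3, that park is near home"
places = [m.group("place") for m in location_regex.finditer(sentence)]
print(places)  # ['building', 'home']
```

Here the named group captures only the word after the keyword; move the parentheses if the keyword itself should be part of the capture, as in the answer's pattern.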

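Note that location_pattern = 'at|outside\s\w+' does not work as intended: at on its own matches anywhere it occurs, while only outside is required to be followed by a whitespace character and word characters. You can fix it the same way: (at|outside)\s\w+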

If you put the keywords into a group, captures = match.capturesdict() will work fine (see the output above).
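Note that capturesdict() comes from the third-party regex module; match objects from the standard library re module do not have it. As a rough sketch of the closest standard library equivalent, groupdict() returns one string per named group instead of a list of captures. The pattern below reuses the grouped form, and the variable names are illustrative.

```python
import re

pattern = re.compile(
    r"(?P<location>(?P<place>(?:at|outside|near)\s+[A-Za-z]+))",
    re.IGNORECASE,
)
match = pattern.search("Im at building 3")

# groupdict() maps each named group to the text it captured.
groups = match.groupdict()
print(sorted(groups.items()))
# [('location', 'at building'), ('place', 'at building')]

# The standard library match object has no capturesdict() method.
has_capturesdict = hasattr(match, "capturesdict")
print(has_capturesdict)  # False
```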

