regex: find all occurrences of a group between two specific words

python version: Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0] on linux
re version: 2.2.1

I want to get all occurrences of a regex group between two specific words.

First, here are the different expressions my program will encounter ( ... means the expression is embedded in a larger text):

  • "...For colts, geldings and fillies of..."
  • "...For horses, geldings and mares of..."
  • "...For colts and geldings of..."
  • "...For colts and fillies of..."
  • "...For horses and geldings of..."
  • "...For colts of..."
  • "...For fillies of..."
  • "...For horses of..."
  • "...For mares of..."

The goal of the program is to get every mention of "colts", "geldings", "fillies", "horses", and "mares" between the words "For" and "of". Concretely, I want 3 groups if there are 3 mentions, 2 groups for 2 mentions, and 1 group for 1 mention.

len(re.search(a_regex_pattern,"...For colts, geldings and fillies of...").groups())
>>> 3 # 3 groups
re.search(a_regex_pattern,"...For colts, geldings and fillies of...").groups()
>>> ['colts','geldings','fillies']

I am stuck on finding the right a_regex_pattern to do this.

I tried this:

a_regex_pattern = r"For.*?(colts|geldings|fillies){1,3}.*?of"
re.search(a_regex_pattern,"...For colts, geldings and fillies of...").groups()
>>> ('colts',)  # a single group, not the three I want

My other attempts were worse. How would you do it?

I'd do it in two steps:

  • In the first step, I search for everything between For ... of
  • In the second step, I extract the words from the result of the first step

import re

tests = [
    "... For colts, geldings and fillies of ...",
    "... For horses, geldings and mares of ...",
    "... For colts and geldings of ...",
    "... For colts and fillies of ...",
    "... For horses and geldings of ...",
    "... For colts of ...",
    "... For fillies of ...",
    "... For horses of ...",
    "... For mares of ...",
]

pat1 = re.compile(r"\bFor\s+(.*?)\s+of\b")
pat2 = re.compile(r",|\band\b")

for t in tests:
    m = pat1.search(t)
    if m:
        print(pat2.sub(" ", m.group(1)).split())

Prints:

['colts', 'geldings', 'fillies']
['horses', 'geldings', 'mares']
['colts', 'geldings']
['colts', 'fillies']
['horses', 'geldings']
['colts']
['fillies']
['horses']
['mares']
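
As a possible variant of the two steps above, the second step could match only the five words named in the question instead of splitting on "," and "and", so that any other word inside the span is ignored. This is a minimal sketch, not the only way to do it: the names WORDS, SPAN, WORD and find_mentions are hypothetical and chosen for illustration. The last statement also illustrates why a single pattern struggles here: a quantified capturing group is still one group and retains only its last repetition, so the number of groups cannot grow with the number of mentions.

```python
import re

# Hypothetical names for illustration; only the word list comes from the question.
WORDS = ("colts", "geldings", "fillies", "horses", "mares")

# Step 1: capture the span between "For" and "of" (same idea as pat1 above).
SPAN = re.compile(r"\bFor\s+(.*?)\s+of\b")
# Step 2: match only the known words inside that span (non-capturing group,
# so findall returns the whole matched word).
WORD = re.compile(r"\b(?:" + "|".join(WORDS) + r")\b")


def find_mentions(text):
    """Return the listed words found between 'For' and 'of', in order."""
    m = SPAN.search(text)
    return WORD.findall(m.group(1)) if m else []


print(find_mentions("...For colts, geldings and fillies of..."))

# A repeated capturing group is still a single group: after three repetitions
# it holds only the last word it matched.
single = re.search(
    r"For\s+(?:(colts|geldings|fillies)\W+(?:and\s+)?)+of",
    "...For colts, geldings and fillies of...",
)
print(single.groups())
```

The length of the returned list then plays the role of the "number of groups" asked for in the question, and a text without a For ... of span gives an empty list.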
