Find next/previous string after match python regex
I need to find the names of people mentioned in a text, and I need to filter all the names using a list of keywords, for example:
key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"...]
For example, in the text:
INPUT: "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO
and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "
OUTPUT:
(magistrate, DANIEL SMITH)
(officer, MARCO ANTONIO)
(defendant, WILL SMITH)
(plaintiff, MARIA FREEMAN)
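So I have two problems: first, a name can be mentioned before its keyword, and second, I don't know how to build a regex that uses all the keywords at once and filters on them.
I have tried a few things: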
line = re.split("magistrate",text)[1]
name = []
for key in line.split():
    if key.isupper():
        name.append(key)
    else:
        break
" ".join(name)
OUTPUT: 'DANIEL SMITH'
Thanks!
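Does it have to be a regex? If not, here is my answer, since this can also be done without one.
You can use the split() method to split the line on spaces. It returns a list; assign that list to a variable and iterate over it. Try this: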
key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]
line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN"
line_words = line.split(" ")
# enumerate gives each word's own position, so a repeated keyword is not
# mapped back to its first occurrence (which list.index would do)
for index, word in enumerate(line_words):
    if word in key_words:
        # print the keyword and the (up to) two words that follow it
        print(word, *line_words[index + 1:index + 3])
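The loop above always takes exactly two words after the keyword, so it prints "defendant of the" for this text and would cut off a three-word name. A minimal sketch of a regex-free variant that instead collects the whole run of uppercase words after each keyword is shown below. The function name names_after_keywords is made up for this illustration, and it assumes that names are written entirely in capitals and that punctuation attached to a word ends the name.

```python
import string

def names_after_keywords(text, key_words):
    """Return (keyword, name) pairs, where name is the run of
    all-uppercase words that directly follows the keyword."""
    tokens = text.split()
    pairs = []
    for i, token in enumerate(tokens):
        # Compare without surrounding punctuation, e.g. "officer," -> "officer"
        if token.strip(string.punctuation) not in key_words:
            continue
        name = []
        for raw in tokens[i + 1:]:
            cleaned = raw.strip(string.punctuation)
            if not cleaned.isupper():
                break  # first non-uppercase word ends the name
            name.append(cleaned)
            if raw != cleaned:
                break  # assumption: attached punctuation ends the name
        if name:
            pairs.append((token.strip(string.punctuation), " ".join(name)))
    return pairs

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]
line = ("The magistrate DANIEL SMITH blalblablal, who was in a meeting with the "
        "officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed "
        "by the plaintiff MARIA FREEMAN")
print(names_after_keywords(line, key_words))
# [('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('plaintiff', 'MARIA FREEMAN')]
```

Like the split() loop, this only looks forward from the keyword, so "WILL SMITH, defendant" is still missed.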
I suggest using re.findall with two capturing groups, as follows:
import re
key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]
line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "
found = re.findall('('+'|'.join(key_words)+')'+r'\s+([ A-Z]+[A-Z])',line)
print(found)
Output:
[('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('plaintiff', 'MARIA FREEMAN')]
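Explanation: using more than one capturing group (denoted by ( and )) in the pattern passed to re.findall makes it return a list of tuples (2-tuples in this case). The first group is built simply by joining the keywords with |, which works like OR inside a pattern. Next comes one or more whitespace characters (\s+); this sits outside any group, so it does not appear in the result. Finally, the second group consists of one or more spaces or ASCII uppercase letters ([ A-Z]+) followed by a single ASCII uppercase letter ([A-Z]), so it does not capture a trailing space.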
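Note that the pattern above only finds names that come after a keyword, so ('defendant', 'WILL SMITH') is missing from its output. One possible way to cover the "name before the keyword" case from the question is a second alternative in the pattern. The sketch below is only that, a sketch: the function name find_roles and the group names are invented for this example, and it assumes names are all-caps words and that a name preceding its keyword is separated from it by an optional comma and whitespace.

```python
import re

def find_roles(text, key_words):
    """Return (keyword, name) pairs for 'keyword NAME' and 'NAME, keyword'."""
    # re.escape guards against keywords containing regex metacharacters
    kw = "|".join(re.escape(k) for k in key_words)
    # One or more all-uppercase words separated by whitespace
    name = r"[A-Z]+(?:\s+[A-Z]+)*"
    pattern = re.compile(
        rf"\b(?:(?P<kw_first>{kw})\s+(?P<name_after>{name})\b"
        rf"|(?P<name_before>{name})\b,?\s+(?P<kw_last>{kw})\b)"
    )
    pairs = []
    for m in pattern.finditer(text):
        if m.group("kw_first"):
            pairs.append((m.group("kw_first"), m.group("name_after")))
        else:
            pairs.append((m.group("kw_last"), m.group("name_before")))
    return pairs

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]
line = ("The magistrate DANIEL SMITH blalblablal, who was in a meeting with the "
        "officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed "
        "by the plaintiff MARIA FREEMAN ")
print(find_roles(line, key_words))
# [('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'),
#  ('defendant', 'WILL SMITH'), ('plaintiff', 'MARIA FREEMAN')]
```

The "keyword NAME" alternative is listed first, so when a name sits between two keywords it is attributed to the keyword in front of it. Whether that is the right tie-break depends on how the real texts are written.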