
Find next/previous string after match python regex

I need to find the names of people mentioned in a text and pair each name with the keyword it belongs to, using a list of key_words, for example:

key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"...]

For example, in the text:

INPUT: "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO 
and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "

(magistrate, DANIEL SMITH)
(officer, MARCO ANTONIO)
(defendant, WILL SMITH)
(plaintiff, MARIA FREEMAN)

So I have two problems: first, how to handle a name that is mentioned before its keyword, and second, how to build a regex that uses all the keywords and filters at the same time.

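Here is what I have tried:

import re

# text holds the input string shown above
line = re.split("magistrate", text)[1]  # everything after the first "magistrate"
name = []
for key in line.split():
    if key.isupper():
        name.append(key)
" ".join(name)  # joins every uppercase word after the keyword, not just the first name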

Thank you!

Is it compulsory to use regex? If not, here is my answer, since this can also be done without regex.

You can split the line on whitespace with the split() method. It returns a list; assign that to a variable and iterate over it. Try this:

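key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]

line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN"
line_words = line.split(" ")

# enumerate gives each word's own position; list.index() would only return the first occurrence
for i, word in enumerate(line_words):
    if word in key_words:
        # print the keyword and the two words after it (fewer at the end of the line);
        # this assumes the name follows the keyword, so "defendant" prints "of the"
        print(word, *line_words[i+1:i+3])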
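The loop above always takes exactly two words after the keyword. As a rough sketch of a variant that is still regex-free, the uppercase check from the question can be combined with the split() approach so that a name of any length is collected. The function name names_after_keywords is made up for this illustration, and the sketch assumes names are written in all caps and directly follow the keyword.

```python
import string

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]

def names_after_keywords(line, key_words):
    """Pair each keyword with the run of all-uppercase words that follows it."""
    words = line.split()
    pairs = []
    for i, word in enumerate(words):
        keyword = word.strip(string.punctuation).lower()
        if keyword not in key_words:
            continue
        name = []
        for nxt in words[i + 1:]:
            cleaned = nxt.strip(string.punctuation)
            if not cleaned.isupper():
                break  # first word that is not all caps ends the name
            name.append(cleaned)
            if nxt != cleaned:
                break  # punctuation attached to the word (as in "SMITH,") ends the name
        if name:  # keywords with no name right after them are skipped
            pairs.append((keyword, " ".join(name)))
    return pairs

line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN"
print(names_after_keywords(line, key_words))
```

Like the original loop, this only looks forward, so a name written before its keyword (WILL SMITH, defendant) is not picked up.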

I suggest using re.findall with two capture groups, as follows:

import re
key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]
line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "
found = re.findall('('+'|'.join(key_words)+')'+r'\s+([ A-Z]+[A-Z])',line)
print(found)

Output:

[('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('plaintiff', 'MARIA FREEMAN')]

Explanation: using multiple capturing groups in the pattern for re.findall (denoted by ( and )) results in a list of tuples (2-tuples in this case). The first group is created by joining the keywords with |, which works like OR in a pattern. Then comes one or more whitespace characters (\s+), which is outside any group and thus does not appear in the result. Finally, the second group consists of one or more spaces or ASCII uppercase letters ([ A-Z]+) followed by a single ASCII uppercase letter ([A-Z]), so it does not catch a trailing space. Note that defendant is missing from the result because its name, WILL SMITH, comes before the keyword rather than after it.
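The question also asks about a name that appears before its keyword, as in "WILL SMITH, defendant". One possible way to cover both directions is a pattern with two alternatives, one for keyword-then-name and one for name-then-keyword. This is only a sketch under a few assumptions: names are runs of all-caps words separated by single spaces, a name before its keyword is separated from it by whitespace and an optional comma, and the group names kw1, after, before and kw2 are chosen just for this example.

```python
import re

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]
line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "

# re.escape keeps any special characters in a keyword from being read as regex syntax
keys = "|".join(map(re.escape, key_words))
# a name: one or more all-caps words separated by single spaces
name = r"[A-Z]+(?: [A-Z]+)*"

pattern = re.compile(
    rf"\b(?P<kw1>{keys})\s+(?P<after>{name})\b"      # keyword, then name
    rf"|\b(?P<before>{name}),?\s+(?P<kw2>{keys})\b"  # name, optional comma, then keyword
)

pairs = []
for m in pattern.finditer(line):
    if m.group("kw1"):
        pairs.append((m.group("kw1"), m.group("after")))
    else:
        pairs.append((m.group("kw2"), m.group("before")))

print(pairs)
# [('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('defendant', 'WILL SMITH'), ('plaintiff', 'MARIA FREEMAN')]
```

Matches are found left to right, so a name sitting between two keywords (for example "officer JOHN DOE, defendant") would be attached to the keyword before it. Whether that is the right choice depends on the texts being processed.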

