
Find the next/previous string after a match with a Python regex

I need to find the names of the people mentioned in a text, and I need to filter those names using a list of keywords, for example:

key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff", ...]

For example, in the text:

INPUT: "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO 
and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "

OUTPUT:
(magistrate, DANIEL SMITH)
(officer, MARCO ANTONIO)
(defendant, WILL SMITH)
(plaintiff, MARIA FREEMAN)

So I have two problems: first, how to handle a name that is mentioned before its keyword, and second, how to build a regex that uses all the keywords and filters at the same time.

Here is what I have tried:

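import re

# text is the INPUT string shown above
line = re.split("magistrate", text)[1]
name = []
# collect the uppercase words that directly follow the keyword
for key in line.split():
    if key.isupper():
        name.append(key)
    else:
        break
" ".join(name)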
OUTPUT: 'DANIEL SMITH'

Thank you!

Is it compulsory to use regex? If not, here is my answer, because this can still be done without regex.

You can split the line on whitespace using the split() method. This method returns a list; assign it to a variable and iterate through it. Try this:

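key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]

line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN"
line_words = line.split()

# enumerate gives each word's position directly; list.index() would
# always return the first occurrence of a repeated keyword
for index, word in enumerate(line_words):
    if word in key_words:
        # assumes the name is the two words right after the keyword, so
        # "defendant" (whose name comes before it) prints "of the" here
        print(word, *line_words[index + 1:index + 3])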
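
The loop above always takes exactly two words after each keyword. A possible refinement, shown here only as a sketch, is to keep collecting words for as long as they are all uppercase, which is the same test the question already uses. The function name names_after_keywords is hypothetical, and the sketch assumes that names are written entirely in capitals and that punctuation ends a name.

```python
import string

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]

line = ("The magistrate DANIEL SMITH blalblablal, who was in a meeting with "
        "the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment "
        "filed by the plaintiff MARIA FREEMAN")


def names_after_keywords(text, keywords):
    """Return (keyword, name) pairs; the name is the run of all-uppercase
    words that directly follows the keyword (hypothetical helper)."""
    raw_words = text.split()
    pairs = []
    for i, raw in enumerate(raw_words):
        keyword = raw.strip(string.punctuation)
        if keyword not in keywords:
            continue
        name = []
        for raw_next in raw_words[i + 1:]:
            candidate = raw_next.strip(string.punctuation)
            if not candidate.isupper():
                break  # first non-uppercase word ends the name
            name.append(candidate)
            if raw_next != candidate:
                break  # assumption: attached punctuation ends the name
        if name:
            pairs.append((keyword, " ".join(name)))
    return pairs


print(names_after_keywords(line, key_words))
# [('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('plaintiff', 'MARIA FREEMAN')]
```

Like the loop it extends, this only looks forward from the keyword, so WILL SMITH, whose keyword comes after the name, is still not found.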

I suggest using re.findall with two capture groups, as follows:

import re
key_words = ["magistrate","officer","attorney","applicant","defendant","plaintiff"]
line = "The magistrate DANIEL SMITH blalblablal, who was in a meeting with the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment filed by the plaintiff MARIA FREEMAN "
found = re.findall('('+'|'.join(key_words)+')'+r'\s+([ A-Z]+[A-Z])',line)
print(found)

Output:

[('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'), ('plaintiff', 'MARIA FREEMAN')]

Explanation: using multiple capturing groups (denoted by ( and ) ) in the pattern for re.findall results in a list of tuples (2-tuples in this case). The first group is created by joining the keywords with |, which works like OR in a pattern. Then comes one or more whitespace characters ( \s+ ), which is outside any group and thus does not appear in the result. Finally, the second group consists of one or more spaces or ASCII uppercase letters ( [ A-Z]+ ) followed by a single ASCII uppercase letter ( [A-Z] ), so it does not capture a trailing space. Note that this pattern only finds a name that follows its keyword, which is why ('defendant', 'WILL SMITH') is missing from the output.
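
To also cover the first problem from the question (a name that appears before its keyword), one option is to add a second alternative to the pattern and use named groups to tell the two orders apart. This is a sketch rather than a definitive solution: the group names (kw_first, name_after, name_before, kw_last) are made up for illustration, and it assumes that names are runs of ASCII uppercase words separated by single spaces and that a name preceding its keyword is separated from it by an optional comma and whitespace.

```python
import re

key_words = ["magistrate", "officer", "attorney", "applicant", "defendant", "plaintiff"]
line = ("The magistrate DANIEL SMITH blalblablal, who was in a meeting with "
        "the officer MARCO ANTONIO and WILL SMITH, defendant of the judgment "
        "filed by the plaintiff MARIA FREEMAN ")

# Alternation of the keywords, escaped in case one contains regex metacharacters
kw = "|".join(re.escape(k) for k in key_words)
# One or more uppercase words separated by single spaces (assumed name format)
name = r"[A-Z]+(?: [A-Z]+)*"

pattern = re.compile(
    # keyword first, then the name: "magistrate DANIEL SMITH"
    rf"\b(?P<kw_first>{kw})\s+(?P<name_after>{name})\b"
    # name first, optional comma, then the keyword: "WILL SMITH, defendant"
    rf"|\b(?P<name_before>{name}),?\s+(?P<kw_last>{kw})\b"
)

pairs = []
for m in pattern.finditer(line):
    if m.group("kw_first"):
        pairs.append((m.group("kw_first"), m.group("name_after")))
    else:
        pairs.append((m.group("kw_last"), m.group("name_before")))

print(pairs)
# [('magistrate', 'DANIEL SMITH'), ('officer', 'MARCO ANTONIO'),
#  ('defendant', 'WILL SMITH'), ('plaintiff', 'MARIA FREEMAN')]
```

When a name sits between two keywords, for example "officer JOHN DOE, defendant", the leftmost match wins and the name is attached to the preceding keyword; whether that is the right choice depends on the text being processed.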
