
Extract names from string with python Regex

I've been trying to extract names from a string, but don't seem to be close to success.

Here is the code:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
regex = re.compile(r'([A-Z][a-z]+(?: [A-Z][a-z]\.)? [A-Z][a-z]+)')
print(regex.findall(string))

This is the output I'm getting:

['Moe Szyslak', 'Timothy Lovejoy', 'Ned Flanders', 'Julius Hibbert']

Fancy regexes take time to compose and are difficult to maintain. In this case, I'd tend to keep it simple:

re.findall(r"[^()0-9-]+", string)

output:

['Moe Szyslak', ' ', 'Burns, C. Montgomery', ' ', 'Rev. Timothy Lovejoy', ' ', 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

If the blanks are an issue, I'd filter them out with list(filter(str.strip, names)), where names is the list returned above.
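
A minimal sketch of this simple approach with the filtering applied, using a list comprehension in place of filter so each name can also be trimmed. The names chunks and names are only illustrative:

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Grab every run of characters that is not a digit, parenthesis, or hyphen.
chunks = re.findall(r"[^()0-9-]+", string)

# Drop whitespace-only chunks and trim surrounding whitespace from the rest.
names = [chunk.strip() for chunk in chunks if chunk.strip()]
print(names)
# ['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy',
#  'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']
```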

Extracting human names, even in English, is notoriously hard. The following regex solves your particular problem but may fail on other inputs (e.g., it does not capture names with dashes):

re.findall(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+", string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Julius Hibbert']
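
To illustrate the limitation with dashes, here is a small sketch that runs the same pattern on a made-up hyphenated name. The sample string is hypothetical and not part of the original data:

```python
import re

PATTERN = r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"

# Hypothetical input: the hyphen is not covered by the pattern,
# so the first half of the given name is lost.
sample = "Mary-Jane Watson"
partial = re.findall(PATTERN, sample)
print(partial)
# ['Jane Watson']
```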

And with titles:

TITLE = r"(?:[A-Z][a-z]*\.\s*)?"
NAME1 = r"[A-Z][a-z]+,?\s+"
MIDDLE_I = r"(?:[A-Z][a-z]*\.?\s*)?"
NAME2 = r"[A-Z][a-z]+"

re.findall(TITLE + NAME1 + MIDDLE_I + NAME2, string)
#['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 
# 'Ned Flanders', 'Simpson, Homer', 'Dr. Julius Hibbert']

As a side note, there is no need to compile a regex unless you plan to reuse it.
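
For illustration, a sketch of the case where compiling does pay off: one pattern applied to several strings. The records list is made up for the example:

```python
import re

# Compile once, then reuse the compiled pattern for every record.
NAME_RE = re.compile(r"[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+")

# Hypothetical input: one phone-book entry per string.
records = ["555-1239Moe Szyslak", "636-555-3226Simpson, Homer"]

found = [NAME_RE.findall(record) for record in records]
print(found)
# [['Moe Szyslak'], ['Simpson, Homer']]
```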

Here is one approach using zero-width lookarounds to isolate each name:

import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
result = re.findall(r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))', string)

print(result)

['Moe Szyslak', 'Burns, C. Montgomery', 'Rev. Timothy Lovejoy', 'Ned Flanders',
 'Simpson, Homer', 'Dr. Julius Hibbert']

The actual pattern matched is this:

[A-Za-z.,]+(?: [A-Za-z.,]+)*

This says to match one or more uppercase or lowercase letters, commas, or periods, followed by zero or more groups, each made up of a single space and one or more of those same characters.

In addition, we use the following lookarounds on the left and right of this pattern:

(?:(?<=^)|(?<=[^A-Za-z.,]))
Look behind and assert either the start of the string or a character that cannot be part of a name
(?:(?=[^A-Za-z.,])|(?=$))
Look ahead and assert either the end of the string or a character that cannot be part of a name
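
As a sketch, each alternation can also be written as a single negative lookaround, since "start of string or a non-name character behind" means the same as "no name character behind". This compact form is a rewrite for illustration, not part of the answer above, and the variable names are assumptions:

```python
import re

string = "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

# Pattern from the answer, with explicit start/end alternatives.
original = r'(?:(?<=^)|(?<=[^A-Za-z.,]))[A-Za-z.,]+(?: [A-Za-z.,]+)*(?:(?=[^A-Za-z.,])|(?=$))'

# Same idea with negative lookarounds: no name character directly
# before the match and none directly after it.
compact = r'(?<![A-Za-z.,])[A-Za-z.,]+(?: [A-Za-z.,]+)*(?![A-Za-z.,])'

result = re.findall(compact, string)
same = result == re.findall(original, string)
print(result)
print(same)
```

On this input both patterns return the same six names.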

I extract entities such as names with spaCy in no time. With spaCy you can rely on pretrained language models, which have extensive knowledge of common names and titles.

  1. Step: set up spaCy and load the pretrained English language model (download it first with python -m spacy download en_core_web_sm)
    import spacy
    import en_core_web_sm
    nlp = en_core_web_sm.load()

  2. Step: create a spaCy document
    doc = nlp('555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555 -6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert')

  3. Step: get all entities in the document that are labelled as a person
    print([(X.text, X.label_) for X in doc.ents if X.label_ == 'PERSON'])
