Extract names of persons from filepath

Question

I'm trying to extract all the human names from a filepath. My approach is to split the filepath into individual words, then apply NLTK's part-of-speech tagger to identify proper nouns, followed by the ne_chunk function to identify persons.

import nltk
import re


def extract_entities(y):
    #make an empty list to receive results of operation
    AggPeople = []
    #split the filepath by backslashes
    for y in y.split("\\"):
        #separate the product above into words, then attach nltk POS tags (e.g. NNP), then attach more specific nltk entity labels (e.g. PERSON)
        for chunk in nltk.ne_chunk(nltk.pos_tag(re.findall(r"[\w]+", y))) :
            #filter out everything but the person labels
            if hasattr(chunk, 'label') and chunk.label() == "PERSON":
                #bring the results of the above into a list
                AggPeople.append(' '.join(c[0] for c in chunk.leaves()).capitalize())
                #filter out words you don't want
                AggPeople = [x for x in AggPeople if (x not in ['Schedules','Old'])]
    #get rid of duplicate words with 'set'
    return set(AggPeople)

text = r"O:\Country\Province\District\city\Cricket, Jimmy (Y1617F)\Old Schedules\Cricket, Jimmy (78655) Golick doo wop 7 Sept 2016.xlsx"

print(extract_entities(text))

The problem is that the result is 'Jimmy y1617f', and I want it to be 'Jimmy'.

I think nltk.ne_chunk groups words in a way that makes sense for ordinary text, but not for filepaths. To solve the problem, I tried to define my own equivalent of nltk.ne_chunk as follows:

import nltk
import re
from nltk import RegexpParser
def extract_entities(y):
    AggPeople = []
    patterns= r"<NP:{<NNP>+}"
    chunker = RegexpParser(patterns)
    print(chunker)
    for y in y.split("\\"):
        for chunk in chunker(nltk.pos_tag(re.findall(r"[\w]+", y))) :
            if hasattr(chunk, 'label') and chunk.label() == "PERSON":
                AggPeople.append(' '.join(c[0] for c in chunk.leaves()).capitalize())
                AggPeople = [x for x in AggPeople if (x not in ['Schedules','Old'])]
    return set(AggPeople)

This raised an error:

'RegexpParser' object is not callable

Full traceback:

chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NNP>'>
Traceback (most recent call last):

  File "<ipython-input-282-cb323eff63b4>", line 1, in <module>
    runfile('C:/Users//.spyder-py3/ExtractingNames.py', wdir='C:/Users//.spyder-py3')

  File "C:\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users//.spyder-py3/ExtractingNames.py", line 32, in <module>
    print(extract_entities(text))

  File "C:/Users//.spyder-py3/ExtractingNames.py", line 23, in extract_entities
    for chunk in chunker(nltk.pos_tag(re.findall(r"[\w]+", y))) :

TypeError: 'RegexpParser' object is not callable

Answer 1

import re
import nltk

#looks for two proper nouns side-by-side
patterns = r"P:{<NNP>{2}}"
#build the chunker once, at module level
chunker = nltk.RegexpParser(patterns)

def extract_entities(y):
    AggPeople = []
    for y in y.split("\\"):
        #excludes words with digits and schedules
        for chunk in chunker.parse(nltk.pos_tag(re.findall(r"\b(?!Schedules|Old)[^\d\W]+\b", y))) :
            if hasattr(chunk, 'label') and chunk.label() == "P" :
                AggPeople.append(' '.join(c[0] for c in chunk.leaves()).capitalize())
    return set(AggPeople)

text = r"O:\Country\Province\District\city\Cricket, Jimmy (Y1617F)\Old Schedules\Cricket, Jimmy (78655) Golick doo wop 7 Sept 2016.xlsx"

print(extract_entities(text))

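The TypeError occurs because a RegexpParser object cannot be called like a function; call its parse method instead, as in chunker.parse(...). The label you test for also has to match the label defined in the grammar ("P" here), not "PERSON".
You can make your code run faster by creating the chunker once, outside the function, so that it is not rebuilt on every call.
If you're looking for human names, and there are usually two of them (first and last), you can require exactly two consecutive NNPs in the pattern with the {2} quantifier.
You can exclude certain words in the regex with a negative lookahead, and skip words that contain digits with the [^\d\W] character class.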
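
To see what the tokenizing regex keeps before any tagging happens, it can be tried on its own with only the re module. This is a small sketch: the helper name tokenize_path and the shortened sample path are made up for illustration, while the pattern is the one used in the code above.

```python
import re

# Same pattern as in the answer: whole words made only of letters/underscores,
# skipping any word that begins with "Schedules" or "Old".
TOKEN_RE = re.compile(r"\b(?!Schedules|Old)[^\d\W]+\b")


def tokenize_path(path):
    """Split a Windows-style path on backslashes and tokenize each component.

    Hypothetical helper, used here only to inspect the regex output.
    """
    return [TOKEN_RE.findall(part) for part in path.split("\\")]


# Shortened, illustrative path (raw string so the backslashes stay literal).
path = r"O:\Country\Cricket, Jimmy (Y1617F)\Old Schedules\Golick 7 Sept 2016.xlsx"
tokens = tokenize_path(path)

for part, words in zip(path.split("\\"), tokens):
    print(repr(part), "->", words)
```

Note that the lookahead rejects any word that merely starts with an excluded word, so a surname such as "Oldham" would be dropped as well. If that matters, adding \b inside the lookahead, (?!(?:Schedules|Old)\b), restricts the exclusion to exact matches.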
