
Extract the name of a candidate from a text file using Python and NLTK

import re
import spacy
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet

with open('inputfile.txt', 'r') as inputfile:
    String = inputfile.read()
nlp = spacy.load('en_core_web_sm')

def candidate_name_extractor(input_string, nlp):
    input_string = str(input_string)

    doc = nlp(input_string)

    # Extract entities
    doc_entities = doc.ents

    # Subset to person type entities
    doc_persons = filter(lambda x: x.label_ == 'PERSON', doc_entities)
    doc_persons = filter(lambda x: len(x.text.strip().split()) >= 2, doc_persons)
    doc_persons = list(map(lambda x: x.text.strip(), doc_persons))
    print(doc_persons)
    # Assuming that the first PERSON entity with at least two tokens is the candidate's name
    candidate_name = doc_persons[0]
    return candidate_name

if __name__ == '__main__':
    names = candidate_name_extractor(String, nlp)
    print(names)

I want to extract the candidate's name from the text file, but this returns the wrong value. When I remove `list` around `map`, the `map` by itself also doesn't work and gives an error.
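The error after removing `list` is most likely because, in Python 3, `map` and `filter` return lazy iterators rather than lists, so they cannot be indexed with `[0]`. A minimal illustration (the sample strings are made up):

```python
words = ["  Alice Smith ", " Bob Jones "]

stripped = map(str.strip, words)   # a map object, not a list

try:
    first = stripped[0]            # fails: map objects are not subscriptable
except TypeError as e:
    print(e)

stripped = list(map(str.strip, words))  # materialize the iterator first
print(stripped[0])                 # 'Alice Smith'
```

This is why the `list(map(...))` wrapper in `candidate_name_extractor` is needed before `doc_persons[0]` can work.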

import re
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.corpus import wordnet

String = 'Ravana was killed in a war'

Sentences = nltk.sent_tokenize(String)
Tokens = []
for Sent in Sentences:
    Tokens.append(nltk.word_tokenize(Sent)) 
Words_List = [nltk.pos_tag(Token) for Token in Tokens]

Nouns_List = []

for List in Words_List:
    for Word in List:
        if re.match('NN.*', Word[1]):  # noun tags: NN, NNS, NNP, NNPS
            Nouns_List.append(Word[0])

Names = []
for Nouns in Nouns_List:
    if not wordnet.synsets(Nouns):
        Names.append(Nouns)

print (Names)

Check this code. I am getting Ravana as the output.

EDIT:

I used a few sentences from my resume to create a text file and gave it as input to my program. Only the changed portion of the code is shown below:

import io

File = io.open("Documents\\Temp.txt", 'r', encoding = 'utf-8')
String = File.read()
String = re.sub(r'[/.@%\d]', '', String)  # strip slashes, dots, @, %, and digits

And it is returning all the names that are not in the wordnet corpus, like my name, my house name, place, college name and place.

From the word list obtained after part-of-speech tagging, extract all the words having a noun tag using a regular expression:

Nouns_List = []

for Word in nltk.pos_tag(Words_List):
    if re.match('NN.*', Word[1]):  # noun tags: NN, NNS, NNP, NNPS
        Nouns_List.append(Word[0])
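Note that `'[NN.*]'` with square brackets would be a character class matching any tag whose first character is `N`, `.`, or `*`, so the punctuation tag `.` would also slip through; without the brackets, `'NN.*'` matches only tags beginning with `NN`. A quick comparison on a few sample Penn Treebank tags:

```python
import re

tags = ['NN', 'NNS', 'NNP', 'VBD', '.', 'DT']

# Character class: matches if the FIRST character is 'N', '.', or '*'
as_class = [t for t in tags if re.match('[NN.*]', t)]
print(as_class)   # ['NN', 'NNS', 'NNP', '.'] -- punctuation slips through

# Plain pattern: tag must start with 'NN'
as_prefix = [t for t in tags if re.match('NN.*', t)]
print(as_prefix)  # ['NN', 'NNS', 'NNP']
```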

For each word in `Nouns_List`, check whether it is an English word. This can be done by checking whether synsets are available for that word in WordNet:

from nltk.corpus import wordnet

Names = []
for Nouns in Nouns_List:
    if not wordnet.synsets(Nouns):
        #Not an English word
        Names.append(Nouns)

Since Indian names are unlikely to have entries in an English dictionary, this can be a workable way to extract them from text.
