Python | Unable to extract the list of names from the text

I am running the code below to extract the list of names from text1. The text1 variable holds the merged text of several PDFs. However, the code returns only one name out of the whole input. I tried changing the pattern, but that did not help.

Code:

import spacy
from spacy.matcher import Matcher

# load pre-trained model
nlp = spacy.load('en_core_web_sm')

# initialize matcher with a vocab
matcher = Matcher(nlp.vocab)

def extract_name(resume_text):
    nlp_text = nlp(resume_text)
    #print(nlp_text)
    
    # First name and Last name are always Proper Nouns
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    
    #matcher.add('NAME', None, [pattern])
    matcher.add('NAME', [pattern], on_match=None)
    
    matches = matcher(nlp_text)
    
    for match_id, start, end in matches:
        span = nlp_text[start:end]
        #print(span)
        return span.text

Execution: extract_name(text1)
Output: 'VIKRAM RATHOD'

Expected output: a list of all the names in text1

For your questions:

Adding the matcher declaration:

self._nlp = spacy.load("en_core_web_lg") 
self._matcher = Matcher(self._nlp.vocab)  

As a general best practice, remove all punctuation before matching:

import string

# Replace every punctuation character with a space
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
sentence = sentence.translate(table).strip()
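As a minimal, self-contained sketch of the punctuation step, the snippet below wraps the same `str.maketrans` / `translate` calls in a helper. The function name `strip_punctuation` and the sample sentence are made up for illustration; they are not part of the original code.

```python
import string


def strip_punctuation(sentence: str) -> str:
    """Replace each ASCII punctuation character with a space, then trim the ends."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return sentence.translate(table).strip()


# Hypothetical sample input, not taken from the original text1.
cleaned = strip_punctuation("Name: VIKRAM RATHOD, Email: vikram@example.com!")
print(cleaned.split())
```

Note that this also splits e-mail addresses and hyphenated names into separate words, so it is only suitable when those are not needed later.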

To catch a middle name as well, use this pattern:

pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN',"OP": "*"},{'POS': 'PROPN'}]

Now loop over all the matches and add them to a dict:

   New_list_of_matches={}
   for match_id, start, end in matches:
        string_id = ((self.NlpObj)._nlp.vocab).strings[match_id]  # Get string representation
        span=str((self.NlpObj)._doc[start:end]).split()           
        if string_id in New_list_of_matches:   
            if len(span)>New_list_of_matches[string_id]['lenofSpan']:
                New_list_of_matches[string_id]={'span':span,'lenofSpan':len(span)}
        else:
            New_list_of_matches[string_id]={'span':span,'lenofSpan':len(span)}

It is important to keep the length of each span, so that you can tell two-word names apart from three-word names (those with a middle name).

Now:
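To see why the span length matters, here is a small sketch that runs the same keep-the-longest logic on hard-coded data. The token list and the `(name, start, end)` tuples are assumed stand-ins for what the Matcher would plausibly yield for a three-word name with the pattern above; no spaCy call is made here.

```python
# Hypothetical stand-ins for a tokenized text and its overlapping matches:
# (pattern name, start token index, end token index).
tokens = ["John", "Michael", "Smith", "lives", "here"]
matches = [("NAME", 0, 2), ("NAME", 0, 3), ("NAME", 1, 3)]

longest = {}
for string_id, start, end in matches:
    span = tokens[start:end]
    # Store the span if the key is new, or if it has more words than the stored one.
    if string_id not in longest or len(span) > len(longest[string_id]):
        longest[string_id] = span

name = " ".join(longest["NAME"])
print(name)
```

Without the length comparison, whichever match happened to come last would win, and the middle name or the first name could be lost.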

for key, item in New_list_of_matches.items():
    if key == 'NAME':
        if len(item['span']) == 2:
            # First and last name
            Name = item['span'][0] + ' ' + item['span'][1]
        elif len(item['span']) == 3:
            # First, middle, and last name
            Name = item['span'][0] + ' ' + item['span'][1] + ' ' + item['span'][2]
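Two things are worth noting about the question's own function: it returns inside the `for` loop, so it stops at the first match, and it re-adds the pattern on every call. A possible way to get every name as a list is sketched below, assuming spaCy v3. The function name `extract_names` and the hand-tagged sample `Doc` are illustrative assumptions; with real data you would pass `nlp(text1)` from a loaded model instead.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")  # vocab only; this sketch does not need a trained model
matcher = Matcher(nlp.vocab)

# Two or more consecutive proper nouns.
pattern = [{"POS": "PROPN"}, {"POS": "PROPN", "OP": "*"}, {"POS": "PROPN"}]
# Register the pattern once; greedy="LONGEST" drops shorter overlapping matches.
matcher.add("NAME", [pattern], greedy="LONGEST")


def extract_names(doc):
    """Return every matched name in document order, not just the first one."""
    matches = sorted(matcher(doc), key=lambda m: m[1])
    return [doc[start:end].text for _, start, end in matches]


# Hand-tagged Doc standing in for nlp(text1); words and tags are made up.
words = ["VIKRAM", "RATHOD", "met", "Jane", "Mary", "Doe", "today"]
pos = ["PROPN", "PROPN", "VERB", "PROPN", "PROPN", "PROPN", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

names = extract_names(doc)
print(names)
```

Because the pattern relies only on part-of-speech tags, it will also pick up other runs of proper nouns (company or place names), so the result may need filtering on real resumes.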
