First and last name tagged as one token using Stanford NER with Python

I'm using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them consist of a first and a last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". Because of the tokenizer I am using, these tokens always come out separated, e.g. "Aureliano", "Buendía". I would like to keep them together as one token, so that Stanford NER can tag them together as "PERSON".
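For illustration, if the full names were already known, NLTK's MWETokenizer could re-join them into single tokens before tagging. This is only a sketch of that idea: the known_names list below is invented, and there is no guarantee that the NER model will still label a space-joined token as "PERSON".

from nltk.tokenize import MWETokenizer

# Hypothetical list of known multi-word names; in practice this list would have to be built first.
known_names = [('Aureliano', 'Buendía'), ('Santa', 'Sofía', 'de', 'la', 'Piedad')]

# Re-join the listed sequences into single space-separated tokens.
mwe_tokenizer = MWETokenizer(known_names, separator=' ')
tokens = mwe_tokenizer.tokenize('Aureliano Buendía sat under the chestnut tree'.split())
print(tokens)
# ['Aureliano Buendía', 'sat', 'under', 'the', 'chestnut', 'tree']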

The code I wrote:

import nltk
from nltk.tag import StanfordNERTagger

# Read the whole book and split it into whitespace-separated tokens.
sentence1 = open('book1.txt').read()
sentence = sentence1.split()

# Raw strings so the backslashes in the Windows paths are not treated as escape sequences.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

def findtags(tagged_text, tag_prefix):
    # Count how often each word appears under each tag that matches the prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags(taggedSentence, 'PERSON'))

The result looks like this:

{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...

Does anybody have a solution? I would be more than grateful.

The following code groups consecutive PERSON tokens into a single name after tagging:

import nltk
from nltk.tag import StanfordNERTagger

filename = 'book1.txt'
sentence1 = open(filename).read()
sentence = sentence1.split()

path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

test = []
test_dict = {}

# Walk through the tagged tokens and join each run of consecutive PERSON tokens
# into one full name.
element = 0
while element < len(taggedSentence):
    a = ''
    while element < len(taggedSentence) and taggedSentence[element][1] == 'PERSON':
        a += taggedSentence[element][0] + ' '
        element += 1
    if len(a) > 1:
        test.append(a.strip())
    element += 1

# Key the collected names by the file name without its extension, e.g. 'book1'.
test_dict[filename.split('.')[0]] = tuple(test)

print(test_dict)
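As a side note, the same grouping can be written more compactly with itertools.groupby, keyed on the NER tag. This is only a sketch and assumes taggedSentence is the (word, tag) list produced above.

from itertools import groupby

# Each run of consecutive tokens sharing the PERSON tag becomes one full name.
names = [' '.join(word for word, _ in group)
         for tag, group in groupby(taggedSentence, key=lambda pair: pair[1])
         if tag == 'PERSON']
print(names)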
