I'm using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them consist of a first and a last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". Because of the tokenizer I am using, these tokens are always separated, e.g. "Aureliano" "Buendía". I would like to keep them together as one token so that Stanford NER can tag them together as "PERSON".
The code I wrote:
import nltk
from nltk.tag import StanfordNERTagger

# Read the whole book and split it into whitespace-separated tokens.
sentence1 = open('book1.txt').read()
sentence = sentence1.split()

# Raw strings keep the backslashes in the Windows paths literal.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

def findtags(tagged_text, tag_prefix):
    # Count how often each word occurs under every tag ending with tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags(taggedSentence, 'PERSON'))
The result looks like this:
{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...
Does anybody have a solution? I would be more than grateful.
# Merge consecutive tokens tagged PERSON into a single full name.
import nltk
from nltk.tag import StanfordNERTagger

filename = 'book1.txt'
sentence1 = open(filename).read()
sentence = sentence1.split()
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

test = []
test_dict = {}
for element in range(len(taggedSentence)):
    a = ''
    # Popping shortens the list, so re-check the index before every access.
    while element < len(taggedSentence) and taggedSentence[element][1] == 'PERSON':
        a += taggedSentence[element][0] + ' '
        taggedSentence.pop(element)
    if len(a) > 1:
        test.append(a.strip())

# Key the collected names by the file name without its extension ('book1').
test_dict[filename.split('.')[0]] = tuple(test)
print(test_dict)
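The same merging idea can also be written without mutating the tagged list. The following is only a minimal sketch, not part of the original answer: it assumes the tagger output is a list of (word, tag) pairs as in the code above, the helper name merge_entities is hypothetical, and the sample list is hand-made so the snippet runs without the Stanford jar.

```python
from collections import Counter
from itertools import groupby


def merge_entities(tagged_tokens, label="PERSON"):
    """Join each run of consecutive tokens tagged `label` into one string.

    Hypothetical helper: `tagged_tokens` is assumed to be a list of
    (word, tag) pairs like the one returned by st.tag().
    """
    names = []
    # groupby yields one group per run of adjacent pairs sharing the same tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == label:
            names.append(" ".join(word for word, _ in group))
    return names


# Hand-made sample standing in for the real tagger output (assumption).
sample = [
    ("Aureliano", "PERSON"), ("Buendía", "PERSON"), ("met", "O"),
    ("Úrsula", "PERSON"), ("and", "O"),
    ("Aureliano", "PERSON"), ("Buendía", "PERSON"), ("left", "O"),
]

names = merge_entities(sample)
counts = Counter(names).most_common()
print(counts)  # [('Aureliano Buendía', 2), ('Úrsula', 1)]
```

With the real data you would call merge_entities(taggedSentence) instead of using the sample. Note that this only joins tokens that are adjacent and all tagged PERSON, so a name such as "Santa Sofía de la Piedad" may still come out in pieces if the tagger labels "de" or "la" as something else.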