Tagging first and last names as a single token with Stanford NER and Python

Question

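I am using the Stanford Named Entity Recognizer with Python to find the proper names in the novel "One Hundred Years of Solitude". Many of them consist of a first name and a last name, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". Because of the tokenizer I am using, these are always split into separate tokens, e.g. "Aureliano" "Buendía". I would like to keep them together as a single token so that Stanford NER tags them together as "PERSON".

The code I wrote: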

import nltk
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize
from nltk import FreqDist

sentence1 = open('book1.txt').read()
sentence = sentence1.split()

# Raw strings keep the Windows backslashes literal
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

def findtags(tagged_text, tag_prefix):
    # Group words by tag, keeping only tags that end with tag_prefix
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags(taggedSentence, 'PERSON'))

The result looks like this:

{'PERSON': [('Aureliano', 397), ('José', 294), ('Arcadio', 286), ('Buendía', 251), ...

Does anyone have a solution? I would be very grateful.

Answer 1

from nltk.tag import StanfordNERTagger

filename = 'book1.txt'
sentence1 = open(filename).read()
sentence = sentence1.split()

# Raw strings keep the Windows backslashes literal
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"

st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

# Merge every run of consecutive PERSON tokens into one full name
test = []
current = []
for word, tag in taggedSentence:
    if tag == 'PERSON':
        current.append(word)
    elif current:
        test.append(' '.join(current))
        current = []
if current:  # flush a name that ends the text
    test.append(' '.join(current))

# Key the result by the file name without its extension
# (assumption: this is what the dictionary key was meant to be)
test_dict = {filename.split('.')[0]: tuple(test)}

print(test_dict)
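
The merging step can be tried without the Java tagger. The following is a minimal sketch, assuming the tagger returns a list of (word, tag) pairs as `st.tag()` does. The function name `merge_person_runs` and the sample data are made up for illustration, and the tags in the sample are hand-written, not real tagger output. It groups consecutive tokens by tag with `itertools.groupby`, joins each PERSON run into one name, and counts the full names.

```python
from collections import Counter
from itertools import groupby


def merge_person_runs(tagged_tokens):
    """Join each run of consecutive PERSON tokens into one name string."""
    names = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == 'PERSON':
            names.append(' '.join(word for word, _ in run))
    return names


# Hand-written sample in the (word, tag) format; tags are illustrative only.
sample = [
    ('Aureliano', 'PERSON'), ('Buendía', 'PERSON'), ('met', 'O'),
    ('Santa', 'PERSON'), ('Sofía', 'PERSON'), ('de', 'PERSON'),
    ('la', 'PERSON'), ('Piedad', 'PERSON'), ('in', 'O'),
    ('Macondo', 'LOCATION'), ('.', 'O'),
    ('Aureliano', 'PERSON'), ('Buendía', 'PERSON'), ('left', 'O'), ('.', 'O'),
]

names = merge_person_runs(sample)
name_counts = Counter(names)  # frequency of each full name
print(name_counts.most_common())
```

One limitation of this approach: two different people mentioned back to back with no other token between them would be merged into a single name, so the result is only as good as the tag boundaries the tagger produces.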

Solution 1
0 2019-07-22 08:47:38
