Named entity recognition for Dutch with NLTK

Question

I am trying to extract named entities from Dutch text. I used nltk-trainer to train a tagger and a chunker on the conll2002 Dutch corpus. However, the chunker's parse method is not detecting any named entities. Here is my code:

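import nltk

# Named `sentence` rather than `str` to avoid shadowing the built-in type.
sentence = 'Christiane heeft een lam.'

# Load the tagger and chunker that were trained with nltk-trainer.
tagger = nltk.data.load('taggers/dutch.pickle')
chunker = nltk.data.load('chunkers/dutch.pickle')

str_tags = tagger.tag(nltk.word_tokenize(sentence))
print(str_tags)

str_chunks = chunker.parse(str_tags)
print(str_chunks)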

And the output of this program:

[('Christiane', u'N'), ('heeft', u'V'), ('een', u'Art'), ('lam', u'Adj'), ('.', u'Punc')]
(S Christiane/N heeft/V een/Art lam/Adj ./Punc)

I was expecting Christiane to be detected as a named entity. Any help?

Answer 1

The conll2002 corpus contains both Spanish and Dutch text, so make sure to use the fileids parameter, as in python train_chunker.py conll2002 --fileids ned.train . Training on both Spanish and Dutch together will give poor results.

The default algorithm is a tagger-based chunker, which does not work well on conll2002. Instead, use a classifier-based chunker such as NaiveBayes, so the full command might look like this (and I've confirmed that the resulting chunker does recognize "Christiane" as a "PER"):

python train_chunker.py conll2002 --fileids ned.train --classifier NaiveBayes --filename ~/nltk_data/chunkers/conll2002_ned_NaiveBayes.pickle
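Once the chunker is retrained, the entities can be pulled out of the parse result. The following is only a rough sketch, not part of nltk-trainer: the helper names entities_from_iob and extract_entities are made up for illustration, and it assumes the chunker's parse method returns an NLTK tree that nltk.chunk.tree2conlltags can flatten into (word, pos, iob) triples.

```python
def entities_from_iob(triples):
    """Group (word, pos, iob) triples into (label, text) entity pairs.

    Hypothetical helper: expects IOB tags such as 'B-PER', 'I-PER', 'O'.
    """
    entities = []
    current_label, current_words = None, []
    for word, _pos, iob in triples:
        prefix, _, label = iob.partition("-")
        # A 'B' tag, or an 'I' tag with a different label, starts a new entity.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
        if prefix in ("B", "I"):
            current_label = label
            current_words.append(word)
    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities


def extract_entities(sentence, tagger, chunker):
    """Tag and chunk a sentence, then return its named entities.

    Hypothetical wrapper; `tagger` and `chunker` are the loaded pickles.
    """
    import nltk  # imported here so entities_from_iob works without NLTK
    from nltk.chunk import tree2conlltags

    tagged = tagger.tag(nltk.word_tokenize(sentence))
    return entities_from_iob(tree2conlltags(chunker.parse(tagged)))
```

With the pickle saved by the command above, the chunker would presumably be loaded with nltk.data.load('chunkers/conll2002_ned_NaiveBayes.pickle') and passed to extract_entities together with the Dutch tagger and the sentence from the question.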

Answer 1 (7, accepted, 2012-07-06 01:43:18)
