
Review of Named Entity Recognition with NLTK

I am trying to build a named entity recognition (NER) application using a part-of-speech (PoS) tagging approach. I am using Python's NLTK library and training a tagger with hmm_tagger = nltk.HiddenMarkovModelTagger.train(train_set). The training set is in the same format as the Brown corpus's tagged_sents(). For PoS tagging it looks like this:

>>> import nltk
>>> brown_a = nltk.corpus.brown.tagged_sents()[:2]
>>> brown_a
[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and', u'CC'), (u'thanks', u'NNS'), (u'of', u'IN'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'of', u'IN-TL'), (u'Atlanta', u'NP-TL'), (u"''", u"''"), (u'for', u'IN'), (u'the', u'AT'), (u'manner', u'NN'), (u'in', u'IN'), (u'which', u'WDT'), (u'the', u'AT'), (u'election', u'NN'), (u'was', u'BEDZ'), (u'conducted', u'VBN'), (u'.', u'.')]]

(The size of brown_a can be increased; two sentences are shown only as an example.)

To build the NER, I change the data above to:

[[(u'The', u'NameP'), (u'Fulton', u'Name'), (u'County', u'NameC'), (u'Grand', u'NameCC'), (u'Jury', u'NameCCC'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NA'), (u'of', u'NA'), (u"Atlanta's", u'Name'), (u'recent', u'NA'), (u'primary', u'NA'), (u'election', u'NA'), (u'produced', u'NA'), (u'``', u'NA'), (u'no', u'NA'), (u'evidence', u'NA'), (u"''", u"NA"), (u'that', u'NA'), ...]

Here I keep the data format but replace the tagset with my own: NA for Not Available (anything that is not a named entity), NameP for the token preceding a name, Name for the name itself, and so on.
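To make the tagset concrete, here is a minimal sketch of how entities could be read back out of a sentence tagged this way. It assumes that NameC, NameCC and NameCCC mark tokens continuing the preceding Name (the question does not spell this out), and extract_names is a hypothetical helper, not part of NLTK.

```python
def extract_names(tagged_sent):
    """Group each 'Name' token with the 'NameC...' tokens that follow it.

    Assumption: tags starting with 'NameC' continue the current name;
    every other tag (NA, NameP, leftover PoS tags) ends it.
    """
    entities, current = [], []
    for token, tag in tagged_sent:
        if tag == "Name":
            # A new name starts; flush any name still being built.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("NameC") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


sample = [
    ("The", "NameP"), ("Fulton", "Name"), ("County", "NameC"),
    ("Grand", "NameCC"), ("Jury", "NameCCC"), ("said", "VBD"),
    ("Friday", "NR"), ("an", "AT"), ("investigation", "NA"),
    ("of", "NA"), ("Atlanta's", "Name"), ("recent", "NA"),
]
print(extract_names(sample))
# → ['Fulton County Grand Jury', "Atlanta's"]
```

Writing a decoder like this early is a cheap way to check whether the tagset carries enough information to recover entity boundaries before any model is trained.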

I then use this new data as the training set and train the tagger.
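A minimal sketch of that training step, assuming NLTK and NumPy are installed. The two training sentences are made-up toy data in the custom tagset, far too small to give meaningful predictions; they only show the call pattern.

```python
import nltk

# Toy training data in the custom tagset (hypothetical, for illustration only).
train_set = [
    [("The", "NameP"), ("Fulton", "Name"), ("County", "NameC"),
     ("said", "NA"), ("nothing", "NA"), (".", "NA")],
    [("The", "NA"), ("jury", "NA"), ("praised", "NA"),
     ("Atlanta", "Name"), (".", "NA")],
]

# Supervised HMM training, the same call used for PoS tagging in the question.
hmm_tagger = nltk.HiddenMarkovModelTagger.train(train_set)

# Tag a tokenized sentence; the result is a list of (token, tag) pairs.
tokens = ["The", "Fulton", "County", "said", "nothing", "."]
tagged = hmm_tagger.tag(tokens)
print(tagged)
```

The tagger does not care what the labels mean, so the call is identical to PoS training. Whether the predictions are useful depends on the amount of labeled data and on how well an HMM over raw word forms suits NER.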

Is my approach sound, or do I need to change anything major?

Please advise.

Why not use a ready-made NER system, such as CRF-NER or Mallet? Are you doing this for academic purposes, or do you have a business problem that needs to be solved? In the latter case, try something already built to get initial results, and consider your own implementation only if they don't meet your expectations.
