
How to add compound words to the tagger in NLTK?

Does anyone know how to make the NLTK tagger treat several words as a single term?

For example, when I do:

nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))

It gives me:

[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]

How do I make it put 'Apple' and 'Incorporated' together, so that I get ('Apple Incorporated', 'NNP')?

You could try taking a look at nltk.RegexpParser. It lets you chunk part-of-speech-tagged tokens using regular expressions over the tags. In your example, you could do something like:

import nltk

# Chunk one or more consecutive noun tags into a single NP
pattern = "NP: {<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(t)

This would give you the following tree (shown here in its repr form):

Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
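To get from that tree back to a flat list with ('Apple Incorporated', 'NNP') in it, each NP chunk can be collapsed into one tuple. The following is only a sketch: merge_chunks is a hypothetical helper, not part of NLTK, and giving the merged term the tag of the chunk's last word is an assumption. The tagged list is hard-coded from the question so that no tagger model is needed.

```python
import nltk

def merge_chunks(tree, label="NP"):
    """Collapse each chunk with the given label into one (text, tag) tuple.

    Hypothetical helper: the merged term takes the tag of the chunk's
    last word, which is an assumption, not an NLTK convention.
    """
    merged = []
    for node in tree:
        if isinstance(node, nltk.Tree) and node.label() == label:
            leaves = node.leaves()              # list of (word, tag) tuples
            text = " ".join(word for word, _tag in leaves)
            merged.append((text, leaves[-1][1]))
        else:
            merged.append(node)                 # token outside any chunk
    return merged

# Tagged tokens copied from the question's output
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]

parser = nltk.RegexpParser("NP: {<NN|NNP|NNS|NNPS>+}")
result = merge_chunks(parser.parse(tagged))
print(result)
```

Single-word chunks such as 'company' pass through unchanged apart from losing their NP wrapper.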

The code is doing exactly what it is supposed to do: it adds part-of-speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, so it cannot have a single POS tag applied to it. This is the correct behaviour.
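If a single tag is what you want, one option is to make the phrase a single token before tagging. A minimal sketch using NLTK's MWETokenizer, assuming you already know which multi-word expressions to merge (the one-entry list here is purely illustrative):

```python
from nltk.tokenize import MWETokenizer

# Plain whitespace split, so no tokenizer model has to be downloaded
tokens = "Apple Incorporated is the largest company".split()

# Known multi-word expressions; this list is an illustrative assumption
mwe_tokenizer = MWETokenizer([("Apple", "Incorporated")], separator=" ")

merged_tokens = mwe_tokenizer.tokenize(tokens)
print(merged_tokens)
```

Passing merged_tokens to nltk.pos_tag would then tag 'Apple Incorporated' as one token, though which tag it receives is up to the tagger.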

I wonder if you are using the wrong tool for the job. What are you trying to do, and why? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look at the NLTK collocations module (nltk.collocations).
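As a rough illustration of the collocation route, the sketch below finds word pairs that recur in a token list. The toy sentence pair and the frequency threshold of 2 are assumptions chosen for the example; real text would need a larger corpus and probably a different association measure.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy corpus, made up for illustration
tokens = ("Apple Incorporated is large . "
          "Apple Incorporated makes phones .").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)   # drop bigrams seen fewer than 2 times

# Rank the remaining bigrams by raw frequency and keep up to 5
candidates = finder.nbest(BigramAssocMeasures.raw_freq, 5)
print(candidates)
```

The resulting pairs are candidates for compound terms that you could then merge before tagging.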


 