How do I add compound words to the tagger in NLTK?

Question

I was wondering whether anyone knows how to combine multiple terms into a single term in NLTK's taggers.

For example, when I run:

nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))

It gives me:

[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]

How do I make it put 'Apple' and 'Incorporated' together as ('Apple Incorporated', 'NNP')?

Answer 1

You could take a look at nltk.RegexpParser. It lets you chunk part-of-speech-tagged content using regular expressions over the tags. In your example, you could do something like:

import nltk

# Chunk one or more consecutive nouns (common or proper) into an NP
pattern = "NP: {<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(repr(t))

This would give you:

Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
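
To get from that tree to the ('Apple Incorporated', 'NNP') pair asked for in the question, one option is to walk the top level of the tree and join the leaves of each chunk. The following is only a sketch: the helper name merge_chunks is made up for illustration, the grammar is narrowed to proper nouns, and reusing the last token's tag for the merged entry is an assumption, not something NLTK does for you. The tagged input is hard-coded from the question so the example runs without downloading a tagger model.

```python
import nltk

def merge_chunks(tagged, grammar="NP: {<NNP>+}"):
    """Join each run of consecutive proper nouns into one (text, tag) pair.

    Hypothetical helper for illustration; not part of NLTK.
    """
    tree = nltk.RegexpParser(grammar).parse(tagged)
    merged = []
    for node in tree:
        if isinstance(node, nltk.Tree):
            leaves = node.leaves()  # list of (word, tag) tuples in the chunk
            text = " ".join(word for word, _ in leaves)
            # Assumption: the merged token takes the tag of its last word
            merged.append((text, leaves[-1][1]))
        else:
            merged.append(node)  # unchunked (word, tag) tuple, kept as-is
    return merged

# Tagger output copied from the question
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
result = merge_chunks(tagged)
print(result)
```

Note that this merges after tagging, so the tagger itself still sees two tokens; whether that is acceptable depends on what the merged pairs are used for.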

Answer 2

The code is doing exactly what it is supposed to do: it adds part-of-speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.

I wonder if you are trying to use the wrong tool for the job. What are you trying to do, and why? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look here: collocations module
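
As a rough illustration of the collocations idea, the sketch below finds frequent bigrams with nltk.collocations and then joins them into single tokens before any tagging happens. The toy sentence list and the frequency threshold of 3 are made up for the example, and using MWETokenizer for the merge step is my own suggestion rather than something this answer proposes. Plain str.split is used so that no tokenizer data needs to be downloaded.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import MWETokenizer

# Toy corpus (made up); a real one would be much larger
text = ("Apple Incorporated is large . "
        "Apple Incorporated makes phones . "
        "Apple Incorporated sells phones .")
words = text.split()

# Score adjacent word pairs and keep those seen at least 3 times
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)
found = finder.nbest(BigramAssocMeasures.raw_freq, 5)
print(found)

# Re-tokenize so each collocation becomes a single token
mwe_tokenizer = MWETokenizer(found, separator=" ")
merged = mwe_tokenizer.tokenize(words)
print(merged[:3])
```

The merged token list could then be passed to nltk.pos_tag, although a tagger trained on single words may not assign a sensible tag to a token that contains a space.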

2 answers

Answer 1
1 2013-06-10 12:25:55

Answer 2
0 2013-06-11 14:33:22