Tagger for single words in NLTK

Is there a tagger that returns a single tag for a word, regardless of the context it appears in?

I need to extract words from unstructured text whose sentences do not follow a regular grammar.

POS taggers are meant to work on sentences and return a tag for a word depending on its context in that sentence. So I would either have to use another tagger that gives me the same tag for a particular word every time, or use all the possible tags for a word while chunking.

Any other solutions would be greatly appreciated. Also, how can I view all the tags that can be assigned to a particular word?

See: http://www.nltk.org/_modules/nltk/tag.html

In particular:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

The idea of the UnigramTagger is that it always assigns the tag that was most frequent for that particular word in the training corpus, whatever the surrounding context. Words it never saw during training, such as 'decried' above, get None. Or, as the docs put it just above that piece of code:

"This package defines several taggers, which take a token list (typically a sentence), assign a tag to each token, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:"
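To make the "most frequent tag" idea concrete, here is a minimal pure-Python sketch of the same principle. It is not NLTK's implementation; the function names `train_most_frequent_tag` and `tag_words` and the toy training data are made up for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default=None):
    """Tag each word on its own; words not in the model get `default`."""
    return [(w, model.get(w, default)) for w in words]

# Toy training data (hypothetical, not taken from the Brown corpus).
toy_sents = [
    [("the", "AT"), ("rate", "NN"), ("is", "BEZ"), ("high", "JJ")],
    [("they", "PPSS"), ("rate", "VB"), ("the", "AT"), ("film", "NN")],
    [("the", "AT"), ("high", "JJ"), ("rate", "NN")],
]

model = train_most_frequent_tag(toy_sents)
result = tag_words(["rate", "the", "unemployment"], model)
print(result)
# [('rate', 'NN'), ('the', 'AT'), ('unemployment', None)]
```

Because the lookup ignores neighbouring words, "rate" gets NN every time even though it also occurred as VB in the toy data. Passing a `default` such as "NN" is one way to avoid None for unseen words; in NLTK itself, the `backoff` argument of UnigramTagger serves a similar purpose.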

I am not sure whether there is a built-in way to view all the tags that can be assigned to a particular word. Moreover, in theory that list could be as long as the whole tag set, since the tag depends on context. If you want to get an idea, what I would do is tag your whole vocabulary and print each word together with all the different tags it was assigned in that particular corpus.
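A rough sketch of that last suggestion, assuming you already have an iterable of (word, tag) pairs from a tagged corpus. The helper name `tags_per_word` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def tags_per_word(tagged_words):
    """Collect every distinct tag observed for each word (case-sensitive)."""
    seen = defaultdict(set)
    for word, tag in tagged_words:
        seen[word].add(tag)
    # Sort the tags so the output is stable and easy to read.
    return {word: sorted(tags) for word, tags in seen.items()}

# Hypothetical sample; in practice the pairs would come from a tagged corpus.
sample_pairs = [
    ("rate", "NN"), ("rate", "VB"), ("the", "AT"),
    ("high", "JJ"), ("rate", "NN"), ("high", "RB"),
]

all_tags = tags_per_word(sample_pairs)
for word in sorted(all_tags):
    print(word, '->', all_tags[word])
# high -> ['JJ', 'RB']
# rate -> ['NN', 'VB']
# the -> ['AT']
```

With NLTK and the Brown corpus installed, the same function should accept `brown.tagged_words(categories='news')` directly, since that yields (word, tag) pairs. The result only covers tags seen in that corpus, so it is a lower bound on what a word could be tagged as. NLTK's `ConditionalFreqDist` can give a similar per-word view that also includes counts.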
