
Extract word from a list of synsets in NLTK for Python

Using [x for x in wn.all_synsets('n')], I am able to get a list, allnouns, of all noun synsets from WordNet with the help of NLTK.

The list allnouns looks like this: Synset('pile.n.01'), Synset('compost_heap.n.01'), Synset('mass.n.03') and so on. I can get any element by index, for example allnouns[2], which should be Synset('mass.n.03').

I would like to extract only the word mass, but I cannot treat the synset like a string. Everything I try produces AttributeError: 'Synset' object has no attribute, or TypeError: 'Synset' object is not subscriptable, or <bound method Synset.name of Synset('mass.n.03')> if I try to use .name or .pos.

How about trying this solution:

>>> from nltk.corpus import wordnet as wn
>>> wn.synset('mass.n.03').name().split(".")[0]
'mass'

For your case:

>>> allnouns = [x for x in wn.all_synsets('n')]

The item at index 23 is Synset('substance.n.07'). You can extract the word from its name like this:

>>> allnouns[23].name().split(".")[0]
'substance'

If you want only the word part of the name for every noun synset, use:

>>> [x.name().split(".")[0] for x in wn.all_synsets('n')]

This should give exactly the result you need.

Note: In NLTK's WordNet interface, name is a method, not an attribute, so it has to be called as name().
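
The <bound method ...> output from the question can be reproduced without NLTK. The sketch below uses FakeSynset, a hypothetical stand-in class written only for this illustration (it is not part of NLTK), to show the difference between referring to a method and calling it:

```python
class FakeSynset:
    """Hypothetical stand-in for an NLTK Synset; NOT part of NLTK."""

    def __init__(self, name):
        self._name = name

    def name(self):
        # Like Synset.name(), returns the full 'lemma.pos.nn' string.
        return self._name

    def __repr__(self):
        return "Synset('%s')" % self._name


synset = FakeSynset("mass.n.03")

# Without parentheses you get the method object itself, which is what
# prints as <bound method ...> in the interpreter.
method_reference = synset.name

# With parentheses the method runs and returns a plain string.
called_value = synset.name()

# Only the string can be split to recover the word.
word = called_value.split(".")[0]
print(word)
```

With a real Synset the same rule should apply: write synset.name(), not synset.name.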

Use Synset.name() to get the name of the synset; the part before the first period is its canonical lemma:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('mass', 'n')
[Synset('mass.n.01'), Synset('batch.n.02'), Synset('mass.n.03'), Synset('mass.n.04'), Synset('mass.n.05'), Synset('multitude.n.03'), Synset('bulk.n.02'), Synset('mass.n.08'), Synset('mass.n.09')]
>>> wn.synsets('mass', 'n')[0]
Synset('mass.n.01')
>>> wn.synsets('mass', 'n')[0].name()
u'mass.n.01'
>>> wn.synsets('mass', 'n')[0].name().split('.')[0]
u'mass'
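
The split('.')[0] idiom above works as long as the lemma part contains no period. As a slightly more defensive sketch, assuming synset names always end in '.pos.nn', one could split from the right instead. The helper name split_synset_name and the dotted test name are made up for this illustration:

```python
def split_synset_name(synset_name):
    """Split a 'lemma.pos.nn' string into (lemma, pos, sense_number).

    Splitting from the right keeps any periods inside the lemma intact.
    Assumes the name ends in '.<pos>.<number>'.
    """
    lemma, pos, sense = synset_name.rsplit(".", 2)
    return lemma, pos, int(sense)


parts = split_synset_name("mass.n.03")
print(parts)

# Hypothetical name with a period inside the lemma, for illustration only.
dotted = split_synset_name("a.b.n.01")
print(dotted)
```

With NLTK this would be called as split_synset_name(some_synset.name()).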

But do note that a synset is sometimes made up of several lemmas, so if you want the surface word forms of a synset, you should use Synset.lemma_names() to access all of its lemmas:

>>> wn.synsets('mass', 'n')[0].lemmas()
[Lemma('mass.n.01.mass')]
>>> wn.synsets('mass', 'n')[0].lemma_names()
[u'mass']
>>> wn.synsets('mass', 'n')[0].definition()
u'the property of a body that causes it to have weight in a gravitational field'

In the wn.synsets('mass', 'n')[0] case there is only one lemma attached to the synset. But sometimes there is more than one, e.g.:

>>> wn.synsets('mass', 'n')[1].lemma_names()
[u'batch', u'deal', u'flock', u'good_deal', u'great_deal', u'hatful', u'heap', u'lot', u'mass', u'mess', u'mickle', u'mint', u'mountain', u'muckle', u'passel', u'peck', u'pile', u'plenty', u'pot', u'quite_a_little', u'raft', u'sight', u'slew', u'spate', u'stack', u'tidy_sum', u'wad']
>>> wn.synsets('mass', 'n')[1].definition()
u"(often followed by `of') a large number or amount or extent"

And to extract the list of all noun words in WordNet, you can try:

>>> from itertools import chain
>>> set(chain(*[i.lemma_names() for i in wn.all_synsets('n')]))
>>> len(set(chain(*[i.lemma_names() for i in wn.all_synsets('n')])))
119034

See Making a flat list out of list of lists in Python
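
The flattening step can also be tried without loading WordNet. The sketch below uses a small hand-written sample (lemma_lists is illustrative data, not real output) in place of the per-synset lemma_names() lists, and chain.from_iterable as an alternative to unpacking with chain(*...):

```python
from itertools import chain

# Sample data standing in for [s.lemma_names() for s in wn.all_synsets('n')].
lemma_lists = [
    ["mass"],
    ["batch", "deal", "mass"],
    ["pile", "heap"],
]

# chain.from_iterable flattens one level of nesting lazily;
# set() then removes words that appear in more than one synset.
unique_words = set(chain.from_iterable(lemma_lists))

print(sorted(unique_words))
print(len(unique_words))

# With NLTK the same pattern would presumably be:
#   set(chain.from_iterable(s.lemma_names() for s in wn.all_synsets('n')))
```

Passing a generator to chain.from_iterable avoids building the intermediate list of lists, which may matter for a corpus of this size.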
