NLTK Wordnet Synset for word phrase

I'm working with the Python NLTK Wordnet API. I'm trying to find the best synset that represents a group of words.

If I need to find the best synset for something like "school & office supplies", I'm not sure how to go about this. So far I've tried finding the synsets for the individual words and then computing the best lowest common hypernym like this:

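import itertools

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet


def find_best_synset(category_name):
    # Tokenize the phrase and tag each token with its part of speech.
    text = word_tokenize(category_name)
    tags = pos_tag(text)

    # Collect the candidate synsets of every word that has a WordNet POS.
    # get_wordnet_pos is a helper (not shown) that maps a tag to a WordNet POS.
    node_synsets = []
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if not pos:
            continue
        node_synsets.append(wordnet.synsets(word, pos=pos))

    max_score = 0
    max_synset = None
    max_combination = None
    # Try every way of picking one synset per word ...
    for combination in itertools.product(*node_synsets):
        # ... and score every pair of synsets within that pick.
        for test in itertools.combinations(combination, 2):
            score = wordnet.path_similarity(test[0], test[1])
            # path_similarity returns None when no path connects the synsets.
            if score is not None and score > max_score:
                max_score = score
                max_combination = test
                max_synset = test[0].lowest_common_hypernyms(test[1])
    return max_synset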
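
The snippet calls get_wordnet_pos but does not define it. A minimal sketch of what such a helper commonly looks like, assuming pos_tag returns Penn Treebank tags and using WordNet's single-letter POS codes directly (the same values as wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV); the actual helper used here may differ:

```python
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if none applies."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective, same value as wordnet.ADJ
    if treebank_tag.startswith("V"):
        return "v"  # verb, same value as wordnet.VERB
    if treebank_tag.startswith("N"):
        return "n"  # noun, same value as wordnet.NOUN
    if treebank_tag.startswith("RB"):
        return "r"  # adverb, same value as wordnet.ADV
    # Conjunctions, determiners, punctuation, etc. have no WordNet entry.
    return None
```

With a helper like this, the "&" in "school & office supplies" is tagged as a conjunction, maps to None, and is skipped by the loop.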

However, this doesn't work very well, and it is very costly. Are there any ways to figure out which synset best represents multiple words together?

Thanks for your help!

Apart from what I said in the comments already, I think the way you select the best hypernym might be flawed. The synset you end up with is not the lowest common hypernym of all the words, but only that of two of them.

Let's stick with your example of "school & office supplies". For each word in the expression you get a number of synsets. So the variable node_synsets will look something like the following:

[[school_1, school_2], [office_1, office_2, office_3], [supply_1]]

In this example, there are 2 × 3 × 1 = 6 ways to pick one synset for each word:

[(school_1, office_1, supply_1),
 (school_1, office_2, supply_1),
 (school_1, office_3, supply_1),
 (school_2, office_1, supply_1),
 (school_2, office_2, supply_1),
 (school_2, office_3, supply_1)]

These triples are what you iterate over in the outer for loop (with itertools.product). If the expression has 4 words, you would iterate over quadruples, with 5 words over quintuples, etc.

Now, with the inner for loop, you split each triple into pairs. For the first triple, the pairs are:
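
To make the outer loop concrete, here is a small sketch that uses placeholder strings in place of real Synset objects (the names school_1, office_1, etc. are the illustrative ones from above, not actual WordNet identifiers):

```python
import itertools

# Placeholder strings stand in for real Synset objects.
node_synsets = [
    ["school_1", "school_2"],
    ["office_1", "office_2", "office_3"],
    ["supply_1"],
]

# One tuple per way of choosing a single synset for every word.
triples = list(itertools.product(*node_synsets))
for triple in triples:
    print(triple)

print(len(triples))  # 2 * 3 * 1 = 6
```

The number of tuples is the product of the list lengths, so it grows quickly for longer expressions or for words with many senses.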

[(school_1, office_1),
 (school_1, supply_1),
 (office_1, supply_1)]

... and you determine the lowest common hypernym of each pair. Since only the best-scoring pair is kept, in the end you get the lowest common hypernym of, say, school_2 and office_1, which might be some kind of institution. This is probably not very meaningful, as it doesn't consider any synset of the last word.

Maybe you should try to find the lowest common hypernym of all three words, in each combination of their synsets, and take the one scoring best among them.
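
The pairing step can be sketched the same way, again with placeholder strings rather than real synsets. It shows that every pair drops one word of the expression:

```python
import itertools

triple = ("school_1", "office_1", "supply_1")

# The inner loop of the question's code visits these pairs.
pairs = list(itertools.combinations(triple, 2))
print(pairs)
# [('school_1', 'office_1'), ('school_1', 'supply_1'), ('office_1', 'supply_1')]

# Whichever pair wins, one word of the expression plays no part in the result.
for pair in pairs:
    left_out = [name for name in triple if name not in pair]
    print(pair, "ignores", left_out)
```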
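
A rough sketch of that idea follows. It relies only on the Synset methods hypernym_paths() and max_depth(), so node_synsets is expected to be the list of lists built in the question. Two things here are my assumptions, not something stated above: the function names are made up for illustration, and "scoring best" is taken to mean "deepest in the hierarchy" (most specific). A different score may well suit you better.

```python
import itertools


def common_hypernyms_of_all(synsets):
    """Return the set of hypernyms shared by every synset in `synsets`."""
    shared = None
    for synset in synsets:
        # hypernym_paths() lists every root-to-synset path; flatten into one set.
        ancestors = {node for path in synset.hypernym_paths() for node in path}
        shared = ancestors if shared is None else shared & ancestors
    return shared or set()


def best_common_hypernym(node_synsets):
    """Pick one synset per word; keep the deepest hypernym common to all picks."""
    best, best_depth = None, -1
    for combination in itertools.product(*node_synsets):
        for candidate in common_hypernyms_of_all(combination):
            # Assumption: a deeper (more specific) common hypernym scores better.
            depth = candidate.max_depth()
            if depth > best_depth:
                best, best_depth = candidate, depth
    return best
```

This still walks over every combination, so it does not address the cost; it only makes sure that every word contributes to the chosen synset. It returns None when some word has no synsets at all.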
