
NLTK: most common synonym (WordNet) for each word

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence by replacing each word in it with its most common synonym.

If a word used in the sentence is already the most common word in its group of synonyms, it shouldn't be changed.

Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.

Input: "Hello my valued friend"
Return: "Hi my dear friend"

Synonyms are tricky, but if you are starting out with a synset from WordNet and you simply want to choose the most common member of the set, it's pretty straightforward: build your own frequency list from a corpus, look up each member of the synset in it, and pick the one with the highest count.

NLTK lets you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:

import nltk
from nltk.corpus import brown
# Case-insensitive word counts over the whole Brown corpus
freqs = nltk.FreqDist(w.lower() for w in brown.words())

You can then look up the frequency of a word like this:

>>> print(freqs["valued"]) 
14
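Picking the maximum from a group of synonyms could then look something like the sketch below. It is only an illustration: `most_common_member` is a hypothetical helper name, and the toy counts are made up so the example runs without any corpus. Since `nltk.FreqDist` is a subclass of `collections.Counter`, a table built from a corpus as above could be passed in place of `toy_freqs`.

```python
from collections import Counter

def most_common_member(word, synonyms, freqs):
    """Return the most frequent word among `word` and its `synonyms`.

    The original word is kept unless a synonym is strictly more frequent.
    """
    best = word.lower()
    for candidate in synonyms:
        candidate = candidate.lower()
        if freqs[candidate] > freqs[best]:
            best = candidate
    return best

# Made-up counts for illustration only; not taken from any corpus.
toy_freqs = Counter({"hi": 30, "hello": 12, "friend": 50, "ally": 4})

print(most_common_member("hello", ["hi"], toy_freqs))     # hi
print(most_common_member("friend", ["ally"], toy_freqs))  # friend
```

Starting from the original word and requiring a strictly higher count means a word that is already the most common in its group is left alone.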

Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (WordNet provides n, v, a, and r, for noun, verb, adjective and adverb respectively), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.

>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...         brown.tagged_words(tagset="universal"))

>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
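The tagset adjustment and the POS-specific lookup might be sketched as follows. The names `WN_TO_UNIVERSAL` and `most_common_for_pos` are hypothetical, and treating WordNet's `s` (satellite adjective) code as a plain adjective is an assumption made here. `toy_freq2` is a plain dict standing in for `freq2`, holding only the two ADJ counts printed above.

```python
from collections import Counter

# WordNet POS codes mapped to universal tagset labels.
# "s" (satellite adjective) is treated as an adjective here (an assumption).
WN_TO_UNIVERSAL = {"n": "NOUN", "v": "VERB", "a": "ADJ", "s": "ADJ", "r": "ADV"}

def most_common_for_pos(candidates, wn_pos, cond_freqs):
    """Pick the candidate with the highest count under the given WordNet POS."""
    tag = WN_TO_UNIVERSAL[wn_pos]
    counts = cond_freqs.get(tag, Counter())
    # max() returns the first maximal item, so earlier candidates win ties.
    return max(candidates, key=lambda w: counts[w.lower()])

# Stand-in for freq2; the two ADJ counts are the ones shown above.
toy_freq2 = {"ADJ": Counter({"valued": 0, "dear": 45})}

print(most_common_for_pos(["valued", "dear"], "a", toy_freq2))  # dear
```

Listing the original word first in `candidates` keeps it in place whenever no synonym has a higher count for that part of speech.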

Synonyms are a huge and open area of work in natural language processing.

In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.

Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
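As a rough illustration of the bigram idea, the sketch below counts adjacent word pairs and picks the candidate seen most often before the following word. `bigram_counts` and `best_in_context` are hypothetical helper names, and the token list is made up; real counts would have to come from a large corpus.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a list of tokens."""
    lowered = [t.lower() for t in tokens]
    return Counter(zip(lowered, lowered[1:]))

def best_in_context(candidates, next_word, bigrams):
    """Choose the candidate seen most often immediately before `next_word`."""
    return max(candidates, key=lambda w: bigrams[(w.lower(), next_word.lower())])

# A tiny made-up token list; a real corpus would be far larger.
tokens = ("dear friend , dear friend , valued customer , "
          "valued customer , valued friend").split()
bigrams = bigram_counts(tokens)

print(best_in_context(["valued", "dear"], "friend", bigrams))    # dear
print(best_in_context(["valued", "dear"], "customer", bigrams))  # valued
```

This only looks one word to the right; a wider window or the preceding word could be scored the same way.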

Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).

I recommend you take a look at applications of WordNet ( https://wordnet.princeton.edu/ ), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!

Edit: I should have included this link to an older question as well:

How to get synonyms from nltk WordNet Python

And the NLTK documentation on its interface with WordNet:

http://www.nltk.org/howto/wordnet.html

I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like the one above, though.

The other answer shows you how to use synonyms:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('small')
[Synset('small.n.01'), Synset('small.n.02'), Synset('small.a.01'), Synset('minor.s.10'), Synset('little.s.03'), Synset('small.s.04'), Synset('humble.s.01'), Synset('little.s.07'), Synset('little.s.05'), Synset('small.s.08'), Synset('modest.s.02'), Synset('belittled.s.01'), Synset('small.r.01')]
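To turn a list of synsets like this into candidate words, one option is to collect their lemma names, roughly as sketched below. `candidate_lemmas` is a hypothetical helper that relies only on the `pos()` and `lemma_names()` methods of NLTK's synset objects. `FakeSynset` and its lemma lists are made up so the sketch runs without the WordNet data; with NLTK installed you would pass the result of `wn.synsets(word)` instead.

```python
def candidate_lemmas(synsets, pos=None):
    """Collect distinct lemma names from synsets, in first-seen order.

    `synsets` is any iterable of objects with `pos()` and `lemma_names()`
    methods, such as the list returned by `wn.synsets(word)`.
    """
    seen = []
    for synset in synsets:
        if pos is not None and synset.pos() != pos:
            continue
        for name in synset.lemma_names():
            # WordNet joins multiword lemmas with underscores.
            name = name.replace("_", " ")
            if name not in seen:
                seen.append(name)
    return seen

class FakeSynset:
    """Hypothetical stand-in so the sketch runs without WordNet data."""
    def __init__(self, pos, names):
        self._pos, self._names = pos, names
    def pos(self):
        return self._pos
    def lemma_names(self):
        return self._names

# Illustrative lemma lists only; not copied from WordNet.
fake = [FakeSynset("n", ["small"]), FakeSynset("a", ["small", "little"]),
        FakeSynset("s", ["minor", "modest", "small"])]
print(candidate_lemmas(fake, pos="a"))  # ['small', 'little']
```

Pooling lemmas across every synset mixes different senses of the word, so filtering by part of speech (or picking a single synset) keeps the candidate list closer to true synonyms.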

You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining which synonym is the most common. This question is highly domain dependent. Most common synonym where? In literature? In common vernacular? In technical speak?

Like you, I wanted to get an idea of how the English language was used. I downloaded 15,000 entire books from Project Gutenberg and processed the word and letter pair frequencies on all of them. After ingesting such a large corpus, I could see which words were used most commonly. Like I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from what you have to eventually process.
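Counting words over your own domain text can be sketched with the standard library alone. This is not the code used for the Gutenberg books; `count_words` is a hypothetical helper, the regular expression is a deliberately crude tokenizer, and the two sample strings are made up stand-ins for real documents.

```python
import re
from collections import Counter

# Crude tokenizer: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")

def count_words(texts):
    """Count lowercase word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

# Two made-up snippets standing in for documents from the target domain.
sample = ["Hi there, hi again!", "Hello to a dear friend."]
freqs = count_words(sample)

print(freqs["hi"])     # 2
print(freqs["hello"])  # 1
```

Because `texts` can be any iterable, documents can be streamed from disk one at a time instead of being loaded into memory together.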

