
How to calculate the similarity of English words that do not appear in WordNet?

A common natural language processing task is to calculate the similarity between two words using WordNet. I start my question with the following Python code:

from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))

We will get 0.8421

Now what if I look up "haha" and "lol" as follows:

haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)

We will get

[]
[]

Then we cannot compute the similarity between them. What can we do in this case?

You can create a semantic space from co-occurrence matrices using a tool like DISSECT (DIStributional SEmantics Composition Toolkit), and then you are set to measure semantic similarity between words or phrases (if you compose words).

In your case, for "haha" and "lol", you'll need to collect those co-occurrences.
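
As a minimal sketch of that distributional idea (not DISSECT itself): count how often words co-occur within a small context window over a corpus, then compare the resulting count vectors with cosine similarity. The toy corpus and window size below are made up for illustration:

from collections import Counter, defaultdict
from math import sqrt

# Toy corpus; in practice you would use a large collection of informal text
corpus = [
    "haha that was funny lol".split(),
    "lol so funny haha".split(),
    "badminton is a fun sport".split(),
]

window = 2  # arbitrary context window size for this sketch
cooc = defaultdict(Counter)
for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[word][sent[j]] += 1

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts)
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(cooc["haha"], cooc["lol"]))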

Another thing to try is word2vec.

There are two architectures for it:

CBOW: continuous bag-of-words (predicts a target word from its surrounding context)

skip-gram model: the reverse of the CBOW model (predicts the surrounding context from the target word)

For a layman's explanation, look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms

These models are well presented here: https://www.tensorflow.org/tutorials/word2vec. Also, gensim is a good Python library for doing these things; a minimal training sketch follows.
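
Here is a rough sketch of what training looks like in gensim (assuming gensim 4.x; the toy sentences and parameter values are only illustrative):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; real training needs a much larger corpus
sentences = [
    ["haha", "that", "was", "funny", "lol"],
    ["lol", "so", "funny", "haha"],
    ["badminton", "is", "a", "fun", "sport"],
]

# sg=0 selects CBOW, sg=1 selects the skip-gram architecture
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv.similarity("haha", "lol"))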


Try looking for TensorFlow solutions, for example this one: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py

Or read up on word2vec: https://en.wikipedia.org/wiki/Word2vec

There are different models for measuring similarity, such as word2vec or GloVe, but what you seem to need most is a corpus that includes social, informal phrases like 'lol'.

However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.

The foundational concept of word2vec (and of other word embedding models like GloVe) is the representation of words in a vector space that captures the relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well written and concise.

Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of coverage you seek. There are some pre-trained models floating around on the web, such as this bunch. It is really the training set, rather than the model, that determines how wide a variety of terms you can query.

You can certainly use those pre-trained models if they include social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as the word embedding model.
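
For instance, gensim ships a downloader for a catalog of pre-trained vectors, including GloVe vectors trained on Twitter, which tend to cover informal tokens like 'lol'. A sketch, assuming the 'glove-twitter-25' entry in the gensim-data catalog:

import gensim.downloader as api

# Downloads on first use: 25-dimensional GloVe vectors trained on tweets
vectors = api.load("glove-twitter-25")

# Guard against out-of-vocabulary words before querying
for word in ("haha", "lol"):
    if word not in vectors:
        print(word, "is not in the vocabulary")

print(vectors.similarity("haha", "lol"))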

You can use other frameworks. I also tried NLTK, but finally landed on spacy (spacy.io), a very fast and functional framework. Words have a method called 'similarity' that compares them to other words (it also works for sentences, docs, etc.), implemented on top of word2vec-style word vectors. Actually, I don't know how big their vocabulary is or how it copes when a word is unknown, but it might be worth a try.
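
A sketch of that usage (assuming a spacy model that ships word vectors, such as en_core_web_md, is installed; the small models have no real vectors, so similarity over them is unreliable):

import spacy

# Requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

haha = nlp("haha")
lol = nlp("lol")

# Cosine similarity over the underlying word vectors;
# words without a vector fall back to zeros and trigger a warning
print(haha.similarity(lol))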

I was also playing a little bit with this one: https://radimrehurek.com/gensim/models/word2vec.html, where in 2 lines you can load Google's big word2vec model (this project ports Google's word2vec C++ library into Python), accessible here: https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
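
The two-line load looks roughly like this (assuming the downloaded file is the GoogleNews-vectors-negative300.bin.gz binary; adjust the path to wherever you saved it):

from gensim.models import KeyedVectors

# C-format binary from Google's word2vec tool; 'limit' caps memory use
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=500000)

print(model.similarity("sport", "badminton"))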
