
Need for context when using Word2Vec

I have a large number of strings in a list. A small example of the list contents is:

["machine learning","Apple","Finance","AI","Funding"]

I wish to convert these into vectors and use them for clustering. Is the context in which these strings appear in sentences considered when finding their respective vectors?

How should I go about getting the vectors for these strings if I have just this list containing them?

This is the code I have so far:

    from gensim.models import Word2Vec
    vec = Word2Vec(mylist)

PS: Also, can anyone point me to a good reference/tutorial on Word2Vec?

Word2Vec is an artificial neural network method. It creates embeddings that reflect the relationships among words. The links below will help you get complete code for implementing Word2Vec.

Some good links are this and this. For the second link, check his GitHub repo for the detailed code; he explains only the major parts in the blog. The main article is this.

You can use the following code to convert words to their corresponding numerical values:

from collections import Counter

# Count word frequencies, then assign smaller integer IDs to more frequent words.
word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
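A quick sanity check of the mapping above on a hypothetical toy word list (the snippet assumes `words` already exists; here it is invented for illustration):

```python
from collections import Counter

# Invented toy word list for illustration.
words = "funding ai funding finance funding ai".split()

word_counts = Counter(words)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

print(vocab_to_int)  # {'funding': 0, 'ai': 1, 'finance': 2}
```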

To find word vectors using word2vec you need a list of sentences, not a list of strings.

What word2vec does is go through every word in a sentence and, for each word, try to predict the words around it within a specified window (commonly around 5), adjusting the vector associated with that word so that the prediction error is minimized.
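The prediction step can be illustrated with a small sketch of how (center, context) training pairs are generated in the skip-gram variant; the helper below is a hypothetical illustration, not gensim code:

```python
def skipgram_pairs(sentence, window=2):
    """Generate (center, context) pairs the way skip-gram training does:
    each word tries to predict the words within `window` positions of it."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skipgram_pairs(["apple", "invests", "in", "ai"], window=1))
# [('apple', 'invests'), ('invests', 'apple'), ('invests', 'in'),
#  ('in', 'invests'), ('in', 'ai'), ('ai', 'in')]
```

Shuffling the sentence changes which pairs are produced, which is exactly why word order matters.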

Obviously, this means the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.

I'm not sure, but I think you will find LDA better suited in this case, because your list of strings doesn't have any inherent order.

Answers to your 2 questions:

  1. Is the context of these strings in the sentences considered while finding out their respective vectors?
    Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression as a unique token, e.g. New York); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (within a window, under the bag-of-words or skip-gram assumption).

  2. How should I go about getting the vectors of these strings if I have just this list containing the strings?
    You need more words. Word2Vec's output quality depends on the size of the training set, so training Word2Vec on just your short list makes no sense.

The links provided by @Beta are a good introduction/explanation.

word2vec + context = doc2vec

Build sentences from the text you have and tag them with labels.

Train doc2vec on the tagged sentences to get vectors for each label, embedded in the same space as the words.

Then you can do vector inference and get labels for an arbitrary piece of text.

