
Using Word2Vec for word embedding of sentences

I am trying to create an emotion recognition model, and for that I am using Word2Vec. I have a tokenized pandas DataFrame x_train['Utterance'], and I have used

model = gensim.models.Word2Vec(x_train['Utterance'], min_count = 1, vector_size = 100)

to create a vocabulary. Then, I created a dictionary, embeddings_index, with the words as keys and their embedding vectors as values. I also created a new column in my DataFrame where every word is replaced by its corresponding vector:

x_train['vector'] = x_train['Utterance'].explode().map(embeddings_index).groupby(level=0).agg(list)
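
The dictionary itself can be built straight from the trained model's word vectors; a rough sketch of that step, assuming gensim 4.x:

embeddings_index = {word: model.wv[word] for word in model.wv.index_to_key}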

Finally, I used pad_sequences so that each instance of the data set is padded to the length of the longest instance (the data set is made of sentences of different lengths):

x_train['vector'] = tf.keras.utils.pad_sequences(x_train.vector, maxlen = 30, dtype='float64', padding='post', truncating='post', value=0).tolist()

If min_count = 1 (one of the Word2Vec parameters), everything works and x_train['vector'] is what I intend: a column with the embedding vectors of the tokenized sentences in x_train['Utterance']. However, when min_count != 1, the created vocabulary only contains the words that appear at least min_count times in x_train['Utterance']. Because of this, when creating x_train['vector'] by mapping the dictionary embeddings_index, the new column contains lists like [nan, [0.20900646, 0.76452744, 2.3117824], [0...., where nan corresponds to words that are not in the dictionary. Because of these nan values, when using tf.keras.utils.pad_sequences I get the following error message: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

I would like to remove the nan entries from each list, but I have not been able to. I tried fillna(''), but it only removes the nan while leaving an empty element in the list. Any idea?

It seems the problem may be that x_train['Utterance'] includes a bunch of words that (after min_count trimming) aren't in the model. As a result, you may both be miscalculating the true longest text (because you're counting unknown words) and getting some nonsense values (where no word-vector was available for a low-frequency word).

The simplest fix would be to stop using the original x_train['Utterance'] as your texts for steps that are limited to the smaller vocabulary of only those words with word-vectors. Instead, pre-filter those texts to eliminate words not present in the word-vector model. For example:

cleaned_texts = [[word for word in text if word in model.wv] 
                 for text in x_train['Utterance']]

Then, only use cleaned_texts for anything driving word-vector lookups, including your calculation of the longest text.
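
For example, a sketch of redoing the length calculation, the vector lookup and the padding from cleaned_texts (column names are carried over from the question; float32 matches the precision of the trained vectors, and this assumes no utterance ends up empty after filtering):

# longest text, counting only words that actually have word-vectors
max_len = max(len(text) for text in cleaned_texts)

# rebuild the vector column from the filtered texts, then pad as before
x_train['vector'] = [[model.wv[word] for word in text] for text in cleaned_texts]
x_train['vector'] = tf.keras.utils.pad_sequences(x_train.vector, maxlen = max_len, dtype='float32', padding='post', truncating='post', value=0).tolist()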

Other notes:

  • you probably don't need to create your own embeddings_index dict-like object: the Word2Vec model already offers a dict-like interface, returning a word-vector per lookup key, via the KeyedVectors instance in its .wv property (see the short sketch after these notes).

  • if your other libraries or hardware considerations don't require float64 values, you might just want to stick with float32-width values – that's what the Word2Vec model trains into word-vectors, they take half as much memory, and results from these kinds of models are rarely improved, and sometimes slowed, by using higher precision.

  • you could also consider creating a FastText model instead of plain Word2Vec – such a model will always return a vector, even for unknown words, synthesized from the word-fragment vectors it learns while training.
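
A short sketch of the first and last notes (the lookup words here are placeholders – use tokens that actually occur in your data):

from gensim.models import FastText

# model.wv is a KeyedVectors instance and already behaves like a dict
vec = model.wv['happy']        # word -> 100-dimensional float32 vector
known = 'happy' in model.wv    # the same membership test used in the filtering above

# a FastText model can synthesize a vector even for words it never saw,
# built from the character n-gram vectors it learns during training
ft_model = FastText(x_train['Utterance'], min_count=5, vector_size=100)
oov_vec = ft_model.wv['happpy']   # unseen/misspelled word still gets a vector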
