

Word-embedding does not provide expected relations between words

I am trying to train a word embedding on a list of repeated sentences where only the subject changes. I expected that, after training, the vectors corresponding to the subjects would be strongly correlated, as one would expect from a word embedding. However, the cosine similarity between the subject vectors is not always higher than the similarity between a subject and a random word.

Man   is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy   is going to write a very long novel that no one can read.

The code is based on the PyTorch tutorial:

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class EmbedTrainer(nn.Module):
    def __init__(self, d_vocab, d_embed, d_context):
        super(EmbedTrainer, self).__init__()
        self.embed = nn.Embedding(d_vocab, d_embed)
        self.fc_1 = nn.Linear(d_embed * d_context, 128)
        self.fc_2 = nn.Linear(128, d_vocab)

    def forward(self, x):
        x = self.embed(x).view((1, -1)) # flatten after embedding
        x = self.fc_2(F.relu(self.fc_1(x)))
        x = F.log_softmax(x, dim=1)
        return x

text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2

""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

losses = []
epochs = 10
for epoch in range(epochs):
    total_loss = 0
    for input, target in trigrams:
        tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
        target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
        model.zero_grad()
        log_prob = model(tok_ids)
        #if total_loss == 0: print("train ", log_prob, target_id)
        loss = loss_func(log_prob, target_id)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(total_loss)
    losses.append(total_loss)

embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
    embed_map[word] = model.embed.weight[tok_to_ids[word]]
    print(word, embed_map[word])

def angle(a, b):
    # Note: this returns the cosine similarity, not the angle itself
    # (values closer to 1 mean more similar vectors).
    from numpy.linalg import norm
    a, b = a.detach().numpy(), b.detach().numpy()
    return np.dot(a, b) / norm(a) / norm(b)

print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))

I expected that the generated vectors corresponding to the subjects would show a strong correlation after training, as is expected from a word embedding.

I don't really think you'll achieve that kind of result with only 3 sentences and roughly 40 training examples per epoch over 10 epochs (plus most of the data in those 40 examples is repeated).
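
To put numbers on that, here is a quick sketch that simply re-runs the data construction from the question and counts what the model actually sees per epoch:

# Re-create the training text from the question and count the examples.
sentences = ["{} is going to write a very long novel that no one can read.".format(x)
             for x in ["Man", "Woman", "Boy"]]
text_split = " ".join(sentences).split()

print(len(text_split))        # 42 tokens in total
print(len(text_split) - 2)    # 40 (context, target) trigram pairs per epoch
print(len(set(text_split)))   # 16 unique tokens, 13 of them shared by all three sentences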

Maybe try downloading a couple of the free datasets out there, or try your own data with a proven model like a gensim model.

I'll give you the code for training a gensim model, so you can test your dataset on another model and see whether the problem comes from your data or from your model. I've tested similar gensim models on datasets with millions of sentences and they worked like a charm; for smaller datasets you might want to change the parameters.

from gensim.models import Word2Vec
from multiprocessing import cpu_count


corpus_path = 'eachLineASentence.txt'  # plain-text corpus, one sentence per line
vecSize = 300
winSize = 5
numWorkers = cpu_count() - 1
epochs = 20
minCount = 5
skipGram = False                       # False -> CBOW, True -> skip-gram
modelName = 'mymodel.model'

# Note: `size` and `iter` are the gensim 3.x parameter names;
# in gensim >= 4.0 they were renamed to `vector_size` and `epochs`.
model = Word2Vec(corpus_file=corpus_path,
                 size=vecSize,
                 window=winSize,
                 min_count=minCount,
                 workers=numWorkers,
                 iter=epochs,
                 sg=skipGram)
model.save(modelName)
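
Once it's trained, you can sanity-check the expected relations through gensim's similarity API. A minimal sketch, assuming "Man" and "Woman" are in-vocabulary tokens that occur at least min_count times in your corpus (they're just placeholders here):

from gensim.models import Word2Vec

model = Word2Vec.load('mymodel.model')

# Cosine similarity between two in-vocabulary words.
print(model.wv.similarity('Man', 'Woman'))

# The ten nearest words in the embedding space.
print(model.wv.most_similar('Man', topn=10))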

PS: I don't think it's a good idea to use the built-in name input as a variable in your code.

It's most probably the training size. Also, even the 10-dimensional embedding (not to mention the 128-unit hidden layer) is overkill for such a tiny vocabulary. Rule of thumb from the Google Developers blog:

Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25

That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
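
For reference, applying the same rule of thumb to the question's own data (16 unique tokens in the snippet above) recommends an even smaller embedding:

# embedding_dimensions = number_of_categories ** 0.25
print(81 ** 0.25)   # 3.0 -> the blog's example with 81 categories
print(16 ** 0.25)   # 2.0 -> the ~16-token vocabulary from the question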
