
TensorFlow Word2Vec model running on GPU

In this TensorFlow example, the training of a skip-gram Word2Vec model is described. It contains the following code fragment, which explicitly requires the CPU device for computations, i.e. tf.device('/cpu:0'):

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1  # How many words to consider left and right.
num_skips = 2  # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64  # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Model.
    # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights,
                                   biases=softmax_biases, inputs=embed,
                                   labels=train_labels, num_sampled=num_sampled,
                                   num_classes=vocabulary_size))

    # Optimizer.
    # Note: The optimizer will optimize the softmax_weights AND the embeddings.
    # This is because the embeddings are defined as a variable quantity and the
    # optimizer's `minimize` method will by default modify all variable quantities 
    # that contribute to the tensor it is passed.
    # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

    # Compute the similarity between minibatch examples and all embeddings.
    # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

When trying to switch to the GPU, the following exception is raised:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'Variable_2/Adagrad': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

I wonder why the provided graph cannot be computed on the GPU. Does it happen because of the tf.int32 type? Or should I switch to another optimizer? In other words, is there any way to make it possible to process a Word2Vec model on the GPU (without type casting)?


UPDATE

Following Akshay Agrawal's recommendation, here is an updated fragment of the original code that achieves the required result:

with graph.as_default(), tf.device('/gpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))    
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)

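    # Pin the sampled softmax loss to the CPU: tf.nn.sampled_softmax_loss creates
    # an op that has no GPU kernel (see the explanation below).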
    with tf.device('/cpu:0'):
        loss = tf.reduce_mean(
            tf.nn.sampled_softmax_loss(weights=softmax_weights,
                                       biases=softmax_biases,
                                       inputs=embed,
                                       labels=train_labels,
                                       num_sampled=num_sampled,
                                       num_classes=vocabulary_size))

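    # Adam (unlike Adagrad) supports the sparse apply op produced by
    # differentiating through the embedding lookup.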
    optimizer = tf.train.AdamOptimizer(0.001).minimize(loss)

    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

The error is raised because AdagradOptimizer does not have a GPU kernel for its sparse apply operation; a sparse apply is triggered because differentiating through the embedding lookup results in a sparse gradient.
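To see that sparse gradient concretely, here is a minimal sketch (assuming TF 1.x; the table size, ids and toy loss are made up for illustration): the gradient of a loss with respect to the embedding table is an IndexedSlices object rather than a dense tensor, which is what forces the optimizer onto its sparse apply path.

import tensorflow as tf

emb = tf.Variable(tf.random_uniform([1000, 64], -1.0, 1.0))  # embedding table
ids = tf.constant([3, 7, 42], dtype=tf.int32)                # rows being looked up
looked_up = tf.nn.embedding_lookup(emb, ids)
toy_loss = tf.reduce_sum(looked_up)

grad = tf.gradients(toy_loss, emb)[0]
print(type(grad))  # IndexedSlices, i.e. a sparse gradient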

GradientDescentOptimizer and AdamOptimizer do support sparse apply operations. If you were to switch to one of these optimizers, you would unfortunately see another error: tf.nn.sampled_softmax_loss appears to create an op that does not have a GPU kernel. To get around that, you could wrap the loss = tf.reduce_mean(... line in a with tf.device('/cpu:0'): context, though doing so would introduce CPU-GPU communication overhead.
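As a side note, here is a minimal sketch (assuming TF 1.x) of a session configuration that can help diagnose this kind of placement problem: log_device_placement prints the device chosen for every op, and allow_soft_placement lets TensorFlow fall back to the CPU for ops that lack a GPU kernel instead of raising InvalidArgumentError.

config = tf.ConfigProto(allow_soft_placement=True,   # fall back to CPU where no GPU kernel exists
                        log_device_placement=True)   # print the device assigned to each op
with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    # ... feed train_dataset / train_labels and run the optimizer as usual ...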
