
Watch-your-step model with StellarGraph is not working on a GPU

I am trying to train a large graph embedding with the WatchYourStep algorithm from StellarGraph.

For some reason, the model only trains on the CPU and does not utilize the GPUs.
Setup:

  • TensorFlow-gpu 2.3.1
  • 2 GPUs, CUDA 10.1
  • running inside an nvidia-docker container
  • I know that TensorFlow does find the GPUs (verified with tf.debugging.set_log_device_placement(True)).
  • I have tried running under with tf.device('/GPU:0'):
  • I have tried running it with tf.distribute.MirroredStrategy().
  • I have tried uninstalling tensorflow and reinstalling tensorflow-gpu.

Nevertheless, when running nvidia-smi I don't see any activity on the GPUs, and training is very slow.

How can I debug this?
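One way to narrow this down (my addition, not from the original post) is to ask TensorFlow directly which devices it can see and whether the installed build was compiled with CUDA at all:

```python
import tensorflow as tf

# List the GPUs TensorFlow can actually see; an empty list means the CUDA
# runtime was not found and all ops silently fall back to the CPU.
gpus = tf.config.list_physical_devices('GPU')
print('Visible GPUs:', gpus)

# A CPU-only `tensorflow` wheel returns False here even if the machine
# has GPUs and nvidia-smi works fine.
print('Built with CUDA:', tf.test.is_built_with_cuda())
```

If the list is empty but `nvidia-smi` works, the problem is usually the installed wheel (a CPU-only build shadowing the GPU one) rather than the model code.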

import tensorflow as tf
from tensorflow.keras import Model, optimizers, regularizers
from stellargraph.layer import WatchYourStep
from stellargraph.losses import graph_log_likelihood

def watch_your_step_model():
    '''use the config to generate the WatchYourStep model'''
    cfg = load_config()
    generator           = generator_for_watch_your_step()
    num_walks           = cfg['num_walks']
    embedding_dimension = cfg['embedding_dimension']
    learning_rate       = cfg['learning_rate']
    
    wys = WatchYourStep(
        generator,
        num_walks=num_walks,
        embedding_dimension=embedding_dimension,
        attention_regularizer=regularizers.l2(0.5),
    )
    
    x_in, x_out = wys.in_out_tensors()
    model = Model(inputs=x_in, outputs=x_out)
    model.compile(loss=graph_log_likelihood, optimizer=optimizers.Adam(learning_rate))
    return model, generator, wys

def train_watch_your_step_model(epochs=3000):
    cfg = load_config()
    batch_size      = cfg['batch_size']
    steps_per_epoch = cfg['steps_per_epoch']
    callbacks, checkpoint_file = watch_your_step_callbacks(cfg)
    
    # strategy = tf.distribute.MirroredStrategy()
    # print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    # with strategy.scope():
    
    model, generator, wys = watch_your_step_model()

    train_gen = generator.flow(batch_size=batch_size, num_parallel_calls=8)
    # Dataset.prefetch returns a new dataset, so the result must be reassigned.
    train_gen = train_gen.prefetch(20480000)

    history = model.fit(
        train_gen,
        epochs=epochs,
        verbose=1,
        steps_per_epoch=steps_per_epoch,
        callbacks=callbacks,
    )
     
    copy_last_trained_wys_weights_to_data()
    
    return history, checkpoint_file

with tf.device('/GPU:0'):
    train_watch_your_step_model()

I just followed these instructions: https://github.com/stellargraph/stellargraph/issues/546

It worked for me.

Basically, you have to edit the file setup.py from the stellargraph GitHub repository and remove the tensorflow requirement (lines 25 and 27 of https://github.com/stellargraph/stellargraph/blob/develop/setup.py), so that installing stellargraph does not pull in the CPU-only tensorflow package alongside tensorflow-gpu.
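A quick way to check whether this is the problem in your own environment (my addition, not part of the original answer) is to list which TensorFlow distributions are actually installed; if the CPU-only `tensorflow` package sits next to `tensorflow-gpu`, it can shadow the GPU build:

```python
from importlib.metadata import version, PackageNotFoundError

# If stellargraph's setup.py pulled in the CPU-only 'tensorflow' wheel,
# it can shadow 'tensorflow-gpu' and keep all training on the CPU.
for dist in ('tensorflow', 'tensorflow-gpu'):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, 'not installed')
```

If both show up, uninstall both and reinstall only tensorflow-gpu before reinstalling the patched stellargraph.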
