
Tensorflow distributed training hangs with RecvTensor cancelled warning

I am training a model in parameter-server distributed mode using the Estimator API of TensorFlow 2.1.0. The code that launches training is as follows:

# model_fn is defined with the model logic
GATNE = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.model_dir, params=model_params, config=config)

# obtain different train and eval files for different workers
# files are stored on shared hdfs
train_files, eval_files = get_files_for_worker()

# the input_fn raises an end-of-input exception after the given number of epochs
train_spec = tf.estimator.TrainSpec(
            input_fn=lambda: input_fn(train_files, batch_size=args.batch_size, num_epochs=args.epochs),
            max_steps=None)

eval_spec = tf.estimator.EvalSpec(
            input_fn=lambda: input_fn(eval_files, batch_size=args.batch_size, num_epochs=1),
            steps=1000,
            exporters=exporter,
            throttle_secs=300)

tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)

print("Done.")

I am training the model with 2 PS servers and 5 workers (indices ranging from 0 to 4). Worker 0 is the chief, worker 1 is the evaluator, and the rest are ordinary training workers.
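
Each task is configured through the TF_CONFIG environment variable. A minimal sketch of what this layout might look like (host names and ports are placeholders, and the exact mapping from my worker indices to TF task types is an assumption):

import json
import os

# Hypothetical TF_CONFIG layout for 2 PS servers and 5 workers, where
# worker 0 acts as the chief and worker 1 as the evaluator.
cluster = {
    "chief": ["worker0-host:2222"],
    "worker": ["worker2-host:2222", "worker3-host:2222", "worker4-host:2222"],
    "ps": ["ps0-host:2222", "ps1-host:2222"],
}

# On an ordinary training worker (e.g. worker 2 -> "worker" task index 0):
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

# The evaluator sets {"type": "evaluator", "index": 0} as its task and is
# typically not listed in the cluster spec.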

After the training job started, worker 4 finished its training task successfully. Here are some of this worker's logs (during training I added log messages such as ******************* [2021-04-01 12:05] global_step: 4331 ************************* to track the steps):

..............
2021-04-01 12:05:13.483 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4331 *************************
2021-04-01 12:05:55.315 INFO:tensorflow:******************* [2021-04-01 12:05] global_step: 4535 *************************
2021-04-01 12:06:40.621 INFO:tensorflow:******************* [2021-04-01 12:06] global_step: 4733 *************************
2021-04-01 12:07:30.341 INFO:tensorflow:******************* [2021-04-01 12:07] global_step: 4936 *************************
2021-04-01 12:08:16.981 INFO:tensorflow:loss = 59.152977, step = 5136 (434.022 sec)
2021-04-01 12:08:16.981 INFO:tensorflow:******************* [2021-04-01 12:08] global_step: 5136 *************************
2021-04-01 12:09:04.206 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5337 *************************
2021-04-01 12:09:51.85 INFO:tensorflow:******************* [2021-04-01 12:09] global_step: 5537 *************************
2021-04-01 12:10:38.679 INFO:tensorflow:******************* [2021-04-01 12:10] global_step: 5742 *************************
2021-04-01 12:11:25.781 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5948 *************************
2021-04-01 12:12:07.822 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6123 *************************
2021-04-01 12:12:54.258 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6327 *************************
2021-04-01 12:13:41.683 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6528 *************************
2021-04-01 12:14:21.60 INFO:tensorflow:Loss for final step: 62.46709.
2021-04-01 12:14:21.60 Done.

However, all the other workers get the following warning: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842 and then hang forever.

Part of worker 2's log:

......
2021-04-01 12:11:14.380 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 5899 *************************
2021-04-01 12:11:56.730 INFO:tensorflow:******************* [2021-04-01 12:11] global_step: 6077 *************************
2021-04-01 12:12:39.956 INFO:tensorflow:******************* [2021-04-01 12:12] global_step: 6269 *************************
2021-04-01 12:13:26.957 INFO:tensorflow:******************* [2021-04-01 12:13] global_step: 6468 *************************
2021-04-01 12:14:14.77 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6668 *************************
2021-04-01 12:14:50.611 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6821 *************************
2021-04-01 12:15:19.129 2021-04-01 12:15:19.128988: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.129 2021-04-01 12:15:19.129495: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132203: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132261: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.132 2021-04-01 12:15:19.132388: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842

And worker 3's:

2021-04-01 12:14:22.586 INFO:tensorflow:loss = 60.513123, step = 6703 (462.406 sec)
2021-04-01 12:14:22.586 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6703 *************************
2021-04-01 12:14:58.652 INFO:tensorflow:******************* [2021-04-01 12:14] global_step: 6855 *************************
2021-04-01 12:15:30.105 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 6984 *************************
2021-04-01 12:15:53.527 INFO:tensorflow:******************* [2021-04-01 12:15] global_step: 7084 *************************
2021-04-01 12:16:17.568 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7184 *************************
2021-04-01 12:16:41.582 INFO:tensorflow:******************* [2021-04-01 12:16] global_step: 7284 *************************
2021-04-01 12:17:05.140 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7384 *************************
2021-04-01 12:17:22.976 INFO:tensorflow:******************* [2021-04-01 12:17] global_step: 7485 *************************
2021-04-01 12:17:36.575 2021-04-01 12:17:36.574901: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363

These workers just hang and I don't know why. The PS server logs also contain some similar warnings:

PS server 0:

2021-04-01 11:48:01.233 2021-04-01 11:48:01.232972: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.957 2021-04-01 12:14:20.957449: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.133 2021-04-01 12:15:19.132949: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133103: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133131: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133141: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133151: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.133 2021-04-01 12:15:19.133174: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:35.837 2021-04-01 12:17:35.836897: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577149: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577221: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577233: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577312: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577323: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577330: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577339: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577348: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577355: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577363: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.577 2021-04-01 12:17:36.577369: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.579 2021-04-01 12:17:36.579436: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

PS server 1:

2021-04-01 11:46:28.368 2021-04-01 11:46:28.368034: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
2021-04-01 12:14:20.955 2021-04-01 12:14:20.955100: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 42171233248859519
2021-04-01 12:14:20.958 2021-04-01 12:14:20.957943: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130513: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130639: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130660: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.130 2021-04-01 12:15:19.130680: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134413: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134734: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134767: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.134 2021-04-01 12:15:19.134853: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134870: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134908: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134976: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.134993: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135043: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135084: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135120: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135133: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135177: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:15:19.135 2021-04-01 12:15:19.135212: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 44977866739330842
2021-04-01 12:17:35.837 2021-04-01 12:17:35.837269: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578073: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578184: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.578 2021-04-01 12:17:36.578216: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 48095428538436363
2021-04-01 12:17:36.580 2021-04-01 12:17:36.580204: I tensorflow/core/distributed_runtime/worker.cc:204] Cancellation requested for RunGraph.

The logs of the chief and the evaluator look fine. I googled the RecvTensor cancelled warning but could not find enough information. Maybe it is a network problem? But this warning shows up and the workers hang every time I launch the training job, so I don't think it is an occasional issue.

I tried several times and found that there is always exactly one worker that finishes successfully, after which the other workers hang.

Can someone help me figure this out?

I ran the job several more times and found that one training worker always finishes first, and then all the other training workers start to hang. So I guessed that the worker that finishes has some effect on the others. What if the first worker to finish does not actually "finish"? So I added a 10-minute sleep after the training code on every training worker, so that no worker "finishes" early. This worked: while the first worker sits in its sleep, the other training workers keep training normally, and all of them complete within the 10 minutes. That is my workaround so far (a sketch is shown below); I am still open to other solutions.
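
A minimal sketch of that workaround, assuming the sleep is placed right after train_and_evaluate in the launch code above (600 seconds is the value I used):

import time

tf.estimator.train_and_evaluate(GATNE, train_spec, eval_spec)

# Workaround: keep this worker's process alive for another 10 minutes so that
# finishing first does not leave the remaining workers hanging on RecvTensor.
time.sleep(600)

print("Done.")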
