Failed to connect to TensorFlow master: TPU worker may not be ready or TensorFlow master address is incorrect

Question

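I have enrolled in the TPU Research Cloud (TRC) program for the third time in two years. Right now I can barely create a single preemptible v3-8 TPU. Previously, I could allocate five non-preemptible v3-8 TPUs without difficulty. Whichever way it is allocated (preemptible or non-preemptible), the TPU is listed as READY and HEALTHY. However, when I try to access it from my pretraining script, I get this error, which I have never encountered before: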

Failed to connect to the Tensorflow master. The TPU worker may not be ready (still scheduling), or the Tensorflow master address is incorrect.

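I know the TensorFlow master address is correct, and I have checked that the TPU is healthy and ready. I have also double-checked that my code creates the TensorFlow session correctly and specifies the TPU address.

What causes this error message, and how can I troubleshoot and fix it?

I also tried this code from https://www.tensorflow.org/guide/tpu. Note that I am using Google Cloud Platform, not Colab.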

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='pretrain-1')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

However, I am stuck at:

INFO:tensorflow:Initializing the TPU system: pretrain-1

I was expecting something like this instead:

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
2022-12-20 13:08:56.187870: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.99.59.162:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.99.59.162:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Finished initializing TPU system.
All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU')]

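Edit: I successfully accessed a TPU with the same configuration from a new TPU Research Cloud (TRC) account. However, the problem persists in the previous TRC account. I suspect this may be an issue with the Google Cloud Platform (GCP) configuration.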

Answer 1

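Try running the code locally. Use ctpu up and download the code onto the VM it spins up for you; it may well work fine there. (For example, pause the VM with ctpu pause and then bring it up again with ctpu up.)

See also the official "Troubleshooting TensorFlow - TPU" documentation for more information.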
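A network-level check can complement the troubleshooting guide. The error message names two possible causes (the worker is not ready, or the master address is wrong), and both show up as a gRPC endpoint that cannot be reached. The following is a minimal sketch, not an official diagnostic: the helper names parse_grpc_address and can_reach are hypothetical, and the default port 8470 is assumed from the grpc://10.99.59.162:8470 address in the expected log output above.

```python
import socket
from urllib.parse import urlparse


def parse_grpc_address(address, default_port=8470):
    """Split an address such as 'grpc://10.99.59.162:8470' into (host, port).

    The default port 8470 is an assumption taken from the log output in the question.
    """
    # urlparse only fills in hostname/port when a scheme separator is present.
    parsed = urlparse(address if "://" in address else "grpc://" + address)
    return parsed.hostname, parsed.port or default_port


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

Calling can_reach(*parse_grpc_address("grpc://10.99.59.162:8470")) from the machine that runs the training script returns False if the TCP connection cannot be opened, which would point to a networking or configuration problem rather than to the TensorFlow code. A True result only shows that the port accepts connections; it does not prove the TPU runtime itself is healthy.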

Answer 2

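I solved this by deleting all TPU and VM instances and then disabling and re-enabling all the APIs.

The issue may have been related to a VPN connection to a GPU cluster that was active while the services were being enabled.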
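The reset described in this answer could be scripted roughly as follows. This is a dry-run sketch under stated assumptions, not the exact commands the answerer ran: the function name build_reset_commands is hypothetical, YOUR_ZONE is a placeholder, and tpu.googleapis.com is assumed to be the relevant API because the answer does not say which APIs were toggled. VM deletion is left out. The script only prints the gcloud commands so they can be reviewed before anything is deleted.

```python
import shlex


def build_reset_commands(tpu_name, zone, services=("tpu.googleapis.com",)):
    """Return gcloud commands (as argument lists) for the reset, without running them.

    The default `services` tuple is an assumption; the answer does not list the APIs.
    """
    # Delete the TPU node first; --quiet skips the interactive confirmation.
    commands = [["gcloud", "compute", "tpus", "delete", tpu_name, f"--zone={zone}", "--quiet"]]
    # Disable every listed API, then enable each one again.
    for service in services:
        commands.append(["gcloud", "services", "disable", service])
    for service in services:
        commands.append(["gcloud", "services", "enable", service])
    return commands


# 'pretrain-1' is the TPU name from the question; YOUR_ZONE is a placeholder.
for command in build_reset_commands("pretrain-1", "YOUR_ZONE"):
    print(shlex.join(command))
```

Printing rather than executing is deliberate, since deleting TPUs and disabling APIs is disruptive and should be confirmed by hand.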

2 answers

Answer 1
0 2022-12-30 12:53:01

Answer 2
0 Accepted 2023-01-01 12:47:38