Trouble connecting to GCP TPU VM

I followed the instructions in Run TensorFlow on TPU Pod slices to the letter to create a Cloud TPU VM and run a custom neural network. It's important to note that I have been able to initialize Cloud TPUs when running this model on Google Colab, but I need more resources than Colab can provide, even when explicitly managing the memory used by the code.

When I create the VM, I use the following command:

gcloud compute tpus tpu-vm create test-tpu-vm \
  --zone=us-central1-b \
  --accelerator-type=v2-8 \
  --version=tpu-vm-tf-2.11.0

Next I log into the instance like so:

gcloud compute tpus tpu-vm ssh test-tpu-vm --zone us-central1-b --project <project_id>

where I clone the code repo as follows:

git clone https://github.com/messerb5467/kaggle-competitions.git

and install the pandas library required by the script:

pip install pandas

After doing all of this, I run the script and get the following issue:

messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$ ./allstate_claims_data_nn.py test-tpu-vm
2022-12-29 23:06:08.675448: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 23:06:08.855095: I tensorflow/core/tpu/tpu_initializer_helper.cc:275] Libtpu path is: libtpu.so
D1229 23:06:09.026751179   12938 config.cc:113]              gRPC EXPERIMENT tcp_frame_size_tuning               OFF (default:OFF)
D1229 23:06:09.026775139   12938 config.cc:113]              gRPC EXPERIMENT tcp_read_chunks                     OFF (default:OFF)
D1229 23:06:09.026782994   12938 config.cc:113]              gRPC EXPERIMENT tcp_rcv_lowat                       OFF (default:OFF)
D1229 23:06:09.026790374   12938 config.cc:113]              gRPC EXPERIMENT peer_state_based_framing            OFF (default:OFF)
D1229 23:06:09.026797554   12938 config.cc:113]              gRPC EXPERIMENT flow_control_fixes                  OFF (default:OFF)
D1229 23:06:09.026804675   12938 config.cc:113]              gRPC EXPERIMENT memory_pressure_controller          OFF (default:OFF)
D1229 23:06:09.026812324   12938 config.cc:113]              gRPC EXPERIMENT periodic_resource_quota_reclamation ON  (default:ON)
D1229 23:06:09.026819471   12938 config.cc:113]              gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1229 23:06:09.026826517   12938 config.cc:113]              gRPC EXPERIMENT new_hpack_huffman_decoder           OFF (default:OFF)
D1229 23:06:09.026833747   12938 config.cc:113]              gRPC EXPERIMENT event_engine_client                 OFF (default:OFF)
D1229 23:06:09.026840808   12938 config.cc:113]              gRPC EXPERIMENT monitoring_experiment               ON  (default:ON)
D1229 23:06:09.026847921   12938 config.cc:113]              gRPC EXPERIMENT promise_based_client_call           OFF (default:OFF)
I1229 23:06:09.027065091   12938 ev_epoll1_linux.cc:121]     grpc epoll fd: 7
D1229 23:06:09.027080773   12938 ev_posix.cc:141]            Using polling engine: epoll1
D1229 23:06:09.027107304   12938 dns_resolver_ares.cc:824]   Using ares dns resolver
D1229 23:06:09.027394086   12938 lb_policy_registry.cc:45]   registering LB policy factory for "priority_experimental"
D1229 23:06:09.027405540   12938 lb_policy_registry.cc:45]   registering LB policy factory for "outlier_detection_experimental"
D1229 23:06:09.027414241   12938 lb_policy_registry.cc:45]   registering LB policy factory for "weighted_target_experimental"
D1229 23:06:09.027422457   12938 lb_policy_registry.cc:45]   registering LB policy factory for "pick_first"
D1229 23:06:09.027430746   12938 lb_policy_registry.cc:45]   registering LB policy factory for "round_robin"
D1229 23:06:09.027444142   12938 lb_policy_registry.cc:45]   registering LB policy factory for "ring_hash_experimental"
D1229 23:06:09.027470472   12938 lb_policy_registry.cc:45]   registering LB policy factory for "grpclb"
D1229 23:06:09.027508895   12938 lb_policy_registry.cc:45]   registering LB policy factory for "rls_experimental"
D1229 23:06:09.027531154   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_manager_experimental"
D1229 23:06:09.027539743   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_impl_experimental"
D1229 23:06:09.027548580   12938 lb_policy_registry.cc:45]   registering LB policy factory for "cds_experimental"
D1229 23:06:09.027556928   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_resolver_experimental"
D1229 23:06:09.027565714   12938 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1229 23:06:09.051036639   12938 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
2022-12-29 23:06:09.090117: I tensorflow/core/tpu/tpu_initializer_helper.cc:225] GetTpuPjrtApi not found
Traceback (most recent call last):
  File "./allstate_claims_data_nn.py", line 145, in <module>
    main()
  File "./allstate_claims_data_nn.py", line 141, in main
    allstate_nn_model = AllStateModelTrainer(args.tpu_name)
  File "./allstate_claims_data_nn.py", line 34, in __init__
    os.environ['TPU_LOAD_LIBRARY'] = 0
  File "/usr/lib/python3.8/os.py", line 680, in __setitem__
    value = self.encodevalue(value)
  File "/usr/lib/python3.8/os.py", line 750, in encode
    raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
D1229 23:06:12.209423135   12938 init.cc:190]                grpc_shutdown starts clean-up now
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$

Even if I follow the message and make the 0 a string, the script goes on to produce a core dump instead of running as one would expect. Any help would be mighty appreciated.
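The TypeError itself is easy to reproduce in isolation: `os.environ` only accepts string values, so the assignment on line 34 of the script fails before any TPU setup even happens. A minimal sketch:

```python
import os

# os.environ values must be strings; assigning an int raises TypeError,
# which is exactly the error shown in the traceback above.
try:
    os.environ['TPU_LOAD_LIBRARY'] = 0  # what line 34 of the script does
except TypeError as err:
    print(err)  # str expected, not int

# The assignment itself succeeds once the value is quoted:
os.environ['TPU_LOAD_LIBRARY'] = '0'
print(os.environ['TPU_LOAD_LIBRARY'])  # 0
```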

I've tried using a string for TPU_LOAD_LIBRARY instead of the documented integer:

export TPU_LOAD_LIBRARY=0

I've also tried using a TPU_NAME of local instead of test-tpu-vm, since TPU VMs run directly on a TPU host.

Unfortunately, when following this, the errors start spinning out of control and I'm not able to register with the TPU nodes at all, despite the initialization working just fine in Colab.

I imagine it has to be something simple and I'm just missing something somewhere.

This code needs a TPU Pod to run, so you would need to follow the instructions given here to create the Pod. Note that you need to use the version tpu-vm-tf-2.11.0-pod, not tpu-vm-tf-2.11.0, when creating the Pod.

For example,

gcloud compute tpus tpu-vm create test-tpu-vm \
   --zone=us-central1-a   --accelerator-type=v2-32 \
   --version=tpu-vm-tf-2.11.0-pod

For line 33 we should pass the Pod name, in your case test-tpu-vm. So the call to the trainer would be ./allstate_claims_data_nn.py test-tpu-vm.

However, line 34 tries to set the environment variable to an integer. This will not work, because os.environ values must be strings when set from inside Python code. However, if you set it as a string, the TPU would throw errors, because TPUs need this environment variable as an integer. So I would recommend skipping the init function, following this, and using

export TPU_NAME=test-tpu-vm
export TPU_LOAD_LIBRARY=0
./allstate_claims_data_nn.py test-tpu-vm

(or modify the code to skip taking the TPU name as an argument)
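If you do keep the command-line argument, one way to reconcile it with the TPU_NAME environment variable is a small helper that prefers the environment variable and falls back to the argument. This is a hypothetical sketch (resolve_tpu_name is not part of the original script):

```python
import os
import sys

def resolve_tpu_name(argv=sys.argv, environ=os.environ):
    """Prefer $TPU_NAME, fall back to the first CLI argument."""
    if environ.get('TPU_NAME'):
        return environ['TPU_NAME']
    if len(argv) > 1:
        return argv[1]
    raise SystemExit('usage: allstate_claims_data_nn.py <tpu-name>')

# On a TPU VM the runtime is local to the host, so 'local' is a valid name:
print(resolve_tpu_name(['prog'], {'TPU_NAME': 'local'}))  # local
print(resolve_tpu_name(['prog', 'test-tpu-vm'], {}))      # test-tpu-vm
```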

This will get you past the TPU setup errors; there will be further code logic errors from line 73 onward, which you can continue to work on.

