Trouble connecting to GCP TPU VM
I followed the instructions to create a Cloud TPU VM and run a custom neural network as directed by Run TensorFlow on TPU Pod slices, to a T. It's important to note that I have been able to initialize Cloud TPUs when running this model on Google Colab, but I need more resources than can be provided there, even when explicitly managing memory in the code.
When I create the VM, I use the following command:
gcloud compute tpus tpu-vm create test-tpu-vm --zone=us-central1-b --accelerator-type=v2-8 --version=tpu-vm-tf-2.11.0
Next I log into the instance like so:
gcloud compute tpus tpu-vm ssh test-tpu-vm --zone us-central1-b --project <project_id>
There I clone down the code repo as follows:
git clone https://github.com/messerb5467/kaggle-competitions.git
and need to install the pandas library as required by the script:
pip install pandas
After doing all of this, I run the script and get the following issue:
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$ ./allstate_claims_data_nn.py test-tpu-vm
2022-12-29 23:06:08.675448: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 23:06:08.855095: I tensorflow/core/tpu/tpu_initializer_helper.cc:275] Libtpu path is: libtpu.so
D1229 23:06:09.026751179 12938 config.cc:113] gRPC EXPERIMENT tcp_frame_size_tuning OFF (default:OFF)
D1229 23:06:09.026775139 12938 config.cc:113] gRPC EXPERIMENT tcp_read_chunks OFF (default:OFF)
D1229 23:06:09.026782994 12938 config.cc:113] gRPC EXPERIMENT tcp_rcv_lowat OFF (default:OFF)
D1229 23:06:09.026790374 12938 config.cc:113] gRPC EXPERIMENT peer_state_based_framing OFF (default:OFF)
D1229 23:06:09.026797554 12938 config.cc:113] gRPC EXPERIMENT flow_control_fixes OFF (default:OFF)
D1229 23:06:09.026804675 12938 config.cc:113] gRPC EXPERIMENT memory_pressure_controller OFF (default:OFF)
D1229 23:06:09.026812324 12938 config.cc:113] gRPC EXPERIMENT periodic_resource_quota_reclamation ON (default:ON)
D1229 23:06:09.026819471 12938 config.cc:113] gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1229 23:06:09.026826517 12938 config.cc:113] gRPC EXPERIMENT new_hpack_huffman_decoder OFF (default:OFF)
D1229 23:06:09.026833747 12938 config.cc:113] gRPC EXPERIMENT event_engine_client OFF (default:OFF)
D1229 23:06:09.026840808 12938 config.cc:113] gRPC EXPERIMENT monitoring_experiment ON (default:ON)
D1229 23:06:09.026847921 12938 config.cc:113] gRPC EXPERIMENT promise_based_client_call OFF (default:OFF)
I1229 23:06:09.027065091 12938 ev_epoll1_linux.cc:121] grpc epoll fd: 7
D1229 23:06:09.027080773 12938 ev_posix.cc:141] Using polling engine: epoll1
D1229 23:06:09.027107304 12938 dns_resolver_ares.cc:824] Using ares dns resolver
D1229 23:06:09.027394086 12938 lb_policy_registry.cc:45] registering LB policy factory for "priority_experimental"
D1229 23:06:09.027405540 12938 lb_policy_registry.cc:45] registering LB policy factory for "outlier_detection_experimental"
D1229 23:06:09.027414241 12938 lb_policy_registry.cc:45] registering LB policy factory for "weighted_target_experimental"
D1229 23:06:09.027422457 12938 lb_policy_registry.cc:45] registering LB policy factory for "pick_first"
D1229 23:06:09.027430746 12938 lb_policy_registry.cc:45] registering LB policy factory for "round_robin"
D1229 23:06:09.027444142 12938 lb_policy_registry.cc:45] registering LB policy factory for "ring_hash_experimental"
D1229 23:06:09.027470472 12938 lb_policy_registry.cc:45] registering LB policy factory for "grpclb"
D1229 23:06:09.027508895 12938 lb_policy_registry.cc:45] registering LB policy factory for "rls_experimental"
D1229 23:06:09.027531154 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_manager_experimental"
D1229 23:06:09.027539743 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_impl_experimental"
D1229 23:06:09.027548580 12938 lb_policy_registry.cc:45] registering LB policy factory for "cds_experimental"
D1229 23:06:09.027556928 12938 lb_policy_registry.cc:45] registering LB policy factory for "xds_cluster_resolver_experimental"
D1229 23:06:09.027565714 12938 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1229 23:06:09.051036639 12938 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
2022-12-29 23:06:09.090117: I tensorflow/core/tpu/tpu_initializer_helper.cc:225] GetTpuPjrtApi not found
Traceback (most recent call last):
File "./allstate_claims_data_nn.py", line 145, in <module>
main()
File "./allstate_claims_data_nn.py", line 141, in main
allstate_nn_model = AllStateModelTrainer(args.tpu_name)
File "./allstate_claims_data_nn.py", line 34, in __init__
os.environ['TPU_LOAD_LIBRARY'] = 0
File "/usr/lib/python3.8/os.py", line 680, in __setitem__
value = self.encodevalue(value)
File "/usr/lib/python3.8/os.py", line 750, in encode
raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
D1229 23:06:12.209423135 12938 init.cc:190] grpc_shutdown starts clean-up now
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$
Even if I follow the message and make the 0 a string, it goes on to produce a core dump instead of running as one would expect. Any help would be mightily appreciated.
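For reference, the TypeError itself comes from os.environ only accepting string values. A minimal sketch of the change the traceback is asking for (this fixes the exception, though as noted it doesn't fix the underlying TPU problem):

```python
import os

# os.environ values must be strings; assigning the integer 0 raises
# "TypeError: str expected, not int", exactly as in the traceback above.
os.environ['TPU_LOAD_LIBRARY'] = '0'  # note the quotes

# The value round-trips as the string '0', not the integer 0.
print(os.environ['TPU_LOAD_LIBRARY'])
```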
I've tried using a string for TPU_LOAD_LIBRARY instead of the documented integer:
export TPU_LOAD_LIBRARY=0
I've also tried using a TPU_NAME of local instead of test-tpu-vm, since TPU VMs run directly on a TPU host.
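In TensorFlow code, that attempt corresponds to pointing the cluster resolver at 'local' rather than at the TPU name (a sketch of the standard pattern, not the script's actual init code):

```python
import tensorflow as tf

# On a single TPU VM the TPU chips are attached to the local host,
# so the resolver is given tpu='local'; on a Pod slice you would
# pass the TPU name instead.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='local')
```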
Unfortunately, when following this, the errors start spinning out of control and I'm not able to register with the TPU nodes at all, despite the initialization working just fine in Colab.
I imagine it has to be something simple and I'm just missing something somewhere.
This code uses a TPU Pod to run. So, you would need to follow the instructions given here to create the Pod. Note that you need to use the version tpu-vm-tf-2.11.0-pod and not tpu-vm-tf-2.11.0 when creating the Pod. For example:
gcloud compute tpus tpu-vm create test-tpu-vm \
--zone=us-central1-a --accelerator-type=v2-32 \
--version=tpu-vm-tf-2.11.0-pod
For line 33 you should pass the Pod name, in your case test-tpu-vm. So, the call to the trainer would be ./allstate_claims_data_nn.py test-tpu-vm.
However, on line 34 the code is trying to set the environment variable with an integer. This will not work, because the value needs to be a string when setting an environment variable from inside Python code on Ubuntu. However, if you set it as a string, the TPU would throw errors, because TPUs need this environment variable as an integer. So, I would recommend skipping the init function, following this, and using
export TPU_NAME=test-tpu-vm
export TPU_LOAD_LIBRARY=0
./allstate_claims_data_nn.py test-tpu-vm
(or modify the code to skip taking the TPU name as an argument)
This will get you past the TPU setup errors; there will be more code-logic errors from line 73 onward, which you can continue to work on.
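Once the Pod is created with the -pod image, a typical TPU Pod initialization in TensorFlow looks roughly like this (a sketch of the standard pattern under the assumption that the script builds a Keras model; test-tpu-vm is the Pod name from the gcloud command above, and I believe TPUClusterResolver also falls back to the TPU_NAME environment variable when no name is passed):

```python
import tensorflow as tf

# Resolve the Pod by name, connect, and initialize the TPU system.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='test-tpu-vm')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Build and compile the model inside the strategy scope so its
    # variables are replicated across the Pod's TPU cores.
    ...
```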