
Trouble connecting to GCP TPU VM

I followed the instructions in "Run TensorFlow on TPU pod slices" to a T to create a Cloud TPU VM and run a custom neural network. Note that I have been able to initialize the Cloud TPUs when running this model on Google Colab, but the model requires more resources than Colab can provide, even when I explicitly manage the memory the code uses.

When I create the VM, I use the following command:

gcloud compute tpus tpu-vm create test-tpu-vm   --zone=us-central1-b   --accelerator-type=v2-8   --version=tpu-vm-tf-2.11.0

Next I log into the instance like so:

gcloud compute tpus tpu-vm ssh test-tpu-vm --zone us-central1-b --project <project_id>

There I clone the code repo as follows:

git clone https://github.com/messerb5467/kaggle-competitions.git

and need to install the pandas library as required by the script:

pip install pandas

After doing all of this, I run the script and get the following issue:

messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$ ./allstate_claims_data_nn.py test-tpu-vm
2022-12-29 23:06:08.675448: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 23:06:08.855095: I tensorflow/core/tpu/tpu_initializer_helper.cc:275] Libtpu path is: libtpu.so
D1229 23:06:09.026751179   12938 config.cc:113]              gRPC EXPERIMENT tcp_frame_size_tuning               OFF (default:OFF)
D1229 23:06:09.026775139   12938 config.cc:113]              gRPC EXPERIMENT tcp_read_chunks                     OFF (default:OFF)
D1229 23:06:09.026782994   12938 config.cc:113]              gRPC EXPERIMENT tcp_rcv_lowat                       OFF (default:OFF)
D1229 23:06:09.026790374   12938 config.cc:113]              gRPC EXPERIMENT peer_state_based_framing            OFF (default:OFF)
D1229 23:06:09.026797554   12938 config.cc:113]              gRPC EXPERIMENT flow_control_fixes                  OFF (default:OFF)
D1229 23:06:09.026804675   12938 config.cc:113]              gRPC EXPERIMENT memory_pressure_controller          OFF (default:OFF)
D1229 23:06:09.026812324   12938 config.cc:113]              gRPC EXPERIMENT periodic_resource_quota_reclamation ON  (default:ON)
D1229 23:06:09.026819471   12938 config.cc:113]              gRPC EXPERIMENT unconstrained_max_quota_buffer_size OFF (default:OFF)
D1229 23:06:09.026826517   12938 config.cc:113]              gRPC EXPERIMENT new_hpack_huffman_decoder           OFF (default:OFF)
D1229 23:06:09.026833747   12938 config.cc:113]              gRPC EXPERIMENT event_engine_client                 OFF (default:OFF)
D1229 23:06:09.026840808   12938 config.cc:113]              gRPC EXPERIMENT monitoring_experiment               ON  (default:ON)
D1229 23:06:09.026847921   12938 config.cc:113]              gRPC EXPERIMENT promise_based_client_call           OFF (default:OFF)
I1229 23:06:09.027065091   12938 ev_epoll1_linux.cc:121]     grpc epoll fd: 7

D1229 23:06:09.027080773   12938 ev_posix.cc:141]            Using polling engine: epoll1
D1229 23:06:09.027107304   12938 dns_resolver_ares.cc:824]   Using ares dns resolver
D1229 23:06:09.027394086   12938 lb_policy_registry.cc:45]   registering LB policy factory for "priority_experimental"
D1229 23:06:09.027405540   12938 lb_policy_registry.cc:45]   registering LB policy factory for "outlier_detection_experimental"
D1229 23:06:09.027414241   12938 lb_policy_registry.cc:45]   registering LB policy factory for "weighted_target_experimental"
D1229 23:06:09.027422457   12938 lb_policy_registry.cc:45]   registering LB policy factory for "pick_first"
D1229 23:06:09.027430746   12938 lb_policy_registry.cc:45]   registering LB policy factory for "round_robin"
D1229 23:06:09.027444142   12938 lb_policy_registry.cc:45]   registering LB policy factory for "ring_hash_experimental"
D1229 23:06:09.027470472   12938 lb_policy_registry.cc:45]   registering LB policy factory for "grpclb"
D1229 23:06:09.027508895   12938 lb_policy_registry.cc:45]   registering LB policy factory for "rls_experimental"
D1229 23:06:09.027531154   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_manager_experimental"
D1229 23:06:09.027539743   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_impl_experimental"
D1229 23:06:09.027548580   12938 lb_policy_registry.cc:45]   registering LB policy factory for "cds_experimental"
D1229 23:06:09.027556928   12938 lb_policy_registry.cc:45]   registering LB policy factory for "xds_cluster_resolver_experimental"
D1229 23:06:09.027565714   12938 certificate_provider_registry.cc:35] registering certificate provider factory for "file_watcher"
I1229 23:06:09.051036639   12938 socket_utils_common_posix.cc:336] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
2022-12-29 23:06:09.090117: I tensorflow/core/tpu/tpu_initializer_helper.cc:225] GetTpuPjrtApi not found
Traceback (most recent call last):
  File "./allstate_claims_data_nn.py", line 145, in <module>
    main()
  File "./allstate_claims_data_nn.py", line 141, in main
    allstate_nn_model = AllStateModelTrainer(args.tpu_name)
  File "./allstate_claims_data_nn.py", line 34, in __init__
    os.environ['TPU_LOAD_LIBRARY'] = 0
  File "/usr/lib/python3.8/os.py", line 680, in __setitem__
    value = self.encodevalue(value)
  File "/usr/lib/python3.8/os.py", line 750, in encode
    raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not int
D1229 23:06:12.209423135   12938 init.cc:190]                grpc_shutdown starts clean-up now
messerb5467@t1v-n-3b61e142-w-0:~/kaggle-competitions/allstate-insurance-claims$

Even if I follow the message and make the 0 a string, it continues on to produce a core dump instead of running as one would expect. Any help would be mighty appreciated.

I've tried using a string for TPU_LOAD_LIBRARY instead of the documented integer:

export TPU_LOAD_LIBRARY=0

I also tried using a TPU_NAME of local instead of test-tpu-vm, since TPU VMs run directly on a TPU host.

Unfortunately, when I do this the errors start spinning out of control and I'm not able to register with the TPU nodes at all, despite the initialization working just fine in Colab.

I imagine it has to be something simple and I'm just missing something somewhere.

This code needs a TPU pod to run, so you need to follow the instructions given here to create the pod. Note that you must use the version tpu-vm-tf-2.11.0-pod, not tpu-vm-tf-2.11.0, when creating the pod.

For example:

gcloud compute tpus tpu-vm create test-tpu-vm \
   --zone=us-central1-a   --accelerator-type=v2-32 \
   --version=tpu-vm-tf-2.11.0-pod

Line 33 should be given the pod name, in your case test-tpu-vm, so the call to the trainer would be ./allstate_claims_data_nn.py test-tpu-vm.

However, line 34 tries to set the environment variable to an integer. This does not work because values assigned to os.environ from inside Python code must be strings. If you set it as a string instead, the TPU would throw errors because TPUs need this environment variable as an integer. So I would recommend skipping the init function, following this, and using

export TPU_NAME=test-tpu-vm
export TPU_LOAD_LIBRARY=0
./allstate_claims_data_nn.py test-tpu-vm

(or modify the code to skip taking tpu name as an argument)
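
The TypeError in the traceback comes from Python itself rather than from the TPU runtime: os.environ only accepts str values. The following is a minimal sketch of that behavior; set_env_flag is a hypothetical helper, not part of the original script, and the sketch says nothing about how the TPU runtime reacts to the value afterwards.

```python
import os


def set_env_flag(name, value):
    """Hypothetical helper: convert the value to str before storing it."""
    os.environ[name] = str(value)


error_message = None

# Assigning an int directly fails the same way as line 34 of the script.
try:
    os.environ["TPU_LOAD_LIBRARY"] = 0
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# Converting to str first is accepted by os.environ.
set_env_flag("TPU_LOAD_LIBRARY", 0)
print(repr(os.environ["TPU_LOAD_LIBRARY"]))
```

A value exported in the shell also reaches the process as the string "0". The practical difference with the export approach is that the variable is already set before the script and its imports start running.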

This will get you past the TPU setup errors. There are further code logic errors starting at line 73, which you can continue to work on.
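
If you do modify the code so the TPU name is optional, one possible shape is sketched below. The helper names resolve_tpu_name and connect_to_tpu are hypothetical, and falling back to "local" is an assumption taken from the question, not something the original script does. The TensorFlow calls are the standard TPU initialization sequence; they are imported lazily, so the name-resolution part runs without a TPU attached.

```python
import os


def resolve_tpu_name(cli_name=None, environ=None):
    """Pick a TPU name: CLI argument first, then TPU_NAME, else 'local'.

    The 'local' fallback is an assumption for single-host TPU VMs.
    """
    environ = os.environ if environ is None else environ
    return cli_name or environ.get("TPU_NAME") or "local"


def connect_to_tpu(tpu_name):
    """Connect to the TPU and return a distribution strategy."""
    import tensorflow as tf  # imported here so the helper above needs no TF

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


# Name resolution only; connect_to_tpu(name) would be called on the TPU VM.
name = resolve_tpu_name("test-tpu-vm", {})
print(name)  # prints test-tpu-vm
```

With this layout, running the script after export TPU_NAME=test-tpu-vm and without an argument would resolve to the same pod name as passing it on the command line.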
