
TPU Based Tuning for CloudML

Are TPUs supported for distributed hyperparameter search? I'm using the tensor2tensor library, which supports CloudML for hyperparameter search, i.e., the following works for me to conduct hyperparameter search for a language model on GPUs:

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=languagemodel_lm1b8k_packed \
  --train_steps=100000 \
  --eval_steps=8 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --cloud_mlengine \
  --hparams_range=transformer_base_range \
  --autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
  --autotune_maximize \
  --autotune_max_trials=100 \
  --autotune_parallel_trials=3

However, when I try to utilize TPUs as in the following:

t2t-trainer \
  --problem=languagemodel_lm1b8k_packed \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --train_steps=100000 \
  --use_tpu=True \
  --cloud_mlengine_master_type=cloud_tpu \
  --cloud_mlengine \
  --hparams_range=transformer_base_range \
  --autotune_objective='metrics-languagemodel_lm1b8k_packed/neg_log_perplexity' \
  --autotune_maximize \
  --autotune_max_trials=100 \
  --autotune_parallel_trials=5

I get the error:

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/******/jobs?alt=json returned "Field: master_type Error: The specified machine type for master is not supported in TPU training jobs: cloud_tpu"

One of the authors of the tensor2tensor library here. Yup, this was indeed a bug and is now fixed. Thanks for spotting. We'll release a fixed version on PyPI this week, and you can of course clone and install locally from master until then.

The command you used should work just fine now. 您使用的命令现在应该可以正常工作。

I believe there is a bug in the tensor2tensor library: https://github.com/tensorflow/tensor2tensor/blob/6a7ef7f79f56fdcb1b16ae76d7e61cb09033dc4f/tensor2tensor/utils/cloud_mlengine.py#L281

It's the worker_type (and not the master_type) that needs to be set for Cloud ML Engine.
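To make the distinction concrete, here is a minimal sketch of what the Cloud ML Engine v1 `TrainingInput` portion of a job spec might look like for a TPU job. The TPU is attached as a worker, so `workerType` carries `cloud_tpu` while `masterType` stays a regular VM type. The bucket path and job ID below are hypothetical placeholders, and the exact field set is an assumption based on the public API schema, not the tensor2tensor internals:

```python
# Hypothetical sketch of a Cloud ML Engine job spec for TPU training.
# Field names follow the v1 TrainingInput schema; values are placeholders.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "standard",   # ordinary VM that drives the job
    "workerType": "cloud_tpu",  # the TPU is a *worker*, not the master
    "workerCount": 1,
    "region": "us-central1",
    "pythonModule": "tensor2tensor.bin.t2t_trainer",
    "packageUris": ["gs://my-bucket/t2t.tar.gz"],  # hypothetical path
}

job_spec = {
    "jobId": "t2t_transformer_tpu",  # hypothetical job ID
    "trainingInput": training_input,
}

# The bug above amounted to putting "cloud_tpu" into masterType instead
# of workerType, which the API rejects with the HTTP 400 shown earlier.
print(job_spec["trainingInput"]["workerType"])
```

A spec like this would then be submitted via `projects.jobs.create`; the point is only that swapping `masterType`/`workerType` reproduces the 400 error in the question.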

To answer the original question though, yes, HP Tuning should be supported for TPUs, but the error above is orthogonal to that. 但是,要回答原始问题,可以,TPU应该支持HP Tuning,但是上面的错误与此正交。
