
Failing to submit a job to f1-micro with the gcloud ml-engine (or ai-platform) command in Jupyter Notebook

I am trying to submit a Google Cloud job that trains a CNN model on the MNIST digits. Since I am new to GCP, I want to run this job on f1-micro machines first for practice, but without success. I ran into two issues along the way.

Here is my setup: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TF 1.13.0. At first my model worked well without any gcloud command involved. Then I packaged the files into a module, as the GCP course suggested, and used the gcloud command for local training. The cell seems stuck and does nothing until I close and halt the ipynb file. Training starts right after that, and the results are correct when I monitor them in TensorBoard. What do I need to do to make it run normally from the cell without closing the notebook? By the way, the same command runs in a terminal without this issue.

Second issue: I then tried to submit the job to a Google Cloud machine. I created a VM instance with f1-micro just for practice, since it comes with a lot of free hours, but my command options are not accepted. I tried a couple of formats for the machine type and cannot get it right. Also, how do I connect the job to the instance I created?

Any advice? Thanks! The code is below.

#1.local submission lines


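import shutil

OUTDIR='trained_test'
INPDIR=r'..\data'  # raw string so the backslash is kept literally
shutil.rmtree(path = OUTDIR, ignore_errors = True)  # start from a clean output directory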

!gcloud ai-platform local train \
    --module-name=trainer.task \
    --package-path=trainer \
    -- \
    --output_dir=$OUTDIR \
    --input_dir=$INPDIR \
    --epochs=2 \
    --learning_rate=0.001 \
    --batch_size=100


#2. submit to compute engine

from datetime import datetime

# BUCKET and REGION are defined earlier in the notebook
OUTDIR='gs://'+BUCKET+'/digit/train_01'
INPDIR='gs://'+BUCKET+'/digit/data'
JOBNAME='kaggle_digit_01_'+datetime.now().strftime("%Y%m%d_%H%M%S")

!gcloud ai-platform jobs submit training $JOBNAME \
    --region=$REGION \
    --module-name=trainer.task \
    --package-path=trainer \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://$BUCKET \
    --scale-tier=custom \
    --master-machine-type=zones/us-central1-a/machineTypes/f1-micro \
    --runtime-version 1.13 \
    -- \
    --output_dir=$OUTDIR \
    --input_dir=$INPDIR \
    --epochs=5 --learning_rate=0.001 --batch_size=100

Error message:

ERROR: (gcloud.ai-platform.jobs.submit.training) INVALID_ARGUMENT: Field: master_type Error: The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: 'The specified machine type is not supported: zones/us-central1-a/machineTypes/f1-micro'
    field: master_type

Update:

Changing the machine type does work:

--scale-tier=CUSTOM \
--master-machine-type=n1-standard-4 \

I also put the following import at the beginning so that the notebook recognizes the file format, such as $jobname...

import gcsfs

By the way, --job-dir doesn't seem to matter.

However, local training still has the same issue: I need to close and halt the notebook to kick off the training. Could anyone suggest a fix?
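The thread does not establish why the local training cell hangs. One guess, and it is only an assumption, is that the child process waits on the notebook's standard input. A minimal sketch of something to try is to launch the command through Python's subprocess module with standard input detached and the output streamed back into the cell. The helper name run_and_stream is hypothetical, and this is not confirmed to fix the hang.

```python
import subprocess
import sys


def run_and_stream(cmd):
    """Run cmd (a list of arguments) with stdin detached.

    Echoes the combined stdout/stderr line by line as it arrives and
    returns (exit_code, list_of_output_lines).
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,   # the child cannot block waiting for input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into the same stream
        universal_newlines=True,    # text mode; works on Python 3.6
    )
    lines = []
    for line in proc.stdout:
        print(line, end="")
        lines.append(line.rstrip("\n"))
    proc.stdout.close()
    return proc.wait(), lines


# Hypothetical use for the local training command from the question.
# On Windows the launcher may need to be spelled "gcloud.cmd" (assumption).
# run_and_stream([
#     "gcloud", "ai-platform", "local", "train",
#     "--module-name=trainer.task", "--package-path=trainer",
#     "--", "--output_dir=trained_test", r"--input_dir=..\data",
#     "--epochs=2", "--learning_rate=0.001", "--batch_size=100",
# ])
```

Passing the command as a list avoids shell quoting issues, and streaming line by line keeps the training log visible while the job runs.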

f1-micro is not supported by AI Platform Training. Here is the list of supported machines. Also, you don't need to specify the zone, just the machine type, i.e. --master-machine-type=n1-standard-4
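As a rough illustration of the point about the zone, the sketch below builds the submit command with a bare machine type. The helper names bare_machine_type and submit_args are hypothetical, and the flags are simply the ones used in the question. Stripping the zone prefix only fixes the format: f1-micro would still be rejected, since it is not a supported type.

```python
def bare_machine_type(value):
    """Drop any 'zones/<zone>/machineTypes/' prefix and keep only the type name."""
    return value.rsplit("/", 1)[-1]


def submit_args(job_name, region, bucket, machine_type):
    """Build the argument list for 'gcloud ai-platform jobs submit training'.

    Paths and trainer flags mirror the ones in the question.
    """
    out_dir = "gs://{}/digit/train_01".format(bucket)
    inp_dir = "gs://{}/digit/data".format(bucket)
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        "--region={}".format(region),
        "--module-name=trainer.task",
        "--package-path=trainer",
        "--job-dir={}".format(out_dir),
        "--staging-bucket=gs://{}".format(bucket),
        "--scale-tier=CUSTOM",
        "--master-machine-type={}".format(bare_machine_type(machine_type)),
        "--runtime-version=1.13",
        "--",  # everything after this is passed to trainer.task
        "--output_dir={}".format(out_dir),
        "--input_dir={}".format(inp_dir),
        "--epochs=5", "--learning_rate=0.001", "--batch_size=100",
    ]
```

The resulting list can be printed for inspection or handed to subprocess.run to submit the job.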
