
No such object error when saving TensorFlow model trained on Google Cloud AI Platform to a Google Cloud Storage Bucket

I am training a model using TensorFlow on Google Cloud's AI Platform and while the training itself proceeds nicely, I am unable to save the finished model in SavedModel format to my cloud storage bucket. I know the bucket is set up properly because at the beginning of training I download my training data from that very same bucket. Here is the code I use to save my model:

import os

SAVE_PATH = os.path.join("gs://", 'machine-learning-ebay', 'job-dir')  # gs://machine-learning-ebay/job-dir
linear_model.save(SAVE_PATH)

Where 'machine-learning-ebay' is the storage bucket and 'job-dir' is a folder within that storage bucket.
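If it helps, a minimal check of the bucket from inside the job could look like this (a sketch only, relying on TensorFlow's built-in gs:// filesystem support; the bucket and folder names are the ones above):

import tensorflow as tf

# Reachability check through TensorFlow's gs:// filesystem support,
# so it runs inside the AI Platform training job without an extra GCS client.
print(tf.io.gfile.exists("gs://machine-learning-ebay"))   # should print True
print(tf.io.gfile.listdir("gs://machine-learning-ebay"))  # top-level contents of the bucket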

I receive the following error on the job description page in Google Cloud:

Traceback (most recent call last):
  [...]
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1219, in save
    file_prefix_tensor, object_graph_tensor, options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1164, in _save_cached_when_graph_building
    save_op = saver.save(file_prefix, options=options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 300, in save
    return save_fn()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/functional_saver.py", line 287, in save_fn
    sharded_prefixes, file_prefix, delete_old_dirs=True)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 504, in merge_v2_checkpoints
    delete_old_dirs=delete_old_dirs, name=name, ctx=_ctx)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 528, in merge_v2_checkpoints_eager_fallback
    attrs=_attrs, ctx=ctx, name=name)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: Error executing an HTTP request: HTTP response code 404 with body '{
  "error": {
    "code": 404,
    "message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
    "errors": [
      {
        "message": "No such object: machine-learning-ebay/job-dir/variables/variables_temp/part-00000-of-00001.data-00000-of-00001",
        "domain": "global",
        "reason": "notFound"
      }
    ]
  }
}

Any help is greatly appreciated; the deadline for this project is today.

Following the code in Google's training example ( https://github.com/GoogleCloudPlatform/cloudml-samples/blob/main/census/tf-keras/trainer/task.py ) and a GitHub issue suggesting that timestamping the output folders avoids overwriting problems ( https://github.com/kubeflow/pipelines/issues/2171 ), I changed my export code to the following:

from datetime import datetime

current_time = datetime.now().strftime("%H.%M.%S")  # timestamp so each run exports to a fresh folder
tf.compat.v1.keras.experimental.export_saved_model(linear_model, 'gs://machine-learning-ebay/job-dir/keras-export' + current_time)

This resolved the errors I was facing, and the model exported successfully.
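As a side note, tf.compat.v1.keras.experimental.export_saved_model is a deprecated compatibility shim. On newer TensorFlow 2.x versions the same timestamping idea should work with the plain Keras save API, which writes the SavedModel format directly to a gs:// path (a sketch I have not verified on AI Platform):

from datetime import datetime

# Same idea: give every run its own export folder so nothing is overwritten.
export_dir = "gs://machine-learning-ebay/job-dir/keras-export" + datetime.now().strftime("%H.%M.%S")
linear_model.save(export_dir)  # SavedModel format by default in TF 2.x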
