
Cloud Storage Buckets for PyTorch

For a particular task I'm working on, I have a dataset that is about 25 GB. I'm still experimenting with several preprocessing methods and definitely don't have my data in its final form yet. I'm not sure what the common workflow is for this sort of problem, so here is what I'm thinking:

  1. Copy dataset from bucket storage to Compute Engine machine SSD (maybe use around 50 GB SSD) using gcsfuse.
  2. Apply various preprocessing operations as an experiment.
  3. Run training with PyTorch on the data stored on the local disk (SSD)
  4. Copy newly processed data back to storage bucket with gcsfuse if it was successful.
  5. Upload results and delete the persistent disk that was used during training.
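Steps 1 and 4 above can be sketched in a few lines, since gcsfuse exposes the bucket as an ordinary directory and standard file operations work on it. This is only a minimal sketch: the function name `stage_dataset` and both paths in the usage note are hypothetical and not taken from any particular setup.

```python
import shutil
from pathlib import Path


def stage_dataset(source_dir: str, target_dir: str) -> int:
    """Copy a directory tree and return the number of files now under target_dir.

    source_dir could be a gcsfuse mount point and target_dir a folder on the
    local SSD (step 1), or the other way round (step 4).
    """
    src = Path(source_dir)
    dst = Path(target_dir)
    # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run into an
    # existing folder, overwriting files that share a name.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())
```

For example, `stage_dataset("/mnt/gcs-bucket/dataset", "/mnt/disks/ssd/dataset")` would pull the data onto the SSD before preprocessing, assuming the bucket is mounted at the first (hypothetical) path.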

The alternative approach is this:

  1. Run the processing operations on the data within the Cloud Bucket itself using the mounted directory with gcsfuse
  2. Run training with PyTorch directly on the mounted gcsfuse Bucket directory, using a compute engine instance with very limited storage.
  3. Upload results and delete the Compute Engine instance.

Which of these approaches is recommended? Which will incur fewer charges, and which is used most often for this kind of operation? Is there a different workflow that I'm not seeing here?

On the billing side, the pricing is the same either way: according to the documentation, gcsfuse operations are charged like those of any other Cloud Storage interface. I don't know exactly how you plan to train on the data, but if you perform more than one operation on each file, it is better to download the files, train locally, and upload the final result, which amounts to 2 object operations per file (one download and one upload). If instead you read or change a file more than once during training through the mount, every one of those accesses is an object operation. On the workflow side, the one you proposed looks good to me.
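To make the comparison concrete, here is a rough back-of-the-envelope count of object operations for the two workflows. It is only a sketch under simplifying assumptions: each access to a file through the mount is counted as exactly one operation, no caching is considered, and the function name and the example numbers are made up for illustration. It counts operations only and says nothing about actual prices.

```python
def estimate_object_operations(num_files: int, accesses_per_file: int) -> dict:
    """Estimate object operations for the local-copy and mounted workflows.

    accesses_per_file is the total number of reads and writes each file
    receives during preprocessing and training (e.g. one read per epoch).
    """
    # Local copy: one download plus one upload per file, no matter how
    # often the file is touched on the local disk in between.
    local_copy = 2 * num_files
    # Mounted bucket (assumption: no caching): every access to every file
    # goes to Cloud Storage and counts as one operation.
    mounted = num_files * accesses_per_file
    return {"local_copy": local_copy, "mounted": mounted}


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 files, each read once per epoch for 20 epochs.
    print(estimate_object_operations(num_files=10_000, accesses_per_file=20))
```

With these made-up numbers the local-copy workflow needs 20,000 operations and the mounted one 200,000, so the gap grows with every additional pass over the data.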


 