
Cloud Storage Buckets for PyTorch

For a particular task I'm working on I have a dataset that is about 25 GB. I'm still experimenting with several methods of preprocessing and definitely don't have my data in its final form yet. I'm not sure what the common workflow is for this sort of problem, so here is what I'm thinking:

  1. Copy the dataset from bucket storage to the Compute Engine machine's SSD (maybe around 50 GB) using gcsfuse.
  2. Apply various preprocessing operations as an experiment.
  3. Run training with PyTorch on the data stored on the local disk (SSD), roughly as sketched after this list.
  4. Copy the newly processed data back to the storage bucket with gcsfuse if it was successful.
  5. Upload results and delete the persistent disk that was used during training.
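To make steps 1 and 3 concrete, here is a rough sketch of what I have in mind. The paths `/mnt/gcs/my-dataset` and `/mnt/disks/ssd/my-dataset` are placeholders for the gcsfuse mount point and the attached SSD, and the `.pt`-file format of the samples is just an assumption:

```python
import shutil
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

GCS_MOUNT = Path("/mnt/gcs/my-dataset")        # placeholder: bucket mounted with gcsfuse
LOCAL_SSD = Path("/mnt/disks/ssd/my-dataset")  # placeholder: attached 50 GB SSD

# Step 1: one-time copy from the bucket mount to the local SSD.
if not LOCAL_SSD.exists():
    shutil.copytree(GCS_MOUNT, LOCAL_SSD)

# Step 3: train against files that now live on the local disk, so every
# epoch re-reads local storage rather than Cloud Storage.
class LocalTensorDataset(Dataset):
    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))  # assumes samples are saved as .pt files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(self.files[idx])    # assumes a dict with "x" and "y"
        return sample["x"], sample["y"]

loader = DataLoader(LocalTensorDataset(LOCAL_SSD), batch_size=32,
                    shuffle=True, num_workers=4)
```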

The alternative approach is this:

  1. Run the processing operations on the data within the Cloud Storage bucket itself, using the directory mounted with gcsfuse.
  2. Run training with PyTorch directly on the mounted gcsfuse bucket directory, using a Compute Engine instance with very limited storage (see the sketch after this list).
  3. Upload results and delete the Compute Engine instance.
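And here is roughly what the second approach would look like, reading straight from the mounted bucket. Again, `/mnt/gcs/my-dataset` is a placeholder mount point and the `.pt` sample format is an assumption; the point is that every read goes through gcsfuse to Cloud Storage:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

GCS_MOUNT = Path("/mnt/gcs/my-dataset")  # placeholder: gcsfuse mount point

class MountedBucketDataset(Dataset):
    """Reads samples straight from the gcsfuse mount.

    Every __getitem__ call goes through gcsfuse, so it is billed as a
    Cloud Storage object read and has much higher latency than a local SSD.
    """

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))  # assumes .pt sample files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(self.files[idx])
        return sample["x"], sample["y"]

# Extra workers and prefetching help hide some of the mount's network latency.
loader = DataLoader(MountedBucketDataset(GCS_MOUNT), batch_size=32,
                    num_workers=8, prefetch_factor=4)
```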

Which of these approaches is suggested? Which will incur fewer charges, and which is used most often when running these kinds of operations? Is there a different workflow that I'm not seeing here?

On the billing side, the charges would be the same, as gcsfuse operations are charged like any other Cloud Storage interface according to the documentation. In your use case I don't know exactly how you are going to train on the data, but if you perform more than one operation on each file, it would be better to download the files, train locally, and then upload the final result, which amounts to two object operations per file. If, for example, you change or read a file more than once during training, every one of those accesses is billed as an object operation. On the workflow side, the first proposal (download, train locally, upload) looks good to me.
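As a rough illustration of that "download once, upload once" pattern, here is a sketch using the google-cloud-storage client library instead of gcsfuse; the bucket name, prefix, and paths below are placeholders, not anything from your setup:

```python
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "my-training-bucket"          # placeholder bucket name
PREFIX = "raw/"                        # placeholder prefix for the raw dataset
LOCAL_DIR = Path("/mnt/disks/ssd/raw") # placeholder local SSD directory

client = storage.Client()
bucket = client.bucket(BUCKET)

# One read per object: download everything once before training starts.
LOCAL_DIR.mkdir(parents=True, exist_ok=True)
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    dest = LOCAL_DIR / Path(blob.name).name
    blob.download_to_filename(str(dest))

# ... run preprocessing and training entirely against LOCAL_DIR ...

# One write per result: upload the processed data / model once at the end.
bucket.blob("processed/model.pt").upload_from_filename("/mnt/disks/ssd/model.pt")
```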
