Process a file in Google Cloud Storage
I have some very large files (100 GB) in GCS that need to be processed to remove invalid characters. Downloading, processing, and re-uploading them takes forever. Does anyone know if it is possible to process them within Google Cloud Platform, eliminating the need for the download/upload round trip?
I am familiar with Python and Cloud Functions, if those are an option.
As John Hanley said in the comments section, there are no compute features on Cloud Storage, so to process a file you need to download it.
That said, instead of downloading the huge file locally to process it, you can start a Compute Engine VM, download the file there, process it with a Python script (since you have stated that you're familiar with Python), and upload the processed file back.
Downloading the file to a Compute Engine VM will probably be quicker than downloading it to your own computer (it depends on the machine type, though).
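The processing step on the VM can be done in constant memory by streaming the file in chunks rather than loading 100 GB at once. Here is a minimal sketch; the definition of "invalid characters" (ASCII control bytes here) and the file names are assumptions you should adapt to your data:

```python
import io
import re

# Assumed cleaning rule: strip ASCII control bytes other than
# tab, newline, and carriage return. Adjust to your definition
# of "invalid characters".
INVALID = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per read keeps memory flat


def clean_stream(src, dst, chunk_size=CHUNK_SIZE):
    """Copy src to dst chunk by chunk, dropping invalid bytes."""
    while chunk := src.read(chunk_size):
        dst.write(INVALID.sub(b'', chunk))


# Quick in-memory check of the cleaning rule:
src = io.BytesIO(b'hello\x00world\x1f!\n')
dst = io.BytesIO()
clean_stream(src, dst)
print(dst.getvalue())  # the NUL and 0x1f bytes are removed
```

On the VM you would call `clean_stream` with the downloaded file opened in `'rb'` mode as `src` and a new output file opened in `'wb'` mode as `dst`, then upload the output with `gsutil`. Because the pattern matches single bytes, chunk boundaries can never split a match.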
Also, for faster downloads of huge files, you can use some gsutil options:
gsutil \
-o 'GSUtil:parallel_thread_count=1' \
-o 'GSUtil:sliced_object_download_max_components=16' \
cp gs://my-bucket/my-huge-file .
And for faster uploads of huge files, you can use parallel composite uploads:
gsutil \
-o 'GSUtil:parallel_composite_upload_threshold=150M' \
cp my-huge-file gs://my-bucket