Process a file in Google Cloud storage

I have some very large files (100 GB) in GCS that need to be processed to remove invalid characters. Downloading them, processing them, and uploading them again takes forever. Does anyone know whether it is possible to process them within Google Cloud Platform, eliminating the need for the download/upload?

I am familiar with Python and Cloud Functions, if those are an option.

As John Hanley said in the comments section, there are no compute features in Cloud Storage, so to process the file you need to download it.

That said, instead of downloading the huge file to your local machine, you can start a Compute Engine VM, download the file there, process it with a Python script (since you have stated that you're familiar with Python), and upload the processed file back to the bucket.
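As a rough illustration of the processing step, here is a minimal sketch of a script that cleans a file in fixed-size chunks, so a 100 GB file never has to fit in memory. The question does not say what counts as an invalid character, so this sketch assumes it means bytes that are not valid UTF-8. The function name `strip_invalid_utf8` and the chunk size are hypothetical choices, not something from the original answer.

```python
import codecs

# Hypothetical chunk size (16 MiB); tune it to the memory of the VM.
CHUNK_SIZE = 16 * 1024 * 1024


def strip_invalid_utf8(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src_path to dst_path, dropping bytes that are not valid UTF-8.

    Assumption: "invalid characters" means invalid UTF-8 sequences.
    Swap the decoding step for your own rule if it means something else.
    """
    # An incremental decoder buffers a multi-byte character that straddles
    # two chunks, so it is not mistaken for invalid input.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(decoder.decode(chunk).encode("utf-8"))
        # Flush the decoder; an incomplete trailing sequence is dropped.
        dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

On the VM you would call it as `strip_invalid_utf8("my-huge-file", "my-huge-file.clean")` once the download has finished, then upload the cleaned copy.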

Downloading the file to a Compute Engine VM will probably be quicker than downloading it to your own computer, though that depends on the machine type.

Also, for faster downloads of huge files, you can use some gsutil options:

gsutil \
    -o 'GSUtil:parallel_thread_count=1' \
    -o 'GSUtil:sliced_object_download_max_components=16' \
    cp gs://my-bucket/my-huge-file .

And for faster uploads of huge files, you can use parallel composite uploads:

gsutil \
    -o 'GSUtil:parallel_composite_upload_threshold=150M' \
    cp my-huge-file gs://my-bucket
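If you would rather drive the whole round trip from one Python script on the VM, something along these lines could work. It is only a sketch: it assumes `gsutil` is installed and authenticated on the VM, it reuses the same `-o` options as the two commands above, and the helper names (`download_cmd`, `upload_cmd`, `run_pipeline`) and the `process` callback are made up for illustration.

```python
import subprocess


def download_cmd(src_uri, local_path):
    """Build the sliced-download gsutil command as an argument list."""
    return [
        "gsutil",
        "-o", "GSUtil:parallel_thread_count=1",
        "-o", "GSUtil:sliced_object_download_max_components=16",
        "cp", src_uri, local_path,
    ]


def upload_cmd(local_path, dst_uri):
    """Build the parallel-composite-upload gsutil command as an argument list."""
    return [
        "gsutil",
        "-o", "GSUtil:parallel_composite_upload_threshold=150M",
        "cp", local_path, dst_uri,
    ]


def run_pipeline(src_uri, dst_uri, raw_path, clean_path, process):
    """Download, clean, and re-upload one object.

    `process` is any callable taking (input_path, output_path); it is a
    placeholder for your own cleaning function.
    """
    # check=True raises CalledProcessError if gsutil exits with an error,
    # so a failed download never reaches the processing step.
    subprocess.run(download_cmd(src_uri, raw_path), check=True)
    process(raw_path, clean_path)
    subprocess.run(upload_cmd(clean_path, dst_uri), check=True)
```

Passing the commands as lists avoids shell quoting issues with the `-o` values. A call might look like `run_pipeline("gs://my-bucket/my-huge-file", "gs://my-bucket", "my-huge-file", "my-huge-file.clean", my_cleaner)`, where `my_cleaner` is your own function.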
