
Transferring data from GCS to S3 with google-cloud-storage

I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble figuring out how to do it in Python.

I have already written the code in Kotlin (because it was easiest for me, and for reasons outside the scope of my question we want it to run in Python). In Kotlin the Google SDK lets me get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).

With the Python SDK it seems I only have the options to download a blob to a file or as a string.
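
For reference, a minimal sketch of those two options as I understand them (the project, bucket, and file names are just placeholders):

from google.cloud import storage

client = storage.Client(project="my-project")
blob = client.get_bucket("my-gcs-bucket").blob("my-file.csv")

blob.download_to_filename("/tmp/my-file.csv")  # option 1: write to a local file
content = blob.download_as_string()            # option 2: return the content as bytes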

I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.

I am in no way a Python pro, so I might have missed an obvious way of doing this.

I ended up with the following solution, as apparently download_to_file downloads data into a file-like object that the boto3 S3 client can handle.

This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.

from io import BytesIO

import boto3
from google.cloud import storage


def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")

    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Download the whole blob into an in-memory buffer, then rewind it
    # so it can be read from the start.
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # upload_fileobj accepts any file-like object opened for reading.
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
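
For completeness, it is called like this (the bucket and object names are placeholders):

copy_data_from_gcs_to_s3("my-gcs-bucket", "exports/data.csv",
                         "my-s3-bucket", "imports/data.csv")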

If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3, without having to buffer it in memory on the host machine), it would be very much appreciated.
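
One possibility, sketched here under the assumption that a reasonably recent google-cloud-storage release is available (Blob.open was added in version 1.32), is to hand the blob's file-like reader straight to boto3; upload_fileobj reads it in chunks via a managed multipart upload, so the whole blob never has to sit in memory:

import boto3
from google.cloud import storage


def stream_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    blob = gcs_client.get_bucket(gcs_bucket).blob(gcs_filename)

    s3 = boto3.client("s3")
    # Blob.open("rb") returns a chunked, file-like reader over the blob.
    with blob.open("rb") as reader:
        s3.upload_fileobj(reader, s3_bucket, s3_filename)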

google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
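
A hedged sketch of that smart_open approach (the URIs are placeholders; smart_open needs the gcs and s3 extras installed and handles the chunked reads and the S3 multipart upload internally):

import shutil

from smart_open import open as sopen  # pip install "smart_open[gcs,s3]"


def copy_with_smart_open(gcs_uri, s3_uri):
    # Reads gs://... and writes s3://... as file-like streams; the copy
    # happens in fixed-size chunks rather than loading the whole object.
    with sopen(gcs_uri, "rb") as src, sopen(s3_uri, "wb") as dst:
        shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)  # 16 MiB chunks


copy_with_smart_open("gs://my-gcs-bucket/exports/data.csv",
                     "s3://my-s3-bucket/imports/data.csv")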


