
Transferring data from GCS to S3 with google-cloud-storage

I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble figuring out how to do it in Python.

I have already written the code in Kotlin (because it was easiest for me, and for reasons outside the scope of my question we want it to run in Python). In Kotlin the Google SDK lets me get an InputStream from the Blob object, which I can then pass into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).

With the Python SDK it seems I only have the options to download a blob to a file or as a string.
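
For reference, a minimal sketch of those two options as I understand them (the project, bucket, and file names are just placeholders):

from google.cloud import storage

client = storage.Client(project="my-project")
blob = client.get_bucket("my-gcs-bucket").blob("my-file.csv")

blob.download_to_filename("/tmp/my-file.csv")  # option 1: write to a local file
content = blob.download_as_string()            # option 2: return the content as bytes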

I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.

I am in no way a Python pro, so I might have missed an obvious way of doing this.

I ended up with the following solution, as apparently download_to_file downloads data into a file-like object that the boto3 S3 client can handle.

This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.

from io import BytesIO

import boto3
from google.cloud import storage


def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")

    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)

    # Download the whole blob into an in-memory buffer, then rewind it
    # so it can be read from the start.
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)

    # upload_fileobj accepts any file-like object opened for reading.
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
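
For completeness, it is called like this (the bucket and object names are placeholders):

copy_data_from_gcs_to_s3("my-gcs-bucket", "exports/data.csv",
                         "my-s3-bucket", "imports/data.csv")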

If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3, without having to buffer it in memory on the host machine), it would be very much appreciated.
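
One possibility, sketched here under the assumption that a reasonably recent google-cloud-storage release is available (Blob.open was added in version 1.32), is to hand the blob's file-like reader straight to boto3; upload_fileobj reads it in chunks via a managed multipart upload, so the whole blob never has to sit in memory:

import boto3
from google.cloud import storage


def stream_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    gcs_client = storage.Client(project="my-project")
    blob = gcs_client.get_bucket(gcs_bucket).blob(gcs_filename)

    s3 = boto3.client("s3")
    # Blob.open("rb") returns a chunked, file-like reader over the blob.
    with blob.open("rb") as reader:
        s3.upload_fileobj(reader, s3_bucket, s3_filename)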

google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. This way you don't need to download the whole file into memory. There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?
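
A hedged sketch of that smart_open approach (the URIs are placeholders; smart_open needs the gcs and s3 extras installed and handles the chunked reads and the S3 multipart upload internally):

import shutil

from smart_open import open as sopen  # pip install "smart_open[gcs,s3]"


def copy_with_smart_open(gcs_uri, s3_uri):
    # Reads gs://... and writes s3://... as file-like streams; the copy
    # happens in fixed-size chunks rather than loading the whole object.
    with sopen(gcs_uri, "rb") as src, sopen(s3_uri, "wb") as dst:
        shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)  # 16 MiB chunks


copy_with_smart_open("gs://my-gcs-bucket/exports/data.csv",
                     "s3://my-s3-bucket/imports/data.csv")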


