Transferring data from GCS to S3 with google-cloud-storage
I'm making a small app to export data from BigQuery to Google Cloud Storage and then copy it into AWS S3, but I'm having trouble finding out how to do it in Python.
I have already written the code in Kotlin (because it was easiest for me, and for reasons outside the scope of my question we want it to run in Python). In Kotlin the Google SDK allows me to get an InputStream from the Blob object, which I can then inject into the Amazon S3 SDK's AmazonS3.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata).
With the Python SDK it seems I only have the options to download the blob to a local file or as a string.
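The two options I mean look roughly like this (bucket and object names here are placeholders):

from google.cloud import storage

client = storage.Client(project="my-project")
blob = client.get_bucket("my-gcs-bucket").blob("some/object")  # placeholder names

# Option 1: write the blob to a local file on disk.
blob.download_to_filename("/tmp/local-copy")

# Option 2: load the whole blob into memory as bytes.
data = blob.download_as_string()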
I would like (as I do in Kotlin) to pass some object returned from the Blob object into the AmazonS3.putObject() method, without having to save the content as a file first.
I am in no way a Python pro, so I might have missed an obvious way of doing this.
I ended up with the following solution: apparently download_to_file writes the data into a file-like object, which the boto3 S3 client can then handle.
This works just fine for smaller files, but since it buffers everything in memory, it could be problematic for larger files.
from io import BytesIO

import boto3
from google.cloud import storage

def copy_data_from_gcs_to_s3(gcs_bucket, gcs_filename, s3_bucket, s3_filename):
    # Download the GCS blob into an in-memory buffer.
    gcs_client = storage.Client(project="my-project")
    bucket = gcs_client.get_bucket(gcs_bucket)
    blob = bucket.blob(gcs_filename)
    data = BytesIO()
    blob.download_to_file(data)
    data.seek(0)  # rewind so boto3 reads from the start
    # Hand the buffer to boto3 as a file-like object.
    s3 = boto3.client("s3")
    s3.upload_fileobj(data, s3_bucket, s3_filename)
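It can then be called like this (bucket and object names are placeholders):

copy_data_from_gcs_to_s3(
    gcs_bucket="my-gcs-bucket",
    gcs_filename="exports/data.json",
    s3_bucket="my-s3-bucket",
    s3_filename="imports/data.json",
)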
If anyone has information/knowledge about something other than BytesIO to handle the data (e.g. so I can stream the data directly into S3, without having to buffer it in memory on the host machine), it would be very much appreciated.
Google-resumable-media can be used to download the file in chunks from GCS, and smart_open to upload them to S3. This way you don't need to download the whole file into memory.
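A minimal sketch of that combination might look like the following. It is untested and makes a few assumptions: default Google credentials are available, the bucket/object names and chunk size are placeholders, and the GCS object is addressed by its standard media URL.

from urllib.parse import quote

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.resumable_media.requests import ChunkedDownload
from smart_open import open as smart_open

def stream_gcs_to_s3(gcs_bucket, gcs_blob_name, s3_uri, chunk_size=256 * 1024):
    # Authorized HTTP session used as the transport for the chunked download.
    credentials, _ = google.auth.default()
    transport = AuthorizedSession(credentials)
    # Direct media URL for the GCS object (standard JSON API download form).
    media_url = (
        "https://storage.googleapis.com/download/storage/v1/b/"
        f"{gcs_bucket}/o/{quote(gcs_blob_name, safe='')}?alt=media"
    )
    # smart_open buffers writes and uploads to S3 as a multipart upload,
    # so only one chunk at a time needs to be held in memory.
    with smart_open(s3_uri, "wb") as fout:
        download = ChunkedDownload(media_url, chunk_size, fout)
        while not download.finished:
            download.consume_next_chunk(transport)

stream_gcs_to_s3("my-gcs-bucket", "big/export.json", "s3://my-s3-bucket/big/export.json")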
There is also a similar question that addresses this issue: Can you upload to S3 using a stream rather than a local file?