Exporting data from Google Cloud Storage to Amazon S3
I would like to transfer data from a table in BigQuery into another one in Redshift. My planned data flow is as follows:
BigQuery -> Google Cloud Storage -> Amazon S3 -> Redshift
I know about Google Cloud Storage Transfer Service, but I'm not sure it can help me. From the Google Cloud documentation:
Cloud Storage Transfer Service
This page describes Cloud Storage Transfer Service, which you can use to quickly import online data into Google Cloud Storage.
I understand that this service can be used to import data into Google Cloud Storage and not to export from it.
Is there a way I can export data from Google Cloud Storage to Amazon S3?
You can use gsutil to copy data from a Google Cloud Storage bucket to an Amazon bucket, using a command such as:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
Note that the -d option above will cause gsutil rsync to delete objects from your S3 bucket that aren't present in your GCS bucket (in addition to adding new objects). You can leave off that option if you just want to add new objects from your GCS bucket to your S3 bucket.
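For example, a non-destructive sync that only adds or updates objects (assuming gsutil is already authorized for both buckets) would be:
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket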
Go to any instance or Cloud Shell in GCP.
First of all, configure your AWS credentials on your GCP instance:
aws configure
If the command is not recognized, install the AWS CLI by following this guide: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
Follow this URL to configure the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
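If you prefer to script that step instead of answering the interactive prompts, something like the following should work (the key values and region below are placeholders):
aws configure set aws_access_key_id <your-access-key-id>
aws configure set aws_secret_access_key <your-secret-access-key>
aws configure set default.region <your-region>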
Then, using gsutil:
gsutil -m rsync -rd gs://storagename s3://bucketname
16 GB of data transferred in a few minutes.
Using Rclone (https://rclone.org/).
Rclone is a command-line program to sync files and directories to and from the following (a usage sketch follows the list):
Google Drive
Amazon S3
Openstack Swift / Rackspace cloud files / Memset Memstore
Dropbox
Google Cloud Storage
Amazon Drive
Microsoft OneDrive
Hubic
Backblaze B2
Yandex Disk
SFTP
The local filesystem
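As a rough sketch, assuming you have already run rclone config and created a remote named "gcs" for Google Cloud Storage and one named "s3" for Amazon S3, the transfer would look something like:
rclone copy gcs:your-gcs-bucket s3:your-s3-bucket --progress
rclone check gcs:your-gcs-bucket s3:your-s3-bucket
The second command is optional and just compares the two sides after the copy.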
I needed to transfer 2 TB of data from a Google Cloud Storage bucket to an Amazon S3 bucket. For the task, I created a Google Compute Engine instance (8 vCPUs, 30 GB memory).
Allow login using SSH on the Compute Engine instance. Once logged in, create an empty .boto configuration file to hold the AWS credential information, and add the AWS credentials by taking the reference from the mentioned link.
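As a minimal sketch, the AWS credentials go into a [Credentials] section of the .boto file (the key values below are placeholders):
cat >> ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = <your-aws-access-key-id>
aws_secret_access_key = <your-aws-secret-access-key>
EOF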
Then run the command:
gsutil -m rsync -rd gs://your-gcs-bucket s3://your-s3-bucket
The data transfer rate was ~1 GB/s.
Hope this helps. (Do not forget to terminate the compute instance once the job is done.)
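For example, assuming the instance name and zone below are placeholders for your own:
gcloud compute instances delete <instance-name> --zone=<zone>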
Using the gsutil tool we can do a wide range of bucket and object management tasks. In particular, we can copy data from a Google Cloud Storage bucket to an Amazon S3 bucket using the gsutil rsync and gsutil cp operations.
gsutil rsync collects all metadata from the bucket and syncs the data to S3:
gsutil -m rsync -r gs://your-gcs-bucket s3://your-s3-bucket
gsutil cp copies the files one by one, and as the transfer rate is good it copies roughly 1 GB per minute:
gsutil cp gs://<gcs-bucket> s3://<s3-bucket-name>
If you have a large number of files with a high data volume, use the bash script below and run it in the background (for example in a screen session, as shown after the script) on an Amazon or GCP instance with AWS credentials configured and GCP auth verified.
Before running the script, list all the files, redirect the listing to a file, and read that file as the input in the script:
gsutil ls gs://<gcs-bucket> > file_list_part.out
Bash script:
#!/bin/bash
# Reads GCS object URLs from the listing file and copies each one to S3.
echo "start processing"
input="file_list_part.out"
while IFS= read -r line
do
    now=$(date +"%Y-%m-%d %H:%M:%S")
    command="gsutil cp ${line} s3://<bucket-name>"
    echo "command :: $command :: $now"
    eval $command
    retVal=$?
    if [ $retVal -ne 0 ]; then
        echo "Error copying file"
        exit 1
    fi
    echo "Copy completed successfully"
done < "$input"
echo "completed processing"
Execute the bash script and write the output to a log file to check the progress of completed and failed files:
bash file_copy.sh > /root/logs/file_copy.log 2>&1
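To keep it running in the background as described, one option (a sketch; the session and log names are placeholders) is to start it inside a detached screen session:
screen -dmS gcs_to_s3 bash -c 'bash file_copy.sh > /root/logs/file_copy.log 2>&1'
screen -ls            # list running sessions
screen -r gcs_to_s3   # reattach to check progress
To parallelize, you can split file_list_part.out into several chunks (for example with the split command), point a copy of the script at each chunk, and start one screen session per chunk.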
For large amounts of large files (100 MB+) you might get issues with broken pipes and other annoyances, probably due to the multipart upload requirement (as Pathead mentioned).
For that case you're left with simply downloading all the files to your machine and uploading them back. Depending on your connection and data volume, it might be more effective to create a VM instance to take advantage of a high-speed connection and the ability to run the transfer in the background on a machine other than your own.
Create a VM (make sure the service account has access to your buckets), connect via SSH, install the AWS CLI (apt install awscli) and configure access to S3 (aws configure).
Run these two lines, or make it a bash script if you have many buckets to copy (a sketch of such a script is below).
gsutil -m cp -r "gs://$1" ./
aws s3 cp --recursive "./$1" "s3://$1"
(It's better to use rsync in general, but cp was faster for me.)
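A minimal sketch of such a script, assuming each bucket exists with the same name on both sides and the names are passed as arguments:
#!/bin/bash
# copy_buckets.sh - download each GCS bucket locally, then upload it to the same-named S3 bucket
for bucket in "$@"; do
    gsutil -m cp -r "gs://$bucket" ./
    aws s3 cp --recursive "./$bucket" "s3://$bucket"
done
Run it as, for example: bash copy_buckets.sh bucket-one bucket-two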
Tools like gsutil and aws s3 cp won't use multipart uploads/downloads, so they will have poor performance for large files.
Skyplane is a much faster alternative for transferring data between clouds (up to 110x for large files). You can transfer data with the command:
skyplane cp -r s3://aws-bucket-name/ gcs://google-bucket-name/
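Before the first transfer, Skyplane has to be installed and initialized with credentials for both clouds; per the project documentation this is roughly:
pip install "skyplane[aws,gcp]"
skyplane init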
(disclaimer: I am a contributor)