
Migrating data from S3 to Google Cloud Storage

I need to move a large amount of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500 MB.

So far I have tried using gsutil cp with the parallel option (-m), with S3 as the source and GS as the destination directly. Even after tweaking the multi-processing and multi-threading parameters, I haven't been able to achieve throughput of over 30 MB/s.
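For reference, a rough sketch of the kind of invocation I have been running (the bucket names are placeholders, and the parallelism values are just examples of what I tried; they live in the [GSUtil] section of the .boto config file):

    # ~/.boto, [GSUtil] section (example values I experimented with):
    #   parallel_process_count = 8
    #   parallel_thread_count = 4
    gsutil -m cp -R s3://my-source-bucket/ gs://my-dest-bucket/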

What I am now contemplating:

  • Load the data in batches from S3 into HDFS using distcp (a rough sketch follows this list) and then find a way of distcp-ing all the data into Google Storage (not supported as far as I can tell), or:

  • Set up a Hadoop cluster where each node runs a gsutil cp parallel job with S3 and GS as src and dst
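For the first option, the S3-to-HDFS leg would presumably look something like the sketch below (the bucket name, paths, and the s3n:// scheme are assumptions on my part; credentials are assumed to be configured in core-site.xml). It's the second leg, from HDFS into Google Storage, that I can't find support for:

    # Stage one batch from S3 into HDFS with distcp.
    # Assumes fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are set in core-site.xml.
    hadoop distcp s3n://my-source-bucket/batch-0001/ hdfs:///staging/batch-0001/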

If the first option were supported, I would really appreciate details on how to do that. However, it seems like I'm going to have to figure out how to do the second one. I'm unsure of how to pursue this avenue because I would need to keep track of the gsutil resumable transfer feature on many nodes, and I'm generally inexperienced at running this sort of Hadoop job.

Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.

You could set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start up multiple GCE instances, each importing a subset of the data. That's part of one of the techniques covered in the talk we gave at Google I/O 2013 called Importing Large Data Sets into Google Cloud Storage.
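For example, one simple way to split the work is to assign each instance a different prefix of the source bucket. This is just a sketch; the bucket names are placeholders, and how you hand each instance its PREFIX (startup script argument, instance metadata, etc.) is up to you:

    # Run on each GCE instance; PREFIX is the slice of the bucket assigned to this instance.
    gsutil -m cp -R "s3://my-source-bucket/${PREFIX}" gs://my-dest-bucket/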

One other thing you'll want to do if you use this approach is to use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n allows you to avoid re-copying files that were already copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves how the -L option works for this kind of copying scenario.
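A sketch of how those options fit together (the manifest file name and bucket names are placeholders):

    # -L records every copied object in a manifest file; -n skips objects that already exist
    # at the destination, so re-running the same command after an interruption only copies what's missing.
    gsutil -m cp -L transfer-manifest.csv -n -R s3://my-source-bucket/ gs://my-dest-bucket/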

Mike Schwartz, Google Cloud Storage team

Google has recently released the Cloud Storage Transfer Service, which is designed to transfer large amounts of data from S3 to GCS: https://cloud.google.com/storage/transfer/getting-started
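As a rough sketch of what starting a transfer can look like from the command line (the exact command group, flag names, and credentials file format depend on your gcloud version, so treat these as assumptions and follow the getting-started guide linked above):

    # Assumes the gcloud transfer commands are available and s3-creds.json holds the AWS access key pair.
    gcloud transfer jobs create s3://my-source-bucket gs://my-dest-bucket \
        --source-creds-file=s3-creds.json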

(I realize this answer is a little late for the original question, but it may help future visitors with the same question.)
