Backup AWS DynamoDB to S3

It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce. I have a general understanding of how this could work, but I couldn't find any guides or tutorials on it.

So my question is: how can I automate DynamoDB backups (using EMR)?

So far, I think I need to create a "streaming" job with a map function that reads the data from DynamoDB and a reduce that writes it to S3, and I believe these could be written in Python (or Java or a few other languages).

Any comments, clarifications, code samples, or corrections are appreciated.

With the introduction of AWS Data Pipeline, which has a ready-made template for DynamoDB-to-S3 backups, the easiest way is to schedule a backup in Data Pipeline [link].

In case you have special needs (data transformation, very fine-grained control, ...), consider the answer by @greg.

There are some good guides for working with MapReduce and DynamoDB. I followed this one the other day and got data exporting to S3 going reasonably painlessly. I think your best bet would be to create a Hive script that performs the backup task, save it in an S3 bucket, then use the AWS API for your language to programmatically spin up a new EMR job flow and complete the backup. You could set this up as a cron job.

Example of a Hive script exporting data from DynamoDB to S3:

CREATE EXTERNAL TABLE my_table_dynamodb (
    company_id string
    ,id string
    ,name string
    ,city string
    ,state string
    ,postal_code string)
 STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
 TBLPROPERTIES ("dynamodb.table.name"="my_table","dynamodb.column.mapping" = "company_id:company_id,id:id,name:name,city:city,state:state,postal_code:postal_code");

CREATE EXTERNAL TABLE my_table_s3 (
    company_id string
    ,id string
    ,name string
    ,city string
    ,state string
    ,postal_code string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://yourBucket/backup_path/dynamo/my_table';

 INSERT OVERWRITE TABLE my_table_s3
 SELECT * from my_table_dynamodb;

Here is an example of a PHP script that will spin up a new EMR job flow:

$emr = new AmazonEMR();

$response = $emr->run_job_flow(
            'My Test Job',
            array(
                "TerminationProtected" => "false",
                "HadoopVersion" => "0.20.205",
                "Ec2KeyName" => "my-key",
                "KeepJobFlowAliveWhenNoSteps" => "false",
                "InstanceGroups" => array(
                    array(
                        "Name" => "Master Instance Group",
                        "Market" => "ON_DEMAND",
                        "InstanceType" => "m1.small",
                        "InstanceCount" => 1,
                        "InstanceRole" => "MASTER",
                    ),
                    array(
                        "Name" => "Core Instance Group",
                        "Market" => "ON_DEMAND",
                        "InstanceType" => "m1.small",
                        "InstanceCount" => 1,
                        "InstanceRole" => "CORE",
                    ),
                ),
            ),
            array(
                "Name" => "My Test Job",
                "AmiVersion" => "latest",
                "Steps" => array(
                    array(
                        "HadoopJarStep" => array(
                            "Args" => array(
                                "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                                "--base-path",
                                "s3://us-east-1.elasticmapreduce/libs/hive/",
                                "--install-hive",
                                "--hive-versions",
                                "0.7.1.3",
                            ),
                            "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                        ),
                        "Name" => "Setup Hive",
                        "ActionOnFailure" => "TERMINATE_JOB_FLOW",
                    ),
                    array(
                        "HadoopJarStep" => array(
                            "Args" => array(
                                "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                                "--base-path",
                                "s3://us-east-1.elasticmapreduce/libs/hive/",
                                "--hive-versions",
                                "0.7.1.3",
                                "--run-hive-script",
                                "--args",
                                "-f",
                                "s3n://myBucket/hive_scripts/hive_script.hql",
                                "-d",
                                "INPUT=Var_Value1",
                                "-d",
                                "LIB=Var_Value2",
                                "-d",
                                "OUTPUT=Var_Value3",
                            ),
                            "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                        ),
                        "Name" => "Run Hive Script",
                        "ActionOnFailure" => "CANCEL_AND_WAIT",
                    ),
                ),
                "LogUri" => "s3n://myBucket/logs",
            )
        );

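If you prefer Python (mentioned in the question), a rough boto3 sketch of the same idea is shown below. It targets current release-label EMR clusters, where steps are normally submitted through command-runner.jar rather than the older AMI-era script-runner.jar used above; the release label, instance types, key pair, IAM roles, bucket names and script path are all placeholders.

# Hedged sketch: start an EMR cluster that runs a Hive script stored in S3,
# then terminates itself. All names below are placeholders to adjust.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="dynamodb-backup",
    ReleaseLabel="emr-6.15.0",              # any current release label
    Applications=[{"Name": "Hive"}],
    LogUri="s3://myBucket/logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "Market": "ON_DEMAND", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "Market": "ON_DEMAND", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "Ec2KeyName": "my-key",
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    Steps=[
        {
            "Name": "Run Hive backup script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://myBucket/hive_scripts/hive_script.hql"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started job flow:", response["JobFlowId"])

A script like this can then be run from a cron job, exactly as described above.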

AWS Data Pipeline is costly, and the complexity of managing a templated process can't compare to the simplicity of a CLI command you can modify and run on a schedule (using cron, TeamCity, or your CI tool of choice).

Amazon promotes Data Pipeline because they make a profit on it. I'd say it only really makes sense if you have a very large database (>3 GB), where the performance improvement justifies it.

For small and medium databases (1 GB or less) I'd recommend using one of the many tools available; all three below can handle backup and restore processes from the command line:

Bear in mind that, due to bandwidth/latency issues, these will always perform better from an EC2 instance than from your local network.

With the introduction of DynamoDB Streams and Lambda, you should be able to take backups and incremental backups of your DynamoDB data.

You can associate your DynamoDB Stream with a Lambda function to automatically trigger code for every data update (i.e., copy the data to another store such as S3).
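Purely as an illustration of that wiring (this is not the code of the replicator linked below), a minimal Python Lambda handler for a DynamoDB Stream event could look like the sketch here; the bucket name and S3 key layout are assumptions.

# Minimal sketch of a Lambda handler attached to a DynamoDB Stream: every
# change record is written to S3 as a small JSON object. No batching or
# error handling; bucket name and key layout are placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-backup-bucket"  # placeholder

def handler(event, context):
    for record in event.get("Records", []):
        table_name = record["eventSourceARN"].split("/")[1]  # table name from the stream ARN
        body = {
            "eventName": record["eventName"],                 # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),   # absent for REMOVE events
        }
        key = "incremental/{}/{}.json".format(table_name, record["eventID"])
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(body).encode("utf-8"))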

Here is a Lambda function you can use together with DynamoDB for incremental backups:

https://github.com/PageUpPeopleOrg/dynamodb-replicator

I've provided a detailed explanation on my blog of how you can use DynamoDB Streams, Lambda, and S3 versioned buckets to create incremental backups of your data in DynamoDB:

https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups

Edit:

As of Dec 2017, DynamoDB has released On-Demand Backups/Restores. This allows you to take backups and store them natively in DynamoDB, and they can be restored to a new table. A detailed walk-through, including code to schedule the backups, is provided here:

https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups
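As a quick illustration only (the scheduling details are what the post above covers), taking one of these native on-demand backups from Python with boto3 can be as simple as the following sketch; the table name is a placeholder.

# Hedged sketch: take a native DynamoDB on-demand backup with boto3.
import datetime
import boto3

dynamodb = boto3.client("dynamodb")

table_name = "my_table"  # placeholder
backup_name = "{}-{}".format(table_name, datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S"))

response = dynamodb.create_backup(TableName=table_name, BackupName=backup_name)
print("Backup ARN:", response["BackupDetails"]["BackupArn"])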

HTH

You can use my simple Node.js script dynamo-archive.js, which scans an entire DynamoDB table and saves the output to a JSON file. Then you upload it to S3 using s3cmd.
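If you would rather roll that yourself, the same idea (scan the whole table, dump it to JSON, push the file to S3) can be sketched in a few lines of boto3; this is only an illustration, not the script's actual code, and the table and bucket names are placeholders.

# Hedged sketch: paginate a full Scan of the table, dump the items to JSON,
# and upload the result to S3. Ignores read-throughput/throttling concerns.
import json
import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

items = []
for page in dynamodb.get_paginator("scan").paginate(TableName="my_table"):
    items.extend(page["Items"])

s3.put_object(Bucket="myBucket", Key="backups/my_table.json",
              Body=json.dumps(items).encode("utf-8"))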

You can use the handy Python-based dynamodump tool (it uses boto) to dump the tables into JSON files, and then upload them to S3 with s3cmd.

AWS Data Pipeline is only available in a limited set of regions.

It took me two hours to debug the template.

https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region

I found the dynamodb-backup Lambda function to be really helpful. It took me five minutes to set up, and it can easily be configured to use a CloudWatch Schedule event (don't forget to run npm install first, though).

It's also a lot cheaper for me than Data Pipeline (~$40 per month); I estimate the cost at around 1.5 cents per month (both excluding S3 storage). Note that by default it backs up all DynamoDB tables at once, which can easily be adjusted within the code.

The only missing part is being notified if the function fails, something Data Pipeline was able to do.

You can now back up your DynamoDB data straight to S3 natively, without using Data Pipeline or writing custom scripts. This is probably the easiest way to achieve what you want, because it doesn't require you to write any code or run any task/script; it's fully managed.

Since 2020 you can export a DynamoDB table to S3 directly in the AWS UI:

https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/

You need to activate PITR (Point-in-Time Recovery) first. You can choose between JSON and Amazon ION format.

In the Java SDK (version 2), you can do something like this:

// first activate PITR
PointInTimeRecoverySpecification pointInTimeRecoverySpecification  = PointInTimeRecoverySpecification
    .builder()
    .pointInTimeRecoveryEnabled(true)
    .build();
UpdateContinuousBackupsRequest updateContinuousBackupsRequest = UpdateContinuousBackupsRequest
    .builder()
    .tableName(myTable.getName())
    .pointInTimeRecoverySpecification(pointInTimeRecoverySpecification)
    .build();

UpdateContinuousBackupsResponse updateContinuousBackupsResponse;
try {
    updateContinuousBackupsResponse = dynamoDbClient.updateContinuousBackups(updateContinuousBackupsRequest);
    String updatedPointInTimeRecoveryStatus = updateContinuousBackupsResponse
        .continuousBackupsDescription()
        .pointInTimeRecoveryDescription()
        .pointInTimeRecoveryStatus()
        .toString();
    log.info("Point in Time Recovery for Table {} activated: {}", myTable.getName(),
        updatedPointInTimeRecoveryStatus);
} catch (Exception e) {
    log.error("Point in Time Recovery Activation failed: {}", e.getMessage());
}

// ... now get the table ARN
DescribeTableRequest describeTableRequest=DescribeTableRequest
    .builder()
    .tableName(myTable.getName())
    .build();

DescribeTableResponse describeTableResponse = dynamoDbClient.describeTable(describeTableRequest);
String tableArn = describeTableResponse.table().tableArn();
String s3Bucket = "myBucketName";

// choose the format (JSON or ION)
ExportFormat exportFormat=ExportFormat.ION;
ExportTableToPointInTimeRequest exportTableToPointInTimeRequest=ExportTableToPointInTimeRequest
    .builder()
    .tableArn(tableArn)
    .s3Bucket(s3Bucket)
    .s3Prefix(myTable.getS3Prefix())
    .exportFormat(exportFormat)
    .build();
dynamoDbClient.exportTableToPointInTime(exportTableToPointInTimeRequest);

Your dynamoDbClient needs to be an instance of software.amazon.awssdk.services.dynamodb.DynamoDbClient; the DynamoDbEnhancedClient or DynamoDbEnhancedAsyncClient will not work.
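If you are working in Python instead, a rough boto3 equivalent of the same flow (enable PITR, look up the table ARN, start the export) might look like the sketch below; the table name, bucket and prefix are placeholders.

# Hedged boto3 equivalent of the Java snippet above.
import boto3

dynamodb = boto3.client("dynamodb")
table_name = "my_table"       # placeholder
s3_bucket = "myBucketName"    # placeholder

# 1. Enable Point-in-Time Recovery (a prerequisite for exports).
dynamodb.update_continuous_backups(
    TableName=table_name,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# 2. Look up the table ARN.
table_arn = dynamodb.describe_table(TableName=table_name)["Table"]["TableArn"]

# 3. Start the export; the format can be DYNAMODB_JSON or ION.
export = dynamodb.export_table_to_point_in_time(
    TableArn=table_arn,
    S3Bucket=s3_bucket,
    S3Prefix="exports/my_table",  # placeholder
    ExportFormat="DYNAMODB_JSON",
)
print("Export status:", export["ExportDescription"]["ExportStatus"])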
