
AWS DynamoDB and MapReduce in Java

I have a huge DynamoDB table that I want to analyze to aggregate data stored in its attributes. The aggregated data should then be processed by a Java application. While I understand the basic concepts behind MapReduce, I've never used it before.

In my case, let's say that I have a customerId and orderNumbers attribute in every DynamoDB item, and that I can have more than one item for the same customer. Like:

customerId: 1, orderNumbers: 2
customerId: 1, orderNumbers: 6
customerId: 2, orderNumbers: -1

Basically I want to sum the orderNumbers for each customerId (so for the sample above, customer 1 would sum to 8 and customer 2 to -1), and then execute some operations in Java on the aggregate.
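
To make this concrete, here is roughly what the aggregation looks like in plain Java over the sample rows above; what I'm after is the equivalent of this running over the whole table via MapReduce:

import java.util.HashMap;
import java.util.Map;

public class SumByCustomer {
    public static void main(String[] args) {
        // (customerId, orderNumbers) pairs from the sample above
        long[][] rows = { {1, 2}, {1, 6}, {2, -1} };
        Map<Long, Long> totals = new HashMap<>();
        for (long[] row : rows) {
            // accumulate orderNumbers per customerId
            totals.merge(row[0], row[1], Long::sum);
        }
        System.out.println(totals); // e.g. {1=8, 2=-1}
    }
}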

AWS Elastic MapReduce could probably help me, but I don't understand how to connect a custom JAR with DynamoDB. My custom JAR probably needs to expose both map and reduce functions; where can I find the right interface to implement?

Plus, I'm a bit confused by the docs: it seems like I should first export my data to S3 before running my custom JAR. Is this correct?

Thanks

Note: I haven't built a working EMR setup myself; I've just read about it.

First of all, see Prerequisites for Integrating Amazon EMR with Amazon DynamoDB.

You can work directly on DynamoDB: see Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB. As you can see, you can run "SQL-like" queries that way.

If you have zero knowledge of Hadoop, you should probably read some introductory material first, such as: What is Hadoop

This tutorial is another good read: Using Amazon Elastic MapReduce with DynamoDB

Regarding your custom JAR application, you need to upload it to S3. Use this guide: How to Create a Job Flow Using a Custom JAR
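
I haven't run this against a live job flow, but here is a minimal sketch of the Mapper/Reducer classes such a JAR would expose, using the standard org.apache.hadoop.mapreduce API and assuming the table was exported to S3 as plain text lines of the form "customerId,orderNumbers" (the actual export format may differ, so the parsing would need adjusting):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class OrderSum {

    // Emits (customerId, orderNumbers) for each input line.
    public static class OrderMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            context.write(new Text(fields[0].trim()),
                          new LongWritable(Long.parseLong(fields[1].trim())));
        }
    }

    // Sums all orderNumbers values seen for the same customerId.
    public static class OrderReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text customerId, Iterable<LongWritable> orders, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable n : orders) {
                sum += n.get();
            }
            context.write(customerId, new LongWritable(sum));
        }
    }
}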

I hope this will help you get started.

Also see: http://aws.amazon.com/code/Elastic-MapReduce/28549 - which also uses Hive to access DynamoDB. This seems to be the official AWS way of accessing DynamoDB from Hadoop.

If you need to write custom code in a custom JAR, I found: DynamoDB InputFormat for Hadoop

However, I could not find documentation on the Java parameters to set for this InputFormat that correspond to the Hive parameters. According to this article, it was not released by Amazon: http://www.newvem.com/amazon-dynamodb-part-iii-mapreducin-logs/

Also see: jar containing org.apache.hadoop.hive.dynamodb

Therefore, the official, documented way to use DynamoDB data from a custom MapReduce job is to export the DynamoDB data to S3, then let Elastic MapReduce take it from S3. My guess is that this is because DynamoDB was designed to be accessed randomly as a key/value "NoSQL" store, while Hadoop input and output formats are meant for sequential access with large block sizes. The undocumented Amazon code could be some tricks to make up for this gap.
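
For that export-then-process route, the driver packaged in your custom JAR would look roughly like this; the s3://my-bucket/... paths below are placeholders for wherever you put the export and want the results:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderSumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum orderNumbers per customerId");
        job.setJarByClass(OrderSumDriver.class);
        job.setMapperClass(OrderSum.OrderMapper.class);
        // summing is associative, so the reducer can also serve as a combiner
        job.setCombinerClass(OrderSum.OrderReducer.class);
        job.setReducerClass(OrderSum.OrderReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("s3://my-bucket/dynamodb-export/"));   // exported table data
        FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/order-sums/"));      // aggregated output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Your Java application can then read the aggregated customerId/sum pairs back from the output location instead of scanning the whole table itself.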

Since the export/re-import uses up resources, it would be best if the task could be accomplished from within Hive.
