How to perform a reduce-side join in Java MapReduce on two tables with a many-to-many relationship?

Question

First of all, I am not sure whether this is possible. If it is, I am still not sure whether it is the correct way of doing it.

What I have is:

Two large CSV files called A and B on HDFS
A has the following columns: a1, a2, a3, a4
B has the following columns: b1, b2, b3, b4, b5

What I want is:

To join the two files on, let's say, a1=b1

The problem I have is:

If there is a many-to-many relationship between the two files on the join keys, how can I perform this join with Hadoop MapReduce in Java?
As you can see from the illustration below, A has 4 matching rows for a1=x and B has 2 matching rows for b1=x. Thus, joining the two tables on a1=b1=x produces 4*2 = 8 rows (combinations), as shown in the last table. With a reduce-side join, I could not manage to do that, because it means increasing the number of key-value pairs, which is against the nature of MapReduce.

How can I perform such a thing?

Why this is a problem:

Let's say table A is:

a1  a2  a3          a4
x   1   somevalue   somevalue
x   2   somevalue   somevalue
x   3   somevalue   somevalue
x   4   somevalue   somevalue

Let's say table B is:

b1  b2  b3          b4          b5
x   i   somevalue   somevalue   somevalue
x   j   somevalue   somevalue   somevalue

The result of joining the two files on a1=b1:

a1  a2  b2
x   1   i
x   2   i
x   3   i
x   4   i
x   1   j
x   2   j
x   3   j
x   4   j

Answer 1

A full join will always produce M x N output rows for each key, where M and N are the numbers of rows carrying that key in the two tables.

Note that, with a reduce-side join, the number of intermediate key-value pairs emitted by the mappers is still N + M; it is the reducer that performs the Cartesian product. So there is nothing wrong with that. Since you control the reducer, you can do further filtering and output only what you need.
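To make that concrete, here is a minimal sketch in plain Java that imitates the map, shuffle, and reduce steps in memory. It deliberately does not use the Hadoop API; the class name `ReduceSideJoinSketch`, the `A`/`B` tags, and the choice to output only a1, a2, b2 are assumptions made for illustration, not something taken from the question's actual job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of a reduce-side join (no Hadoop classes).
final class ReduceSideJoinSketch {

    // "Map" step: emit exactly one (joinKey, taggedRow) pair per input row.
    // The tag ("A" or "B") records which table the row came from.
    // The shuffle therefore carries M + N records, not M * N.
    static void map(String tag, List<String> csvRows, Map<String, List<String>> shuffle) {
        for (String row : csvRows) {
            String joinKey = row.split(",", 2)[0]; // first column: a1 or b1
            shuffle.computeIfAbsent(joinKey, k -> new ArrayList<>())
                   .add(tag + "\t" + row);
        }
    }

    // "Reduce" step: called once per key with all tagged rows for that key.
    // Rows are separated by tag, then combined with a nested loop
    // (the Cartesian product). Assumes every row has at least two columns.
    static List<String> reduce(String key, List<String> taggedRows) {
        List<String[]> fromA = new ArrayList<>();
        List<String[]> fromB = new ArrayList<>();
        for (String tagged : taggedRows) {
            String[] parts = tagged.split("\t", 2);
            String[] cols = parts[1].split(",");
            if (parts[0].equals("A")) {
                fromA.add(cols);
            } else {
                fromB.add(cols);
            }
        }
        List<String> out = new ArrayList<>();
        for (String[] b : fromB) {
            for (String[] a : fromA) {
                // Output only what is needed: a1, a2, b2.
                out.add(key + "," + a[1] + "," + b[1]);
            }
        }
        return out; // empty if the key appears in only one table (inner join)
    }

    // Driver: run both "mappers", group by key, then reduce each group.
    static List<String> join(List<String> tableA, List<String> tableB) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        map("A", tableA, shuffle);
        map("B", tableB, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : shuffle.entrySet()) {
            out.addAll(reduce(group.getKey(), group.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("x,1,v,v", "x,2,v,v", "x,3,v,v", "x,4,v,v");
        List<String> b = Arrays.asList("x,i,v,v,v", "x,j,v,v,v");
        join(a, b).forEach(System.out::println); // 4 * 2 = 8 joined rows
    }
}
```

In an actual Hadoop job, the two `map` calls would roughly correspond to two mapper classes (one per input file), and `reduce` to the reducer. Because the reducer's value iterator can only be traversed once, at least one side has to be buffered as above, which is worth keeping in mind if a single key has a very large number of rows.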

Answer 1: 0 votes, accepted, 2014-08-29 21:16:13

如何用具有多对多关系的两个表在Java Mapreduce上执行减少侧连接？

问题描述

1 个解决方案

解决方案1 0 已采纳 2014-08-29 21:16:13

解决方案1
0 已采纳 2014-08-29 21:16:13