
How to perform a reduce-side join in Java MapReduce on two tables that have a many-to-many relationship?

First of all, I am not sure whether this is possible. If it is, I am still not sure whether it is the right way of doing it.

What I have is:

  • Two large CSV files, called A and B, on HDFS
  • A has the following columns: a1, a2, a3, a4
  • B has the following columns: b1, b2, b3, b4, b5

What I want is:

  • To join the two files on, let's say, a1=b1

The problem I have is:

  • If there is a many-to-many relationship between the two files on the join key, how can I perform this join with Hadoop MapReduce in Java?

    As you can see from the illustration below, A has 4 matching rows for a1=x and B has 2 matching rows for b1=x. Thus, joining the two tables on a1=b1=x produces 4*2 = 8 rows (combinations), as shown in the last table. With a reduce-side join, I could not manage to do that, because it means increasing the number of key-value pairs, which seems to go against the nature of MapReduce.

How can I perform such a thing?

Why this is a problem:

Let's say table A is:

a1  a2  a3          a4
x   1   somevalue   somevalue
x   2   somevalue   somevalue
x   3   somevalue   somevalue
x   4   somevalue   somevalue

Let's say table B is:

b1  b2  b3          b4          b5
x   i   somevalue   somevalue   somevalue
x   j   somevalue   somevalue   somevalue

The result of joining the two files on a1=b1:

a1  a2  b2
x   1   i
x   2   i
x   3   i
x   4   i
x   1   j
x   2   j
x   3   j
x   4   j

For each key, a full join will always produce M x N output rows, where M and N are the numbers of rows with that key in the two tables.

Note that with a reduce-side join, the number of intermediate key-value pairs emitted by the mappers is still N + M; it is the reducer that performs the Cartesian product. So there is nothing wrong with that. Since you control the reducer, you can do further filtering and output only what you need.
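To make the reducer's role concrete, here is a minimal sketch in plain Java that only simulates the map, shuffle, and reduce steps; it does not use any Hadoop classes. The class name ReduceSideJoinSketch, the "A"/"B" source tags, the "|" separator, and the assumption that the join key is the first CSV column are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join (hypothetical names, no Hadoop API).
class ReduceSideJoinSketch {

    // Map step: emit one (joinKey, taggedRow) pair per input line.
    // Assumption: the join key is the first CSV column (a1 for A, b1 for B).
    static String[] map(String tag, String csvLine) {
        String key = csvLine.split(",", 2)[0];
        return new String[] { key, tag + "|" + csvLine };
    }

    // Reduce step: receives all M + N tagged rows for one key,
    // separates them by source, and emits the M x N combinations.
    static List<String> reduce(String key, List<String> taggedRows) {
        List<String> rowsA = new ArrayList<>();
        List<String> rowsB = new ArrayList<>();
        for (String tagged : taggedRows) {
            String[] parts = tagged.split("\\|", 2);
            if (parts[0].equals("A")) {
                rowsA.add(parts[1]);
            } else {
                rowsB.add(parts[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String b : rowsB) {
            String b2 = b.split(",")[1];
            for (String a : rowsA) {
                String a2 = a.split(",")[1];
                out.add(key + "," + a2 + "," + b2); // output columns: a1, a2, b2
            }
        }
        return out;
    }

    // Simulated shuffle: group the mapper output by key, then reduce each group.
    static List<String> join(List<String> fileA, List<String> fileB) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : fileA) {
            String[] kv = map("A", line);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String line : fileB) {
            String[] kv = map("B", line);
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : shuffled.entrySet()) {
            out.addAll(reduce(group.getKey(), group.getValue()));
        }
        return out;
    }
}
```

With the 4 rows of A and 2 rows of B from the question, the mappers emit 6 pairs and the reducer writes 8 rows. In an actual Hadoop job the same loop would sit inside the reducer; because the values there can only be iterated once, at least one side has to be buffered in memory per key, which may become a concern for heavily skewed keys.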
