
Joining rows in Apache Beam

I'm having trouble understanding whether the joins in Apache Beam (e.g. http://www.waitingforcode.com/apache-beam/joins-apache-beam/read ) can join entire rows.

For example:

I have 2 datasets, in CSV format, where the first rows are column headers.

The first:

a,b,c,d
1,2,3,4
5,6,7,8
1,2,5,4

The second:

c,d,e,f
3,4,9,10

I want to left join on columns c and d so that I end up with:

a,b,c,d,e,f
1,2,3,4,9,10
5,6,7,8,,
1,2,5,4,,

However, all the documentation on Apache Beam seems to say that PCollection objects need to be of type KV<K, V> when joining, so I have broken my PCollection objects down into collections of KV<String, String> objects (where the key is the column header and the value is the row value). But in that case (where you just have a key with a value) I don't see how the row format can be maintained. How would KV(c,7) "know" that KV(a,5) is from the same row? Is Join meant for this sort of thing at all?

My code so far:

PCollection<KV<String, String>> flightOutput = ...;
PCollection<KV<String, String>> arrivalWeatherDataForJoin = ...;
PCollection<KV<String, KV<String, String>>> output = Join.leftOuterJoin(flightOutput, arrivalWeatherDataForJoin, "");

Yes, Join is the utility class to help with joins like yours. It is a wrapper around CoGroupByKey; see the corresponding section in the docs. Its implementation is pretty short, and its tests might also have helpful examples.

The problem in your case is likely caused by how you're choosing the keys.

The KeyT in KV<KeyT, V1> in the Join library represents the key you use to match the records; it contains all the join fields. So in your case you will probably need to assign keys something like this (pseudocode):

pCollection1:

    Key     Value
   (3,4)  (1,2,3,4)
   (7,8)  (5,6,7,8)
   (5,4)  (1,2,5,4)

pCollection2:

    Key     Value
   (3,4)  (3,4,9,10)
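
The key assignment above can be sketched in plain Java, without any Beam dependency. This is only an illustration: the class name CsvJoinKey and the method keyFor are hypothetical, and the naive split on commas assumes the CSV has no quoted fields. In a pipeline, logic like this would typically run in a per-element transform that emits KV.of(key, line) for each data row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam: derives the join key from a CSV row.
class CsvJoinKey {
    // Builds a composite key such as "3,4" from the given column positions.
    // Assumes a simple CSV with no quoted or escaped commas.
    static String keyFor(String csvLine, int... keyColumns) {
        String[] fields = csvLine.split(",", -1); // -1 keeps trailing empty fields
        List<String> parts = new ArrayList<>();
        for (int col : keyColumns) {
            parts.add(fields[col]);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Columns c and d are at positions 2 and 3 in the first dataset...
        System.out.println(keyFor("1,2,3,4", 2, 3));  // prints 3,4
        // ...and at positions 0 and 1 in the second dataset.
        System.out.println(keyFor("3,4,9,10", 0, 1)); // prints 3,4
    }
}
```

Both datasets produce the same key string for matching rows, which is what lets the join pair them up.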

And the result of the join will look something like this (pseudocode):

joinResultPCollection:

   Key              Value
  (3,4)      (1,2,3,4),(3,4,9,10)
  (7,8)      (5,6,7,8),nullValue
  (5,4)      (1,2,5,4),nullValue

So you will probably need to add another transform after the join to actually merge the left and right sides into a combined row.
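
The merge step could look roughly like the plain-Java sketch below. It is not taken from Beam: CsvRowMerger and merge are hypothetical names, and the code assumes that the join columns are the leading columns of the right-hand row and that an unmatched right side arrives as null or as the empty string (the null value passed to Join.leftOuterJoin in the question).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Beam: merges one joined pair into one row.
class CsvRowMerger {
    // Appends the non-key columns of the right row to the left row.
    // rightKeyColumns: number of leading right-row columns that are join keys
    // rightColumns:    total number of columns in a right row
    static String merge(String left, String right, int rightKeyColumns, int rightColumns) {
        List<String> out = new ArrayList<>(Arrays.asList(left.split(",", -1)));
        boolean matched = right != null && !right.isEmpty();
        String[] rightFields = matched ? right.split(",", -1) : new String[0];
        for (int i = rightKeyColumns; i < rightColumns; i++) {
            // Unmatched left rows get empty fields, as in a SQL left join.
            out.add(matched ? rightFields[i] : "");
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        System.out.println(merge("1,2,3,4", "3,4,9,10", 2, 4)); // prints 1,2,3,4,9,10
        System.out.println(merge("5,6,7,8", "", 2, 4));         // prints 5,6,7,8,,
    }
}
```

Skipping the key columns of the right row avoids repeating c and d in the output, which matches the desired result in the question.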

Because you have a CSV, you could probably use actual strings like "3,4" as keys (and values). Or you could use List<> or your own custom row types.

For example, this is exactly what the Beam SQL join implementation does.
