
Multiple CoGroupByKey with the same key in Apache Beam

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key that I use for the CoGroupByKey is the same for both. Is there a way to avoid reshuffling the left side of the join after the first join completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second GroupByKey still seems to be taking a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can skip grouping one side, but I would feel safer not going down to that level at this point.

This was prior to keeping Iterables grouped after the first join.

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs that performs key lookups efficiently without loading the whole dataset into memory.
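A minimal sketch of that idea, assuming String keys and values and tiny in-memory stand-ins for the three datasets. The class name SideInputJoinSketch, the helper joinRow, and the sample data are hypothetical and only for illustration; it also assumes the Beam Java SDK and a runner (for example the direct runner) are on the classpath. The main input is processed in a single ParDo that looks up each key in two multimap side inputs, so the main input itself is not grouped or shuffled for the join.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputJoinSketch {

  // Hypothetical helper: formats one joined row. A key that is absent from a
  // side input (null lookup result) is rendered as an empty list.
  static String joinRow(String key, String mainValue, Iterable<String> a, Iterable<String> b) {
    List<String> as = new ArrayList<>();
    List<String> bs = new ArrayList<>();
    if (a != null) {
      a.forEach(as::add);
    }
    if (b != null) {
      b.forEach(bs::add);
    }
    return key + ": " + mainValue + " | A=" + as + " | B=" + bs;
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Stand-ins for the real sources (assumption: all keyed by the same String key).
    PCollection<KV<String, String>> mainInput =
        p.apply("Main", Create.of(KV.of("k1", "m1"), KV.of("k2", "m2")));
    PCollection<KV<String, String>> smallA =
        p.apply("SmallA", Create.of(KV.of("k1", "a1"), KV.of("k1", "a2")));
    PCollection<KV<String, String>> smallB =
        p.apply("SmallB", Create.of(KV.of("k1", "b1"), KV.of("k2", "b2")));

    // Each smaller dataset becomes a multimap side input: key -> all values for that key.
    final PCollectionView<Map<String, Iterable<String>>> viewA =
        smallA.apply("ViewA", View.<String, String>asMultimap());
    final PCollectionView<Map<String, Iterable<String>>> viewB =
        smallB.apply("ViewB", View.<String, String>asMultimap());

    // One pass over the main input; both lookups happen per element.
    PCollection<String> joined =
        mainInput.apply(
            "LookupJoin",
            ParDo.of(
                    new DoFn<KV<String, String>, String>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        KV<String, String> e = c.element();
                        Iterable<String> a = c.sideInput(viewA).get(e.getKey());
                        Iterable<String> b = c.sideInput(viewB).get(e.getKey());
                        c.output(joinRow(e.getKey(), e.getValue(), a, b));
                      }
                    })
                .withSideInputs(viewA, viewB));

    p.run().waitUntilFinish();
  }
}
```

View.asMap() would be the choice instead when each key appears at most once in the smaller dataset. Whether lookups stay efficient for side inputs of this size depends on the runner; the optimized behavior described above is specific to the Dataflow runner.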
