

Multiple CoGroupByKey with same key apache beam

Originally asked 2017-07-12 19:41:40. Tags: google-cloud-dataflow, dataflow, apache-beam

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key I use for the CoGroupByKey is the same in both joins. Is there a way to avoid reshuffling the left side of the join after the first one completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second GroupByKey still seems to take a lot longer than I would expect. I was going to start pulling apart CoGroupByKey to see if I can skip grouping one side, but I would feel safer not going down to that level at this point.

This was prior to keeping Iterables grouped after the first join

1 Answer

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole dataset into memory.
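A minimal sketch of that suggestion in the Beam Java SDK, in case it helps. The class name, method names, `String` element types, and sample data are all made up for illustration; the real pipeline will have its own key and value types. It assumes the Beam Java SDK and a runner (for example the direct runner for local testing) are on the classpath. The main input goes through a plain ParDo, so it is never grouped or shuffled for the join, and each element looks up its key in both side inputs.

```java
import java.util.Collections;
import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical example class; not taken from the original question.
public class SideInputJoin {

  // Pure helper: renders one joined record. Missing keys arrive as empty iterables.
  static String format(String key, String value, Iterable<String> a, Iterable<String> b) {
    return key + ": " + value + " | A=[" + String.join(",", a) + "] | B=[" + String.join(",", b) + "]";
  }

  // Joins the main input against two smaller keyed datasets via multimap side inputs.
  static PCollection<String> joinWithSideInputs(
      PCollection<KV<String, String>> main,
      PCollection<KV<String, String>> smallA,
      PCollection<KV<String, String>> smallB) {

    final PCollectionView<Map<String, Iterable<String>>> viewA =
        smallA.apply("ViewA", View.<String, String>asMultimap());
    final PCollectionView<Map<String, Iterable<String>>> viewB =
        smallB.apply("ViewB", View.<String, String>asMultimap());

    return main.apply("LookupJoin",
        ParDo.of(new DoFn<KV<String, String>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String key = c.element().getKey();
            Iterable<String> a = c.sideInput(viewA).get(key);
            Iterable<String> b = c.sideInput(viewB).get(key);
            // Treat a key that is absent from a side input as "no matches".
            if (a == null) {
              a = Collections.<String>emptyList();
            }
            if (b == null) {
              b = Collections.<String>emptyList();
            }
            c.output(format(key, c.element().getValue(), a, b));
          }
        }).withSideInputs(viewA, viewB));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Tiny in-memory stand-ins for the 1.5TB main input and the two smaller datasets.
    PCollection<KV<String, String>> main =
        p.apply("Main", Create.of(KV.of("k1", "m1"), KV.of("k2", "m2")));
    PCollection<KV<String, String>> smallA =
        p.apply("SmallA", Create.of(KV.of("k1", "a1"), KV.of("k1", "a2")));
    PCollection<KV<String, String>> smallB =
        p.apply("SmallB", Create.of(KV.of("k2", "b1")));

    joinWithSideInputs(main, smallA, smallB);
    p.run().waitUntilFinish();
  }
}
```

View.asMultimap() fits when a key can appear more than once in the smaller dataset; View.asMap() expects at most one value per key. Whether lookups against side inputs of 4.92GB and 17.35GB perform well depends on the runner: the efficient lookup behavior described above is specific to the Dataflow runner, and other runners may materialize the side input differently.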