
Speed and memory tradeoffs splitting Apache Beam PCollection in two

I've got a PCollection where each element is a (key, values) tuple like this: (key, (value1, ..., value_n)).

I need to split this PCollection into two processing branches.

As always, I need the whole pipeline to be as fast as possible and use as little RAM as possible.

Two ideas come to mind:

Option 1: Split the PColl with a DoFn with multiple outputs

class SplitInTwo(beam.DoFn):
    def process(self, kvpair):
        key, values = kvpair
        yield beam.TaggedOutput('left', (key, values[0:2]))
        yield beam.TaggedOutput('right', (key, values[2:]))

class ProcessLeft(beam.DoFn):
    def process(self, kvpair):
        key, values = kvpair
        ...
        yield (key, results)

# class ProcessRight is similar to ProcessLeft

And then build the pipeline like this:

   splitme = pcoll | beam.ParDo(SplitInTwo()).with_outputs('left','right')
   left = splitme.left | beam.ParDo(ProcessLeft())
   right = splitme.right | beam.ParDo(ProcessRight())
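The slicing logic inside SplitInTwo can be sanity-checked without a runner. Here is a pure-Python sketch of what the DoFn would emit for one element (split_pair is a hypothetical stand-in for illustration, assuming the first two values go to the 'left' branch):

```python
def split_pair(kvpair, n_left=2):
    """Mirror SplitInTwo: split a (key, values) pair into the tuples
    emitted under the 'left' and 'right' output tags."""
    key, values = kvpair
    left = ('left', (key, values[:n_left]))
    right = ('right', (key, values[n_left:]))
    return left, right

left_out, right_out = split_pair(('k', (1, 2, 3, 4)))
# left_out  -> ('left',  ('k', (1, 2)))
# right_out -> ('right', ('k', (3, 4)))
```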

Option 2: Use two different DoFns on the original PCollection

Another option is using two DoFns to read and process the same PCollection, one each for the 'left' and 'right' hand sides of the data:

class ProcessLeft(beam.DoFn):
    def process(self, kvpair):
        key = kvpair[0]
        values = kvpair[1][0:2]  # index into the values, not the key
        ...
        yield (key, result)

# class ProcessRight is similar to ProcessLeft

Building the pipeline is simpler... (plus you don't need to track which tagged outputs you have):

   left = pcoll | beam.ParDo(ProcessLeft())
   right = pcoll | beam.ParDo(ProcessRight())
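To see the difference in what each branch has to read, Option 2's access pattern can be sketched in plain Python: both branches receive the full (key, values) element and slice it themselves (the sum calls are placeholder work, purely for illustration):

```python
# Each branch reads the WHOLE element, then keeps only its slice.
def process_left(kvpair):
    key, values = kvpair
    return (key, sum(values[:2]))   # placeholder "left" work

def process_right(kvpair):
    key, values = kvpair
    return (key, sum(values[2:]))   # placeholder "right" work

pcoll = [('a', (1, 2, 3, 4)), ('b', (5, 6, 7, 8))]
left = [process_left(e) for e in pcoll]    # -> [('a', 3), ('b', 11)]
right = [process_right(e) for e in pcoll]  # -> [('a', 7), ('b', 15)]
```

Note that every element is handed to both DoFns in full, whereas Option 1 hands each downstream DoFn only its half of the values - which is what matters if the data ever crosses a shuffle boundary.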

But... is it faster? Will it need less memory than the first option?

(I'm thinking that the first option might be fused by the runner - and not just by the Dataflow runner.)

In this case, both options would be fused by the runner, so both would be roughly similar in terms of performance. If you would like to reshuffle data onto separate workers, then Option 1 is your best choice, as the serialized collections read by ProcessLeft and ProcessRight would be smaller.

   splitme = pcoll | beam.ParDo(SplitInTwo()).with_outputs('left','right')
   left = splitme.left | beam.Reshuffle() | beam.ParDo(ProcessLeft())
   right = splitme.right | beam.Reshuffle() | beam.ParDo(ProcessRight())

The Reshuffle transform ensures that your data is written to an intermediate shuffle and then consumed downstream. This breaks the fusion.
