How to combine two results and pipe them to the next step in an apache-beam pipeline

Question

See the code snippet below. I want ["metric1", "metric2"] to be the input for RunTask.process. However, it was run twice, once with "metric1" and once with "metric2".

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask()) # I want ["metric1", "metric2"] to be my input for RunTask.process. However it was run twice with "metric1" and "metric2" respectively

Answer 1

I understand that you want to join two PCollections so that the result follows this syntax: ['element1','element2']. To achieve that, you can use CoGroupByKey() instead of Flatten().

Given your code snippet, the syntax would be:

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  # CoGroupByKey() joins by key, so the elements of metric1 and metric2
  # must be (key, value) pairs
  metric3 = (
      (metric1, metric2)
      | beam.CoGroupByKey()
      | beam.ParDo(RunTask())
  )

I would like to point out the difference between Flatten() and CoGroupByKey().

1) Flatten() receives two or more PCollections that store the same data type and merges them into one logical PCollection. For example:

import apache_beam as beam

p = beam.Pipeline()

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)

# Flatten merges both PCollections into one; every element is printed on its own
result = (
    (street, city)
    | beam.Flatten()
    | beam.Map(print)
)

p.run()

And the output:

('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')

Notice that both PCollections are in the output, with one appended to the other.

2) CoGroupByKey() performs a relational join between two or more key-value PCollections that share the same key type. With this method you join by key instead of appending, as Flatten() does. Below is an example:

import apache_beam as beam

p = beam.Pipeline()

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)

# CoGroupByKey joins both PCollections on the first tuple element (the key)
results = (
    (street, city)
    | beam.CoGroupByKey()
    | beam.Map(print)
)

p.run()

And the output:

('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))

Notice that you need a common key in order to join the results. Also, this output has the shape you expect in your case.

Answer 2

Alternatively, use a side input:

metric3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))

In RunTask.process():

def process(self, element_from_metric1, metric2):
  ...

Answer 1
4, accepted, 2020-07-24 09:25:37

Answer 2
0, 2020-07-30 20:27:16
