How to combine two results and pipe it to next step in apache-beam pipeline

See the code snippet below. I want ["metric1", "metric2"] to be the input for RunTask.process. However, it was run twice, once with "metric1" and once with "metric2".

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask()) # I want ["metric1", "metric2"] to be my input for RunTask.process. However, it was run twice, once with "metric1" and once with "metric2"

  

I understand that you want to join two PCollections so that the result follows this syntax: ['element1','element2']. To achieve that, you can use CoGroupByKey() instead of Flatten().

Considering your code snippet, the syntax would be:

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  # Note: CoGroupByKey joins by key, so it expects each element to be a (key, value) tuple.
  metric3 = (
      (metric1, metric2)
      | beam.CoGroupByKey()
      | beam.ParDo(RunTask())
  )

I would like to point out the difference between Flatten() and CoGroupByKey().

1) Flatten() takes two or more PCollections that store the same data type and merges them into one logical PCollection. For example:

import apache_beam as beam

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

# The context manager runs the pipeline and waits for it to finish.
with beam.Pipeline() as p:
    street = p | 'CreateStreets' >> beam.Create(address_list)
    city = p | 'CreateCities' >> beam.Create(city_list)

    result = (
        (street, city)
        | beam.Flatten()    # merge both PCollections into one
        | beam.Map(print)   # print every element
    )

And the output:

('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')

Notice that both PCollections appear in the output, with one simply appended to the other; no elements are combined. (Beam does not guarantee element order, so the lines may appear in a different order.)

2) CoGroupByKey() performs a relational join between two or more key-value PCollections that share the same key type. It joins elements by key rather than appending them as Flatten() does. Below is an example:

import apache_beam as beam

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

# The context manager runs the pipeline and waits for it to finish.
with beam.Pipeline() as p:
    street = p | 'CreateStreets' >> beam.Create(address_list)
    city = p | 'CreateCities' >> beam.Create(city_list)

    results = (
        (street, city)
        | beam.CoGroupByKey()   # join both PCollections by key
        | beam.Map(print)       # print every joined element
        # | beam.io.WriteToText('delete.txt')
    )

And the output:

('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))

Notice that you need a key shared by both PCollections in order to join the results. Also, this output has the shape you expect in your case.
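To make the join-by-key behaviour concrete without running a pipeline, here is a plain-Python model of it. This is only an illustration of the grouping logic, not Beam's implementation: co_group_by_key is a hypothetical helper, and the shortened input lists are made up so that some keys appear in only one input.

```python
from collections import defaultdict


def co_group_by_key(*inputs):
    """Plain-Python model of a join by key over lists of (key, value) pairs."""
    # Every key maps to one list per input, in input order.
    grouped = defaultdict(lambda: tuple([] for _ in inputs))
    for index, pairs in enumerate(inputs):
        for key, value in pairs:
            grouped[key][index].append(value)
    return dict(grouped)


streets = [("leo", "George St. 32"), ("mary", "10th Av.")]
cities = [("leo", "Sydney"), ("carly", "Brisbane")]

joined = co_group_by_key(streets, cities)
for key, values in joined.items():
    print((key, values))
# ('leo', (['George St. 32'], ['Sydney']))
# ('mary', (['10th Av.'], []))
# ('carly', ([], ['Brisbane']))
```

A key that is missing from one input still shows up in the result, with an empty list in that input's position. Beam's CoGroupByKey() behaves the same way, so downstream code should be prepared for empty lists.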

Alternatively, use a side input:

metrics3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))

In RunTask's process():

def process(self, element_from_metric1, metric2):
  ...
