How to combine two results and pipe them to the next step in an apache-beam pipeline

Question

Please see the code snippet below. I want ["metric1", "metric2"] to be the input for RunTask.process. However, it was run twice, once with "metric1" and once with "metric2".

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask()) # I want ["metric1", "metric2"] to be my input for RunTask.process. However it was run twice with "metric1" and "metric2" respectively

Answer 1

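As I understand it, you want to join the two PCollections so that the next step receives them together, in the form ['element1', 'element2']. To achieve this, you can use CoGroupByKey() instead of Flatten().

Given your code snippet, the syntax would be: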

def run():
  
  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  p = beam.Pipeline(options=pipeline_options)

  root = p | 'Get source' >> beam.Create([
      "source_name" # maybe ["source_name"] makes more sense since my process function takes an array as an input?
  ])

  metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1")) # let's say it returns ["metric1"]
  metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2")) # let's say it returns ["metric2"]

  # CoGroupByKey() joins by key, so both inputs must contain (key, value) pairs
  metric3 = (
      (metric1, metric2)
      | beam.CoGroupByKey()
      | beam.ParDo(RunTask())
  )

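I would like to point out the difference between Flatten() and CoGroupByKey().

1) Flatten() takes two or more PCollections that store the same data type and merges them into a single logical PCollection. For example,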

import apache_beam as beam

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

# The context manager runs the pipeline and waits for it to finish on exit
with beam.Pipeline() as p:
    street = p | 'CreateStreets' >> beam.Create(address_list)
    city = p | 'CreateCities' >> beam.Create(city_list)

    result = (
        (street, city)
        | beam.Flatten()    # merge both PCollections into one
        | beam.Map(print)   # print every element
    )

Output:

('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')

Note that both PCollections appear in the output. However, the elements of one are simply appended to the other; nothing is grouped or joined.

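2) CoGroupByKey() performs a relational join between two or more key-value PCollections that share the same key type. With this method you join the collections by key instead of appending one to the other, as Flatten() does. Here is an example,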

import apache_beam as beam

address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]

# The context manager runs the pipeline and waits for it to finish on exit
with beam.Pipeline() as p:
    street = p | 'CreateStreets' >> beam.Create(address_list)
    city = p | 'CreateCities' >> beam.Create(city_list)

    results = (
        (street, city)
        | beam.CoGroupByKey()   # join both PCollections on the key
        | beam.Map(print)       # print every (key, grouped values) element
    )

Output:

('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))

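Note that you need a common key to join the results, which means every element must be a (key, value) pair. In your case, each metric would have to be emitted together with a key before CoGroupByKey() is applied; the grouped output then has the shape you are expecting.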

Answer 2

Alternatively, use a side input:

metrics3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))

In RunTask.process():

def process(self, element_from_metric1, metric2):
  ...

Answer 1
4 accepted 2020-07-24 09:25:37

Answer 2
0 2020-07-30 20:27:16