
Is it possible to have a non-parallel step in an Apache Beam / Dataflow job?

Suppose I have a Python Dataflow job in GCP that does the following two things:

  • Fetches some data from BigQuery

  • Calls an external API to get a certain value and filters the data from BigQuery based on the fetched value

I am able to do this; however, for the second step, the only way I could figure out how to implement it was as a class that extends DoFn, which is then called in parallel later:

import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class CallExternalServiceAndFilter(beam.DoFn):
    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # Here I have to make the HTTP call and figure out whether to yield
        # the element or not; however, this happens for each element of the
        # set, as expected.
        response_body_parsed = ...  # per-element HTTP call to the external API
        if element['property'] < response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")

with beam.Pipeline(options=PipelineOptions(), argv=argv) as p:
    rows = p | 'Read data' >> beam.io.Read(beam.io.BigQuerySource(
        dataset='test',
        project=PROJECT,
        query='Select * from test.table'
    ))

    rows = rows | 'Calling external service and filtering items' >> beam.ParDo(CallExternalServiceAndFilter())

    # ...

Is there any way that I can make the API call only once and then use the result in the parallel filtering step?

Use the __init__ function.

class CallExternalServiceAndFilter(beam.DoFn):
    def __init__(self):
        # Call the external API once, when the DoFn is constructed while the
        # pipeline is being built, and keep the parsed response.
        self.response_body_parsed = call_api()

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # Reuse the response fetched in __init__ for every element instead of
        # calling the API per element.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")

Or better yet, just call your API beforehand (on the local machine that builds the pipeline) and assign the value in __init__:

response_body_parsed = call_api()


class CallExternalServiceAndFilter(beam.DoFn):
    def __init__(self):
        # The value fetched at pipeline-construction time is stored on the
        # DoFn and serialized with it to the workers.
        self.response_body_parsed = response_body_parsed

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # Reuse the pre-fetched response for every element instead of calling
        # the API per element.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")

You said that using setup still does multiple calls. Is this still the case with __init__ (if you do the API call in the DoFn, and not beforehand)? The difference between __init__ and setup is still unclear to me.
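
For reference, a minimal sketch of a setup-based variant is below (it is not part of the original answer; SetupBasedFilter is a hypothetical name and call_api() is the same hypothetical helper used above). The relevant difference, as I understand it: __init__ runs once on the machine that constructs the pipeline and its attributes are serialized along with the DoFn, while setup runs once per DoFn instance on each worker after deserialization, so an API call placed in setup can still happen several times across the job.

# Sketch only, assuming the same hypothetical call_api() helper as above.
class SetupBasedFilter(beam.DoFn):
    def setup(self):
        # setup() runs once per DoFn instance on each worker, so this call
        # may be repeated once per instance rather than once overall.
        self.response_body_parsed = call_api()

    def process(self, element, **kwargs):
        if element['property'] < self.response_body_parsed['some_other_property']:
            yield element

With the __init__ approach above, by contrast, the call happens once while the pipeline is being built, and only the parsed value is shipped to the workers.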

