Is it possible to have a non-parallel step in an Apache Beam / Dataflow job?
Suppose I have a Python Dataflow job in GCP that does the following two things:

Fetches some data from BigQuery

Calls an external API to get a certain value and filters the BigQuery data based on the fetched value

I am able to do this; however, for the second step, the only way I figured out how to implement it was as a class that extends DoFn and to call it in a parallel way later:
class CallExternalServiceAndFilter(beam.DoFn):
    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # here I have to make the http call and figure out whether to yield
        # the element or not; however, this happens for each element of the
        # set, as expected.
        if element['property'] < response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")
with beam.Pipeline(options=PipelineOptions(), argv=argv) as p:
    rows = p | 'Read data' >> beam.io.Read(beam.io.BigQuerySource(
        dataset='test',
        project=PROJECT,
        query='Select * from test.table'
    ))

    rows = rows | 'Calling external service and filtering items' >> beam.ParDo(
        CallExternalServiceAndFilter())
    # ...
Is there any way that I can make the API call only once and then use the result in the parallel filtering step?
Use the __init__ function.
class CallExternalServiceAndFilter(beam.DoFn):
    def __init__(self):
        # The API is called once, when the DoFn is constructed.
        self.response_body_parsed = call_api()

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # No HTTP call here any more; just filter against the value fetched
        # in __init__.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")
Or better yet, just call your API beforehand (on your local machine that builds the pipeline), and assign the values in __init__.
response_body_parsed = call_api()  # runs once, on the machine that builds the pipeline

class CallExternalServiceAndFilter(beam.DoFn):
    def __init__(self):
        self.response_body_parsed = response_body_parsed

    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # No HTTP call here; the value was fetched before pipeline construction.
        if element['property'] < self.response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")
You said that using setup still does multiple calls. Is this still the case with __init__ (if you do the API call in the DoFn, and not beforehand)? The difference between __init__ and setup is still unclear to me.
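For comparison, here is a minimal sketch (not from the original post) of the setup variant, reusing the hypothetical call_api() helper from above. As I understand it, __init__ runs once on the machine that constructs the pipeline and the resulting attributes are pickled out to the workers, whereas setup() runs once per DoFn instance after it has been deserialized on a worker, so the API can still be hit several times across workers:

class CallExternalServiceAndFilterWithSetup(beam.DoFn):
    """Sketch of the setup() alternative: the API is called once per
    deserialized DoFn instance on each worker, not once per element."""

    def setup(self):
        # Each worker instance makes its own call, which would explain why
        # setup() can still produce multiple API calls in practice.
        self.response_body_parsed = call_api()

    def process(self, element, **kwargs):
        if element['property'] < self.response_body_parsed['some_other_property']:
            yield element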