Use Apache Beam arguments within the pipeline
What is the best practice for getting arguments from pipeline_options?
Dummy code example:
known_args, pipeline_args = parser.parse_known_args()
pipeline_options = PipelineOptions(pipeline_args)
with beam.Pipeline(options=pipeline_options) as pipeline:
    # here I want to use the project argument
    # I can't do pipeline.options.project
    # because a warning is displayed
    (
        pipeline
        | "Transformation 1" >> beam.Map(lambda x: known_args.pubsub_sub)  # this is fine
        | "Transformation 2" >> beam.Map(lambda x: pipeline.options.project)  # this is not fine
    )
How can I use the standard arguments that the pipeline itself needs (project, region, etc.), rather than the user-defined ones?
I think the best practice is to use options as shown below. I kept your initial code:
class MyPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--project", help="Project", required=True)
        parser.add_argument("--pubsub_sub", help="Pub Sub", required=True)

my_pipeline_options = PipelineOptions().view_as(MyPipelineOptions)
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as pipeline:
    # here I want to use the project argument
    # I can't do pipeline.options.project
    # because a warning is displayed
    (
        pipeline
        | "Transformation 1" >> beam.Map(lambda x: my_pipeline_options.pubsub_sub)
        | "Transformation 2" >> beam.Map(lambda x: my_pipeline_options.project)
    )
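I think that for predefined options like project, you have to add them to the MyPipelineOptions class in order to use them in your Python code. There is no need to access the options through the pipeline object; just use pipeline_options directly.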
Thanks for your answer.
So your answer plus the Beam documentation gave me the full picture of options management. To summarize:
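1. For custom options, use argparse to build a simple parser.
2. parser.parse_known_args() then gives us the known_args and beam_args objects.
3. known_args should be used directly in our pipeline code (e.g. known_args.pubsub_topic).
4. beam_args is used to create the PipelineOptions object, which is passed to the Pipeline() object.
5. For arguments that Apache Beam already defines (project, streaming, etc.), we should not override them in our custom parser. Instead we should create a view_as() object and use it explicitly (like known_args). Two examples below.

Example for the project argument:
# "project" will go into beam_args because we didn't define it in our parser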
known_args, beam_args = parser.parse_known_args()
pipeline_options = PipelineOptions(beam_args)
# this argument will be available in "gcp_args" because the GoogleCloudOptions class
# defines the "project" argument (you can check the source code)
gcp_args = pipeline_options.view_as(GoogleCloudOptions)
# and next if we want to use it somewhere we should do:
gcp_args.project
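
The parser object used in the snippet above is never shown. A minimal sketch of how parse_known_args() splits the command line, using only the standard library's argparse, might look like this. The helper name split_args and the flag values (my-sub, my-project) are made up for illustration; only --pubsub_sub is declared, so everything else is returned as leftover strings.

```python
import argparse


def split_args(argv):
    """Split argv into user-defined args and everything else (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Only the custom option is declared here; "--project" and "--streaming"
    # are deliberately left undeclared so that they end up in the leftovers.
    parser.add_argument("--pubsub_sub", required=True, help="Pub/Sub subscription")
    # parse_known_args() returns (namespace, list_of_unrecognized_strings).
    known_args, beam_args = parser.parse_known_args(argv)
    return known_args, beam_args


# Illustrative command line; the values are assumptions, not real resources.
known_args, beam_args = split_args(
    ["--pubsub_sub", "my-sub", "--project", "my-project", "--streaming"]
)
print(known_args.pubsub_sub)  # my-sub
print(beam_args)  # ['--project', 'my-project', '--streaming']
```

The leftover list is what would then be handed to PipelineOptions.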
Example for the streaming argument:
# "streaming" will go into beam_args because we didn't define it in our parser
known_args, beam_args = parser.parse_known_args()
pipeline_options = PipelineOptions(beam_args)
# this argument will be available in "std_args" because the StandardOptions class
# defines the "streaming" argument (you can check the source code)
std_args = pipeline_options.view_as(StandardOptions)
# and next if we want to use it somewhere we should do:
std_args.streaming
So, in practice, to see which arguments Apache Beam already defines, we should look at the source code on GitHub.