Branching and merging a list of PCollections in Apache Beam from a common input
I'm building a Dataflow pipeline and I'm having some trouble branching and merging outputs. The pipeline I want to build is as follows:

Take some input, input_data.
Branch out from input_data and merge the outputs afterwards for further calculation:
A. Extract some metric, metric_1, on input_data.
B. Extract some other metric, metric_2, on input_data.
Merge the outputs into output.

Here's some sample code that encapsulates my actual pipeline:
import apache_beam as beam
import numpy as np

class ReadData(beam.DoFn):
    def process(self, element):
        # read from source
        return [{'input': np.random.rand(100,10)}]

class GetFirstMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'first': np.random.rand(100,4)}]

class GetSecondMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'second': np.random.rand(100,3)}]

def run():
    with beam.Pipeline() as p:
        input_data = (p | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))
        output = ((metric_1, metric_2)
                  | beam.Flatten()
                  | beam.combiners.ToList()
                  | beam.Map(print)
                  )
When I run this, I get a 'PBegin' object has no attribute 'windowing' error. I've seen some examples and sample code for doing something like this in Java, but I couldn't find the right resources for doing the same in Python. My questions are as follows:

What's the right way to branch and merge PCollections (especially if the branches come from a common input)?
Is there a better pipeline design for accomplishing the same?

Thanks in advance!

In this code, your problem is that you are not 'starting' an initial PCollection. In ReadData.process, what is the value of the variable element? Well, the runner can't come up with a value, because there's no input PCollection: the ParDo is applied directly to the pipeline object (a PBegin), which is why the error mentions PBegin. You need to create your first PCollection, for example with beam.Create.

As for making them into a list, perhaps a side input strategy may work. Can you try the following:
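def run():
    with beam.Pipeline() as p:
        # Seed the pipeline with a one-element PCollection so that
        # ReadData.process has an element to receive.
        starter_pcoll = p | 'start' >> beam.Create(['any'])
        input_data = (starter_pcoll | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))
        # Flatten both branches and expose the result as a list-valued side input.
        side_in = beam.pvalue.AsList((metric_1, metric_2)
                                     | beam.Flatten())
        # beam.Create is applied twice, so each application needs its own label.
        (p
         | 'trigger' >> beam.Create(['any'])
         | beam.Map(lambda x, si: print(si), side_in))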
This should make your pipeline run. Happy to clarify further on your specific questions.