I'm building a dataflow pipeline and I'm having some trouble branching and merging outputs. The pipeline I want to build is as follows:
Read input_data.
A. Extract some metric, metric_1, on input_data.
B. Extract some other metric, metric_2, on input_data.
Both A and B run on the same input_data, and I want to merge the outputs afterwards for further calculation.
Merge the outputs, then output the result.
Here's some sample code that encapsulates my actual pipeline:
import apache_beam as beam
import numpy as np

class ReadData(beam.DoFn):
    def process(self, element):
        # read from source
        return [{'input': np.random.rand(100,10)}]

class GetFirstMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'first': np.random.rand(100,4)}]

class GetSecondMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'second': np.random.rand(100,3)}]

def run():
    with beam.Pipeline() as p:
        input_data = (p | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))
        output = ((metric_1, metric_2)
                  | beam.Flatten()
                  | beam.combiners.ToList()
                  | beam.Map(print)
                  )
When I run this, I get a 'PBegin' object has no attribute 'windowing' error. I've seen some examples and sample code for doing something like this in Java, but I couldn't find the right resources for doing the same in Python. My questions are as follows:
What's the right way to branch and merge pcollections (especially if the branches came from a common input)?
Is there a better pipeline design for accomplishing the same?
Thanks in advance!
In this code, your problem is that you are not 'starting' an initial PCollection. In ReadData.process, what is the value of the variable element?
Well, the runner can't come up with a value, because there's no input PCollection. You need to create your first PCollection, for example with beam.Create, as in the code below.
As for making them into a list, perhaps a side input strategy may work. Can you try the following:
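def run():
    with beam.Pipeline() as p:
        # Start the pipeline with a real PCollection so ReadData.process receives an element.
        starter_pcoll = p | 'start' >> beam.Create(['any'])
        input_data = (starter_pcoll | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))
        # Flatten both branches and expose the merged result as a list side input.
        side_in = beam.pvalue.AsList((metric_1, metric_2)
                                     | beam.Flatten())
        # Each Create needs a unique label; reusing the default label raises an error.
        (p
         | 'trigger' >> beam.Create(['any'])
         | beam.Map(lambda x, si: print(si), side_in))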
This should make your pipeline run. Happy to clarify about your specific questions further.