Apache Beam blocked on unbounded side input

My question is very similar to another post: Apache Beam Cloud Dataflow Streaming Stuck Side Input.

However, I tried the resolution there (apply GlobalWindows() to the side input), and it did not seem to fix my problem.

I have a Dataflow pipeline (I'm using DirectRunner for debugging) written with the Python SDK, where the main input is logs from PubSub and the side input is associated data from a mostly unchanging database. I would like to join the two so that each log is paired with side input data from approximately the same time. Excess side inputs without an associated log can be dropped.

The behavior I see is that the pipeline seems to operate as a single thread. It processes all the side input elements first, then starts processing the main input elements. If the side input is bounded (non-streaming), this is fine, and the pipeline can merge the inputs and run to completion. If the side input is unbounded (streaming), however, the main input is blocked indefinitely, apparently waiting for the side input processing to finish.

To illustrate the behavior, I made the simplified test case below.

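import logging

import apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils import timestamp

class Logger(apache_beam.DoFn):

  def __init__(self, name):
    self._name = name

  def process(self, element, w=apache_beam.DoFn.WindowParam,
              ts=apache_beam.DoFn.TimestampParam):
    logging.error('%s: %s', self._name, element)
    yield element

def cross_join(left, rights):
  for right in rights:
    yield (left, right)

def main():
  # Assumption: the original post did not show how `pipeline` was created;
  # a streaming pipeline on the default (direct) runner is assumed here.
  pipeline = apache_beam.Pipeline(options=PipelineOptions(streaming=True))
  start = timestamp.Timestamp.now()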

  # Bounded side inputs work OK.
  stop = start + 20

  # Unbounded side inputs appear to block execution of main input
  # processing.
  #stop = timestamp.MAX_TIMESTAMP

  side_interval = 5
  main_interval = 1

  side_input = (
      pipeline
      | PeriodicImpulse(
          start_timestamp=start,
          stop_timestamp=stop,
          fire_interval=side_interval,
          apply_windowing=True)
      | apache_beam.Map(lambda x: ('side', x))
      | apache_beam.ParDo(Logger('side_input'))
  )
  main_input = (
      pipeline
      | PeriodicImpulse(
          start_timestamp=start, stop_timestamp=stop,
          fire_interval=main_interval, apply_windowing=True)
      | apache_beam.Map(lambda x: ('main', x))
      | apache_beam.ParDo(Logger('main_input'))
      | 'CrossJoin' >> apache_beam.FlatMap(
          cross_join, rights=apache_beam.pvalue.AsIter(side_input))
      | 'CrossJoinLogger' >> apache_beam.ParDo(Logger('cross_join_output'))
  )
  pipeline.run()

Am I missing something that is preventing main inputs from being processed in parallel with the side inputs?

The main input can advance only once the watermark has passed the end of the corresponding side-input window. See the programming guide for details. You likely need to window both the main and side inputs, and make sure PeriodicImpulse is actually advancing the watermark.
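To make the watermark rule concrete, here is a minimal sketch in plain Python (not Beam code). It is a toy model that considers only the rule stated above and ignores triggers and allowed lateness. The function names `fixed_window_end` and `split_by_side_watermark` and the 5-second window size are made up for illustration.

```python
def fixed_window_end(ts, size):
    """End (exclusive) of the fixed window of length `size` containing `ts`."""
    return ts - ts % size + size


def split_by_side_watermark(main_timestamps, side_watermark, size):
    """Toy model: a main element is released once the side-input watermark
    has reached the end of the element's window; otherwise it stays blocked."""
    released, blocked = [], []
    for ts in main_timestamps:
        if side_watermark >= fixed_window_end(ts, size):
            released.append(ts)
        else:
            blocked.append(ts)
    return released, blocked


main_ts = [0, 1, 4, 5, 6, 11]

# Side-input watermark has reached t=5: only window [0, 5) is complete.
print(split_by_side_watermark(main_ts, side_watermark=5, size=5))
# ([0, 1, 4], [5, 6, 11])

# Side-input watermark never advances: every main element stays blocked.
print(split_by_side_watermark(main_ts, side_watermark=float("-inf"), size=5))
# ([], [0, 1, 4, 5, 6, 11])
```

The second call mirrors the symptom in the question: if the side input's watermark does not move forward, no main-input window ever becomes eligible, no matter how many side elements have been logged.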

Using the example from stackoverflow.com/q/70561769 I was able to get the side input and main input working concurrently as expected for certain cases. The answer there was to apply GlobalWindows() to the side_input.

side_input = ( pipeline
  | PeriodicImpulse(fire_interval=300, apply_windowing=False)
  | "ApplyGlobalWindow" >> WindowInto(window.GlobalWindows(),
      trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
      accumulation_mode=trigger.AccumulationMode.DISCARDING)
  | ...
)
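A rough mental model of why this variant can unblock the main input: the global window never closes, so, as I understand it, the side input becomes readable as soon as a trigger fires, and Beam's documentation says that with multiple trigger firings the value from the latest firing is used. The sketch below is plain Python, not Beam. The helper `visible_pane` and the firing times are hypothetical, and real firing times depend on the runner.

```python
import bisect


def visible_pane(firings, arrival_time):
    """Toy model of a triggered global-window side input.

    `firings` is a list of (processing_time, pane) sorted by time.
    Returns the pane from the latest firing at or before `arrival_time`,
    or None if nothing has fired yet (the main element would have to wait).
    """
    times = [t for t, _ in firings]
    i = bisect.bisect_right(times, arrival_time)
    return firings[i - 1][1] if i else None


# Illustrative firings: with DISCARDING mode each pane holds only new data.
firings = [(5, ["v1"]), (305, ["v2"])]

print(visible_pane(firings, 3))    # None
print(visible_pane(firings, 100))  # ['v1']
print(visible_pane(firings, 305))  # ['v2']
```

In this model a main element waits only until the first firing, rather than for a window end that never arrives.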

Based on experimentation, I conclude that in some cases PeriodicImpulse on the side input causes the main input to block, as summarized below:

Case 1: GOOD
  GlobalWindow
  Main input = PubSub
  Side input = PeriodicImpulse

Case 2: BAD
  FixedWindow
  Main input = PubSub
  Side input = PeriodicImpulse

Case 3: BAD
  GlobalWindow / FixedWindow
  Main input = PeriodicImpulse
  Side input = PeriodicImpulse

Case 4: GOOD
  FixedWindow
  Main input = PubSub
  Side input = PubSub

My problem now is that the side input timestamps are not aligning properly with the main input; see stackoverflow.com/q/72382440.
