
Why does my Dataflow job use very little CPU while still being unable to catch up with the message backlog?

I am quite new to Dataflow and I am having trouble getting a job to run fast.

We are testing the following set-up: we stream 1 event/s into a Pub/Sub topic, and a Dataflow pipeline is supposed to read, parse and write the events to BigQuery.

  • The CPUs are used at 4-5%, which does not seem very efficient
  • Overall processing seems very slow, even though the parsing itself is very quick
  • Even after removing the last step, the pipeline still processes only 0.5 event/s

Here is my pipeline code in Python:

import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# BIGQUERY_SCHEMA (the output table schema) is defined elsewhere and omitted here.


class CustomParsing(beam.DoFn):
    """Custom DoFn to apply a custom parsing transformation."""

    @staticmethod
    def if_exists_in_data(data, key):
        if key in data.keys():
            return data[key]
        return None

    def to_runner_api_parameter(self, unused_context):
        return "beam:transforms:custom_parsing:custom_v0", None

    def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
        data = json.loads(message.data.decode('utf-8'))
        yield {
            "data": message.data,
            "dataflow_timestamp": timestamp.to_rfc3339(),
            "created_at": self.if_exists_in_data(data, 'createdAt')
        }


def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_subscription",
        help='Input PubSub subscription of the form "projects/<PROJECT>/subscriptions/<SUBSCRIPTION>"'
    )
    parser.add_argument(
        "--output_table", help="Output BigQuery Table"
    )

    known_args, pipeline_args = parser.parse_known_args()

    # Creating pipeline options
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(StandardOptions).streaming = True

    # Defining our pipeline and its steps
    p = beam.Pipeline(options=pipeline_options)
    (
        p
        | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
        )
        | "CustomParse" >> beam.ParDo(CustomParsing())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            known_args.output_table,
            schema=BIGQUERY_SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
        )
    )
    pipeline_result = p.run()


if __name__ == "__main__":
    run()

Then I launch my job with the following bash command:

poetry run python $filepath \
      --streaming \
      --update \
      --runner DataflowRunner \
      --project $1 \
      --region xx \
      --subnetwork regions/xx/subnetworks/$3 \
      --temp_location gs://$4/temp \
      --job_name $jobname  \
      --max_num_workers 10 \
      --num_workers 2 \
      --input_subscription projects/$1/subscriptions/$jobname \
      --output_table $1:yy.$filename \
      --autoscaling_algorithm=THROUGHPUT_BASED \
      --dataflow_service_options=enable_prime

I am working with Python 3.7.8 and apache-beam==2.32.0. I can't see any significant error in the logs, but I might not be looking at the logs that would help me debug.

Do you have any clue about what to look for in the logs, or whether I am doing something wrong here?

You probably have low performance because Dataflow has fused all of your DoFns into one step. This means your steps run sequentially, one after another, rather than in parallel (and you end up waiting for the Pub/Sub and BigQuery API calls to complete).

To prevent fusion, you can simply add a Reshuffle() step to your pipeline after ReadFromPubSub():

    p = beam.Pipeline(options=pipeline_options)
    (
        p
        | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
        ) 
        | "Prevent fusion" >> beam.transforms.util.Reshuffle()
        | "CustomParse" >> beam.ParDo(CustomParsing())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            known_args.output_table,
            schema=BIGQUERY_SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
        )
    )

Google has great documentation about preventing fusion and about how to check whether your running Dataflow job has fused steps.
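If you want to check this on your own job, and assuming you have jq installed, you can inspect the execution stages roughly like this (a sketch based on the fusion-optimization docs; JOB_ID and the region are placeholders for your own job):

# Hypothetical check: list each execution stage of the job and the steps
# Dataflow has fused into it. Replace JOB_ID and the region with your own values.
gcloud dataflow jobs describe JOB_ID --region=xx --full --format=json \
  | jq '.pipelineDescription.executionPipelineStage[] | {stage_id: .id, stage_name: .name, fused_steps: .componentTransform}'

Before adding the Reshuffle you would expect ReadFromPubSub, CustomParse and the WriteToBigQuery steps to show up inside a single stage; after adding it they should be split across separate stages.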

BTW, you can optimize your code by removing your if_exists_in_data staticmethod, which has the same functionality as dict.get().

You can drop the explicit key in data.keys() membership check and just use data.get('createdAt'):

def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
    data = json.loads(message.data.decode('utf-8'))
    yield {
        "data": message.data,
        "dataflow_timestamp": timestamp.to_rfc3339(),
        "created_at": data.get('createdAt')
    }

That optimization will help when you have many more events per second than you do now ;-).
