Why does my Dataflow job use very little CPU while it is not able to catch up with the message backlog?

I am quite new to Dataflow and I am having trouble getting a job to run fast.

We are testing the following set-up: we stream 1 event/s into Pub/Sub and have a Dataflow pipeline that is supposed to read, parse, and write to BigQuery.

  • The CPUs are used at 4-5%, which seems not very efficient
  • The overall processing seems very slow, but the parsing itself seems very fast
  • Even after removing the last step, it is still processing only 0.5 event/s

Here is my pipeline code in Python:

import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# BIGQUERY_SCHEMA (the BigQuery table schema) is defined elsewhere in the script


class CustomParsing(beam.DoFn):
    """ Custom ParallelDo class to apply a custom transformation """

    @staticmethod
    def if_exists_in_data(data, key):
        if key in data.keys():
            return data[key]
        return None

    def to_runner_api_parameter(self, unused_context):
        return "beam:transforms:custom_parsing:custom_v0", None

    def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
        data = json.loads(message.data.decode('utf-8'))
        yield {
            "data": message.data,
            "dataflow_timestamp": timestamp.to_rfc3339(),
            "created_at": self.if_exists_in_data(data, 'createdAt')
        }


def run():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_subscription",
        help='Input PubSub subscription of the form "projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'
    )
    parser.add_argument(
        "--output_table", help="Output BigQuery Table"
    )

    known_args, pipeline_args = parser.parse_known_args()

    # Creating pipeline options
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(StandardOptions).streaming = True

    # Defining our pipeline and its steps
    p = beam.Pipeline(options=pipeline_options)
    (
        p
        | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
            subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
        )
        | "CustomParse" >> beam.ParDo(CustomParsing())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            known_args.output_table,
            schema=BIGQUERY_SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
        )
    )
    pipeline_result = p.run()


if __name__ == "__main__":

    run()

Then I launch my job with the following bash command:

poetry run python $filepath \
      --streaming \
      --update \
      --runner DataflowRunner \
      --project $1 \
      --region xx \
      --subnetwork regions/xx/subnetworks/$3 \
      --temp_location gs://$4/temp \
      --job_name $jobname  \
      --max_num_workers 10 \
      --num_workers 2 \
      --input_subscription projects/$1/subscriptions/$jobname \
      --output_table $1:yy.$filename \
      --autoscaling_algorithm=THROUGHPUT_BASED \
      --dataflow_service_options=enable_prime

I am working with Python 3.7.8 and apache-beam==2.32.0. I can't see any significant error in the logs, but I might not be looking at the logs that would help me debug.

Do you have any clue about what to look at in the logs, or whether I am doing something wrong?

You probably have low performance because Dataflow has fused all your DoFns into one step. This means your steps run one after another sequentially, not in parallel (and you'll be waiting for the Pub/Sub and BigQuery API calls to complete).

To prevent fusion, you can just add a Reshuffle() step in your pipeline after ReadFromPubSub():

p = beam.Pipeline(options=pipeline_options)
(
    p
    | "ReadFromPubSub" >> beam.io.gcp.pubsub.ReadFromPubSub(
        subscription=known_args.input_subscription, timestamp_attribute=None, with_attributes=True
    )
    | "Prevent fusion" >> beam.transforms.util.Reshuffle()
    | "CustomParse" >> beam.ParDo(CustomParsing())
    | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        known_args.output_table,
        schema=BIGQUERY_SCHEMA,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER
    )
)

Google has great documentation about preventing fusion and how to see whether your running Dataflow job has fused steps.

BTW, you can optimize your code by removing your if_exists_in_data staticmethod, which has the same functionality as dict.get().

You can get rid of the explicit lookup in data.keys() by just using data.get('createdAt'):

def process(self, message: beam.io.PubsubMessage, timestamp=beam.DoFn.TimestampParam, window=beam.DoFn.WindowParam):
    data = json.loads(message.data.decode('utf-8'))
    yield {
        "data": message.data,
        "dataflow_timestamp": timestamp.to_rfc3339(),
        "created_at": data.get('createdAt')
    }
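
For reference, here is a minimal standalone sketch showing that dict.get() with no default behaves exactly like the original if_exists_in_data helper (the sample dict and its values are just an illustration):

sample = {"createdAt": "2021-09-01T00:00:00Z"}

# dict.get() returns None when the key is missing, just like the original helper
assert sample.get('createdAt') == "2021-09-01T00:00:00Z"
assert sample.get('missingKey') is None                    # no KeyError, just None
assert sample.get('missingKey', 'fallback') == 'fallback'  # an explicit default is also possible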

That optimization will help you when you have many more events/s than you do now. ;-)
