
Limit number of processed elements in Beam/Dataflow stream job

I have a Beam streaming job running on the Dataflow runner. It loads requests from PubSub (using Python's apache_beam.io.ReadFromPubSub), then fetches data from BigTable, does a heavy computation on the data, and writes to PubSub again.

with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Receive" >> beam.io.ReadFromPubSub(topic=TOPIC_READ)
            | "Parse" >> beam.ParDo(Parse())
            | "Fetch" >> beam.ParDo(FetchFromBigtable(project, args.bt_instance, args.bt_par, args.bt_batch))
            | "Process" >> beam.ParDo(Process())
            | "Publish" >> beam.io.WriteToPubSub(topic=TOPIC_WRITE)
        )

Basically I don't need any windowing; I would just like to limit the number of elements processed in parallel on one machine (i.e. control the parallelism by the number of workers). Otherwise it causes out-of-memory errors during the heavy computation, and I also need to limit the rate of BigTable requests.

I'm using a standard 2-CPU machine, so I would expect it to process 2 elements in parallel - I also set --number_of_worker_harness_threads=2 and --sdk_worker_parallelism=1. For some reason, though, I'm seeing many elements processed in parallel by multiple threads, which causes the memory and rate-limit problems. I guess those are bundles processed in parallel, based on the logs (e.g. work: "process_bundle-105").


I tried to hack around it by using a semaphore inside processElement (to process just one element per DoFn instance) and it works, but autoscaling does not kick in, and it looks like a pure hack which may have other consequences.
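
For context, a minimal sketch of that semaphore workaround, assuming a Python DoFn and a hypothetical heavy_computation placeholder for the actual work (illustrative only, not a recommendation):

    import threading

    import apache_beam as beam

    class Process(beam.DoFn):
        def setup(self):
            # One permit per DoFn instance: process at most one element at a time.
            self._semaphore = threading.Semaphore(1)

        def process(self, element):
            with self._semaphore:
                result = heavy_computation(element)  # hypothetical placeholder
            yield result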

What would you recommend? How can I limit the number of bundles processed in parallel? Ideally just one bundle per worker harness thread? Is Beam/Dataflow suitable for such a use case, or is it better to achieve this with plain Kubernetes with autoscaling?

EDIT:

Running on Beam SDK 2.28.0

I'd like to limit the parallelism, but I have not described well the symptoms that led me to that conclusion.

  1. Sometimes I got timeouts in the Fetch stage:

     Deadline of 60.0s exceeded while calling functools.partial(<bound method PartialRowsData._read_next of <google.cloud.bigtable.row_data.PartialRowsData object at 0x7f16b405ba50>>)

  2. Processing of one element in the Process stage slows down significantly (to minutes instead of seconds), and sometimes it even gets stuck (probably because of memory problems).

Below are logs from one worker, logged before and after processing of 1 element in the Process stage (single-threaded), filtered by jsonPayload.worker and jsonPayload.portability_worker_id (i.e. I hope those should be logs from one container). I can see many more than 12 elements being processed at a single moment.

[Logs from the Process stage]

I have had success solving this same kind of problem for Dataflow and Elasticsearch by making use of Stateful Processing. You could use GroupIntoBatches to reduce the parallelism if your sink cannot keep up with the pace of the rest of the pipeline.

As far as I understand, states are maintained by the runner on a per-key-per-window basis. To use stateful processing, your data will need to have keys. Those keys can be arbitrary and ignored by the DoFn you use to consume the elements.

You mentioned that you don't need windowing, and if you're not using any windowing currently, that would imply that you're using the default singular Global Window. In this case, whatever number of distinct keys you arbitrarily assign to your data will be the maximum number of parallelized states maintained. Just be aware that this solution won't be portable to all runners, as stateful processing is not globally supported by all runners.
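
A rough sketch of how that could be wired into the pipeline from the question, assuming the same transforms and topics and an SDK recent enough that GroupIntoBatches supports max_buffering_duration_secs; the key count, batch size, and buffering duration are illustrative assumptions, not tuned values:

    import random

    import apache_beam as beam

    NUM_KEYS = 4  # assumed cap on the number of parallel per-key states

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Receive" >> beam.io.ReadFromPubSub(topic=TOPIC_READ)
            | "Parse" >> beam.ParDo(Parse())
            # Assign an arbitrary key; the number of distinct keys bounds the
            # number of per-key states the runner maintains.
            | "Key" >> beam.Map(lambda msg: (random.randrange(NUM_KEYS), msg))
            | "Batch" >> beam.GroupIntoBatches(batch_size=10, max_buffering_duration_secs=5)
            # Drop the key and flatten the batches back into single elements.
            | "Unbatch" >> beam.FlatMap(lambda kv: kv[1])
            | "Fetch" >> beam.ParDo(FetchFromBigtable(project, args.bt_instance, args.bt_par, args.bt_batch))
            | "Process" >> beam.ParDo(Process())
            | "Publish" >> beam.io.WriteToPubSub(topic=TOPIC_WRITE)
        )

How much this actually caps the fan-out depends on how the runner schedules the per-key state, so treat it as a starting point to measure rather than a guaranteed limit.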

Dataflow launches one SDK worker container per core, so in your case there will be 2 worker containers (processes) per machine. Each worker process has an unbounded thread pool for processing the bundles, but I think only one bundle is actually being processed at a time per process due to the Python GIL.

You can use --experiments=no_use_multiple_sdk_containers to limit the number of SDK containers to one (since it seems that your use case does not care that much about throughput).
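
One way to pass that experiment, together with the thread limit mentioned in the question, when building the options programmatically might look like this (a sketch, assuming you construct PipelineOptions from keyword arguments; adjust to however you build your options today):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Illustrative only: flag names follow the standard Beam/Dataflow options.
    pipeline_options = PipelineOptions(
        streaming=True,
        experiments=["no_use_multiple_sdk_containers"],
        number_of_worker_harness_threads=2,
    )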

