
Limit number of processed elements in Beam/Dataflow stream job

I have a Beam streaming job running on the Dataflow runner. It loads requests from PubSub (using Python's apache_beam.io.ReadFromPubSub), then fetches data from BigTable, does a heavy computation on the data, and writes to PubSub again.

with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Receive" >> beam.io.ReadFromPubSub(topic=TOPIC_READ)
            | "Parse" >> beam.ParDo(Parse())
            | "Fetch" >> beam.ParDo(FetchFromBigtable(project, args.bt_instance, args.bt_par, args.bt_batch))
            | "Process" >> beam.ParDo(Process())
            | "Publish" >> beam.io.WriteToPubSub(topic=TOPIC_WRITE)
        )

Basically I don't need any windowing; I would just like to limit the number of elements processed in parallel on one machine (i.e. control the parallelism by the number of workers). Otherwise it causes out-of-memory errors during the heavy computation, and I also need to limit the rate of BigTable requests.

I'm using a standard 2-CPU machine, so I would expect it to process 2 elements in parallel - I also set --number_of_worker_harness_threads=2 and --sdk_worker_parallelism=1. For some reason, though, I'm seeing many elements processed in parallel by multiple threads, which causes the memory and rate-limit problems. I guess those are bundles processed in parallel, based on the logs (e.g. work: "process_bundle-105").


I tried to hack around it by using a semaphore inside processElement (to process just one element per DoFn instance) and it works, but autoscaling does not kick in, and it looks like a pure hack which may have other consequences.
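
For context, a minimal sketch of that semaphore workaround, assuming a Python DoFn and a hypothetical heavy_computation placeholder for the actual work (illustrative only, not a recommendation):

    import threading

    import apache_beam as beam

    class Process(beam.DoFn):
        def setup(self):
            # One permit per DoFn instance: process at most one element at a time.
            self._semaphore = threading.Semaphore(1)

        def process(self, element):
            with self._semaphore:
                result = heavy_computation(element)  # hypothetical placeholder
            yield result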

What would you recommend? How can I limit the number of bundles processed in parallel? Ideally just one bundle per worker harness thread? Is Beam/Dataflow suitable for such a use case, or is it better to achieve this with plain Kubernetes with autoscaling?

EDIT:

Running on Beam SDK 2.28.0

I'd like to limit the parallelism, but I have not described well the symptoms that led me to that conclusion.

  1. Sometimes I got timeouts in the Fetch stage:

     Deadline of 60.0s exceeded while calling functools.partial(<bound method PartialRowsData._read_next of <google.cloud.bigtable.row_data.PartialRowsData object at 0x7f16b405ba50>>)

  2. Processing of one element in the Process stage slows down significantly (to minutes instead of seconds), and sometimes it even gets stuck (probably because of memory problems).

Below are logs from one worker, logged before and after processing of 1 element in the Process stage (single-threaded), filtered by jsonPayload.worker and jsonPayload.portability_worker_id (i.e. I hope those should be logs from one container). I can see many more than 12 elements being processed at a single moment.

[Logs from the Process stage]

I have had success solving this same kind of problem for Dataflow and Elasticsearch by making use of Stateful Processing. You could use GroupIntoBatches to reduce the parallelism if your sink cannot keep up with the pace of the rest of the pipeline.

As far as I understand, states are maintained by the runner on a per-key-per-window basis. To use stateful processing, your data will need to have keys. Those keys can be arbitrary and ignored by the DoFn you use to consume the elements.

You mentioned that you don't need windowing, and if you're not using any windowing currently, that would imply that you're using the default singular Global Window. In this case, whatever number of distinct keys you arbitrarily assign to your data will be the maximum number of parallelized states maintained. Just be aware that this solution won't be portable to all runners, as stateful processing is not globally supported by all runners.
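
A rough sketch of how that could be wired into the pipeline from the question, assuming the same transforms and topics and an SDK recent enough that GroupIntoBatches supports max_buffering_duration_secs; the key count, batch size, and buffering duration are illustrative assumptions, not tuned values:

    import random

    import apache_beam as beam

    NUM_KEYS = 4  # assumed cap on the number of parallel per-key states

    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Receive" >> beam.io.ReadFromPubSub(topic=TOPIC_READ)
            | "Parse" >> beam.ParDo(Parse())
            # Assign an arbitrary key; the number of distinct keys bounds the
            # number of per-key states the runner maintains.
            | "Key" >> beam.Map(lambda msg: (random.randrange(NUM_KEYS), msg))
            | "Batch" >> beam.GroupIntoBatches(batch_size=10, max_buffering_duration_secs=5)
            # Drop the key and flatten the batches back into single elements.
            | "Unbatch" >> beam.FlatMap(lambda kv: kv[1])
            | "Fetch" >> beam.ParDo(FetchFromBigtable(project, args.bt_instance, args.bt_par, args.bt_batch))
            | "Process" >> beam.ParDo(Process())
            | "Publish" >> beam.io.WriteToPubSub(topic=TOPIC_WRITE)
        )

How much this actually caps the fan-out depends on how the runner schedules the per-key state, so treat it as a starting point to measure rather than a guaranteed limit.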

Dataflow launches one SDK worker container per core, so in your case there will be 2 worker containers (processes) per machine. Each worker process has an unbounded thread pool for processing the bundles, but I think only one bundle is actually being processed at a time per process due to the Python GIL.

You can use --experiments=no_use_multiple_sdk_containers to limit the number of SDK containers to one (since it seems that your use case does not care that much about throughput).
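
One way to pass that experiment, together with the thread limit mentioned in the question, when building the options programmatically might look like this (a sketch, assuming you construct PipelineOptions from keyword arguments; adjust to however you build your options today):

    from apache_beam.options.pipeline_options import PipelineOptions

    # Illustrative only: flag names follow the standard Beam/Dataflow options.
    pipeline_options = PipelineOptions(
        streaming=True,
        experiments=["no_use_multiple_sdk_containers"],
        number_of_worker_harness_threads=2,
    )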

