
How to manage backpressure with Apache Beam

I have a very basic Apache Beam pipeline which runs on GCP Dataflow and reads some data from PubSub, transforms it, and writes it to a Postgres DB. All this is done with the standard reader/writer components of Apache Beam. The issue is that when my pipeline starts to receive a really big amount of data, my Postgres end suffers from deadlock errors due to waits on ShareLocks.
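For context, a minimal sketch of such a pipeline could look like the following; the subscription, connection details, table, and statement are placeholders I made up, not details from the actual setup:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class PubSubToPostgres {
  public static void main(String[] args) {
    StreamingOptions options =
        PipelineOptionsFactory.fromArgs(args).as(StreamingOptions.class);
    options.setStreaming(true);

    Pipeline pipeline = Pipeline.create(options);
    pipeline
        // Read raw messages from a Pub/Sub subscription.
        .apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
        // ... transformation steps elided ...
        // Write each element to Postgres with the standard JDBC sink.
        .apply("WriteToPostgres",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver",
                            "jdbc:postgresql://db-host:5432/my_db")
                        .withUsername("user")
                        .withPassword("password"))
                .withStatement("INSERT INTO my_table (payload) VALUES (?)")
                .withPreparedStatementSetter(
                    (element, stmt) -> stmt.setString(1, element)));
    pipeline.run();
  }
}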

It's obvious that such things happen because the Postgres end is being overloaded. My pipeline tries to write too much, too quickly, so to avoid such a situation it merely needs to slow down. Thus we might use a mechanism such as backpressure. I've tried to dig out any information about backpressure configuration for Apache Beam and, unfortunately, the official documentation seems to be silent on such matters.

I get overwhelmed with the following kind of exceptions:

java.sql.BatchUpdateException: Batch entry <NUMBER>
<MY_STATEMENT>
 was aborted: ERROR: deadlock detected
  Detail: Process 87768 waits for ShareLock on transaction 1939992; blocked by process 87769.
Process 87769 waits for ShareLock on transaction 1939997; blocked by process 87768.
  Hint: See server log for query details.
  Where: while inserting index tuple (5997152,9) in relation "<MY_TABLE>"
Call getNextException to see other errors in the batch.

I would like to know if there is any backpressure toolkit or something like that to help me manage my issue without writing my own PostgresIO.Writer.

Many thanks.

Assuming that you use JdbcIO to write into Postgres, you can try to increase the batch size (see withBatchSize(long batchSize)); it is 1,000 records by default, which is probably not enough.
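Building on the sketch above, that is a single extra call on the write transform (the value 10,000 here is an arbitrary example, not a recommendation):

JdbcIO.<String>write()
    // ... data source, statement, and setter configured as in the sketch above ...
    .withBatchSize(10_000L)  // default is 1000 records per batch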

Also, in case of SQL exceptions, if you want to do retries then you need to make sure that you use a proper retry strategy (see withRetryStrategy(RetryStrategy retryStrategy)). In this case, FluentBackoff will be applied.
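For example, a retry strategy that retries a batch only when Postgres reports a deadlock could be sketched like this; limiting retries to SQLSTATE 40P01 (Postgres's "deadlock detected" code) is my assumption about what you want to retry:

// Retry only on Postgres deadlocks; when this returns true, JdbcIO
// re-runs the failed batch, backing off between attempts.
JdbcIO.RetryStrategy deadlockOnly =
    exception -> "40P01".equals(exception.getSQLState());

JdbcIO.<String>write()
    // ... data source, statement, and setter configured as in the sketch above ...
    .withRetryStrategy(deadlockOnly);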
