
Messages stuck in GBK when streaming from multiple PubSub topics to BigQuery using DataFlow?

Original 2018-08-23 11:08:47 9 1 java/ google-bigquery/ google-cloud-dataflow/ google-cloud-pubsub/ apache-beam-io

I have a Java DataFlow pipeline with the following parts:

PubSub subscriber reading from several topics
Flatten.pCollections operation
Transform from PubsubMessage to TableRow
BigQuery writer to write all to a dynamic table
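For readers trying to reproduce this, the four parts above might be wired together roughly as follows. This is a minimal sketch against the Apache Beam Java SDK, not the asker's actual code: the `payload` column, the `target_table` attribute used for routing, and the `my-project:my_dataset` prefix are all hypothetical.

```java
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PipelineSketch {

  // Hypothetical mapping: keep the raw payload plus one routing attribute.
  static TableRow toTableRow(PubsubMessage msg) {
    return new TableRow()
        .set("payload", new String(msg.getPayload(), StandardCharsets.UTF_8))
        .set("target_table", msg.getAttribute("target_table"));
  }

  // Expects at least one subscription path of the form
  // projects/<project>/subscriptions/<name>.
  static void build(Pipeline p, List<String> subscriptions) {
    // 1. One Pub/Sub read per subscription.
    List<PCollection<PubsubMessage>> inputs = new ArrayList<>();
    for (int i = 0; i < subscriptions.size(); i++) {
      inputs.add(
          p.apply(
              "ReadSubscription" + i,
              PubsubIO.readMessagesWithAttributes().fromSubscription(subscriptions.get(i))));
    }

    // 2. Merge all inputs into a single PCollection.
    PCollection<PubsubMessage> merged =
        PCollectionList.of(inputs).apply("Flatten", Flatten.pCollections());

    // 3. Convert each PubsubMessage to a TableRow.
    PCollection<TableRow> rows =
        merged.apply(
            "ToTableRow",
            MapElements.via(
                new SimpleFunction<PubsubMessage, TableRow>() {
                  @Override
                  public TableRow apply(PubsubMessage msg) {
                    return toTableRow(msg);
                  }
                }));

    // 4. Write to a table chosen per element (hypothetical routing rule).
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to(
                input ->
                    new TableDestination(
                        "my-project:my_dataset." + input.getValue().get("target_table"),
                        "Routed by the target_table attribute"))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}
```

`CREATE_NEVER` is used here only so the sketch does not need a table schema; the original pipeline may well create tables on demand.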

When there's more than one PubSub-topic in the list of subscriptions to connect to, all elements get stuck in the GroupByKey operation in the Reshuffle operation within the BigQuery writer. I've let it run for a couple of hours after sending several dozen test messages, but nothing is written to BigQuery.

I found the following three workarounds, each of which works on its own:

Add a 'withTimestampAttribute' call on the Pubsub subscriptions. The name of the attribute does not matter at all - it can be any existing or non-existing attribute on the incoming messages
Reduce the number of PubSub subscriptions to just 1
Remove the Flatten.pCollections operation in between, creating multiple separate pipelines doing the exact same thing
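The first workaround could look something like the sketch below, assuming the Apache Beam Java SDK's `PubsubIO`. The attribute name `ts` is hypothetical; per the description above, it does not even have to exist on the incoming messages.

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;

public class TimestampAttributeWorkaround {

  // Hypothetical attribute name, chosen only for illustration.
  static final String TIMESTAMP_ATTRIBUTE = "ts";

  // Builds the read transform for one subscription with the extra
  // withTimestampAttribute call that made the messages flow again.
  static PubsubIO.Read<PubsubMessage> readFrom(String subscription) {
    return PubsubIO.readMessagesWithAttributes()
        .fromSubscription(subscription)
        .withTimestampAttribute(TIMESTAMP_ATTRIBUTE);
  }
}
```

Each subscription would then be read with `pipeline.apply(TimestampAttributeWorkaround.readFrom(path))` before the results are flattened.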

The messages are not intentionally timestamped - writing them to BigQuery using just the PubsubMessage timestamp is completely acceptable.

It also confuses me that even adding a non-existing timestamp attribute seems to fix the issue. I debugged the issue to print out the timestamps within the pipeline, and they are comparable in both cases; when specifying a non-existing timestamp attribute, it seems to fall back to the pubsub timestamp anyway.

What could be causing this issue? How can I resolve it? For me, the most acceptable work-around is removing the Flatten.pCollections operation as it doesn't strictly complicate the code, but I can't get my head around the reason why it fails.

1 answer

Did you apply windowing to your pipeline? The Beam documentation warns you about using an unbounded PCollection (like Pub/Sub) without any windowing or triggering:

If you don't set a non-global windowing function or a non-default trigger for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your job will fail.

In your case the pipeline does not fail on construction, but the messages are stuck in the GroupByKey, because it is waiting for the window to end. Try adding a window before the BigQuery writer and see if that solves the problem.
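A minimal sketch of that suggestion, assuming the Apache Beam Java SDK: the one-minute fixed window is an arbitrary choice for illustration, and whether this resolves the stall in the asker's pipeline has not been verified here.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowBeforeWrite {

  // Hypothetical window size; pick one that suits the write latency you need.
  static final Duration WINDOW_SIZE = Duration.standardMinutes(1);

  // A windowing transform that assigns each row to a fixed one-minute window.
  static Window<TableRow> oneMinuteWindows() {
    return Window.<TableRow>into(FixedWindows.of(WINDOW_SIZE));
  }

  // Applies the window to the rows right before they reach the BigQuery writer.
  static PCollection<TableRow> windowed(PCollection<TableRow> rows) {
    return rows.apply("FixedWindows", oneMinuteWindows());
  }
}
```

The BigQuery write is then applied to the result of `windowed(rows)` instead of to `rows` directly.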




solution1 2 ACCEPTED 2018-08-23 11:51:11
