DirectRunner does not read from Pub/Sub the way I specified with FixedWindows in Beam Java SDK

I am currently working on a Dataflow pipeline that reads streaming data from Pub/Sub with Apache Beam Java SDK 2.8.0. The pipeline is just the PubsubToText.java template from Google.

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToText.java

Deployment to the cloud with DataflowRunner works as expected, but the pipeline does not run correctly with DirectRunner in my local environment, which makes developing pipelines much harder.

When I set the FixedWindows duration to 30s, for example, DataflowRunner on the cloud generates files every 30 seconds, as expected.

When I use the same duration with DirectRunner in a local environment, however, it does not emit files every 30 seconds. Instead, it generates files at irregular intervals.

For example, it emits the first data after 4 minutes, at which point the 8 files that should have been created by then are all generated at once. The next batch arrives after 5 minutes, the next after 3 minutes, and so on. This makes the local development process extremely time-consuming and frustrating.

Why am I observing this?

None of the following helped: switching Java from 8 to 11, switching the Beam SDK from 2.8.0 to 2.9.0 or 2.10.0, moving from a local environment to a GCE instance, or changing the pipeline output from GCS to local files.

Here are the steps to reproduce the problem:

  1. git clone https://github.com/GoogleCloudPlatform/DataflowTemplates
  2. Remove the <scope>test</scope> line for beam-runners-direct-java from pom.xml so that DirectRunner is available at runtime.
  3. Compile and run the program as suggested in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToText.java , but change the runner to DirectRunner and add --outputShardTemplate=WP-SS-of-NN , an option that is omitted there but required when running locally.
  4. Remove the --project , --stagingLocation , and --tempLocation lines as well, since the pipeline won't be deployed to the cloud.
  5. Observe that it takes an extremely long time to emit files, even with, for example, windowDuration=30s

I suspected it was a Pub/Sub-related problem, but when I ran tcpdump, I saw the pipeline connect to Pub/Sub and start pulling data immediately. It is likely a DirectRunner-specific issue.

I don't know why this happens, but I found a resolution to this problem. DataflowRunner does not require you to set a trigger for windowing to work as expected, but with DirectRunner you must specify an explicit trigger.

After appending .triggering(...) to the Window.into(...) transform, the problem goes away.
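
To make the fix concrete, here is a minimal sketch of a fixed window with an explicit trigger. The answer above does not say which trigger was used, so the processing-time trigger below is an assumption, and the class and method names (ExplicitTriggerWindow, windowWithTrigger) are hypothetical. Beam requires an allowed lateness and an accumulation mode once a trigger is set on a non-global window, so both are specified; the values chosen here are also assumptions.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

/** Hypothetical helper: builds a fixed window that carries an explicit trigger. */
public class ExplicitTriggerWindow {

  /**
   * Returns a fixed-window transform of the given size with an explicit trigger.
   * The trigger choice is an assumption: it fires repeatedly, each time
   * windowSize of processing (wall-clock) time has passed since the first
   * element arrived in the current pane.
   */
  public static Window<String> windowWithTrigger(Duration windowSize) {
    return Window.<String>into(FixedWindows.of(windowSize))
        .triggering(
            Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(windowSize)))
        // Required once a trigger is set: how late data may arrive (assumed: none).
        .withAllowedLateness(Duration.ZERO)
        // Required once a trigger is set: each firing emits only new elements.
        .discardingFiredPanes();
  }
}
```

In PubsubToText.java this would stand in for the existing Window.into(FixedWindows.of(...)) argument of the windowing .apply(...) step, for example ExplicitTriggerWindow.windowWithTrigger(Duration.standardSeconds(30)). Because this trigger can fire more than once per window, a window may produce more than one pane of output.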
