
DirectRunner does not read from Pub/Sub the way I specified with FixedWindows in Beam Java SDK

I am currently working on a Dataflow pipeline that reads streaming data from Pub/Sub with Apache Beam Java SDK 2.8.0. The pipeline is just the PubsubToText.java template from Google.


Deploying to the cloud with DataflowRunner works as expected, but the pipeline does not behave correctly with DirectRunner in my local environment, which makes developing pipelines much harder.

When I set the FixedWindows duration to 30s, for example, DataflowRunner on the cloud generates files every 30 seconds, as expected.

When I set the same duration with DirectRunner in a local environment, however, it does not emit files every 30 seconds. Instead, it generates files at irregular intervals.

For example, it emits the first data after 4 minutes, at which point the 8 files that should have been created so far are all generated at once; the next batch arrives after 5 minutes, the next after 3 minutes, and so on. This makes local development extremely time-consuming and frustrating.
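One way to read these numbers, offered as a sketch rather than a confirmed diagnosis: with 30-second fixed windows, 4 minutes of data spans 8 windows, which matches 8 files appearing at once. The snippet below only illustrates that arithmetic. The class and method names are hypothetical, and aligning windows to the epoch with zero offset is an assumption.

```java
import java.time.Duration;

public class FixedWindowMath {
    // Start of the fixed window containing the timestamp.
    // Assumption: windows are aligned to the epoch (offset 0).
    public static long windowStartMillis(long timestampMillis, Duration size) {
        long sizeMillis = size.toMillis();
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Number of whole windows that fit into an elapsed time span.
    public static long completedWindows(Duration elapsed, Duration size) {
        return elapsed.toMillis() / size.toMillis();
    }

    public static void main(String[] args) {
        Duration size = Duration.ofSeconds(30);
        // An element at t = 95 s falls into the window starting at 90 s.
        System.out.println(windowStartMillis(95_000L, size));               // prints 90000
        // Four minutes of data cover eight 30-second windows.
        System.out.println(completedWindows(Duration.ofMinutes(4), size));  // prints 8
    }
}
```

If each window produces one output file, a runner that holds results back for 4 minutes would release 8 files together, which is consistent with what I see.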

Why am I observing this?

None of the following helped: switching Java from 8 to 11, switching the Beam SDK from 2.8.0 to 2.9.0 or 2.10.0, moving from my local machine to a GCE instance, or changing the pipeline output from GCS to the local filesystem.

Here is everything needed to reproduce the problem:

  1. git clone https://github.com/GoogleCloudPlatform/DataflowTemplates
  2. Remove the <scope>test</scope> line for beam-runners-direct-java from pom.xml so that DirectRunner is available at runtime.
  3. Compile and run the program as described in https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/PubsubToText.java , but change the runner to DirectRunner and add --outputShardTemplate=WP-SS-of-NN , an option that the instructions omit but that is required when running locally.
  4. Remove the --project , --stagingLocation , and --tempLocation lines as well, since the pipeline will not be deployed to the cloud.
  5. Files take an extremely long time to be emitted, even though I set, for example, windowDuration=30s

I suspected a Pub/Sub-related problem, but when I run tcpdump, I can see the pipeline connect to Pub/Sub and start pulling data immediately. It is likely to be a DirectRunner-specific issue.

While I don't know why this happens, I found a fix. DataflowRunner does not require you to set a trigger to behave as expected, but you must specify an explicit trigger for DirectRunner .

After appending .triggering(...) to the Window.into(...) transform, the problem goes away.
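A minimal sketch of what that could look like, assuming the Beam Java SDK and Joda-Time are on the classpath. The answer does not say which trigger was used, so the repeated processing-time trigger, the zero allowed lateness, and the discarding accumulation mode below are assumptions, and the class and method names are hypothetical.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowWithTrigger {
    // Hypothetical helper: windows the messages and sets an explicit trigger.
    public static PCollection<String> applyWindow(
            PCollection<String> messages, Duration windowDuration) {
        return messages.apply(
            "FixedWindowWithTrigger",
            Window.<String>into(FixedWindows.of(windowDuration))
                // Assumption: fire repeatedly, one window duration of
                // processing time after the first element in each pane.
                .triggering(
                    Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(windowDuration)))
                // Once a trigger is set on a non-global window, Beam also
                // expects an allowed lateness and an accumulation mode.
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());
    }
}
```

For a 30-second window, this would be called as applyWindow(messages, Duration.standardSeconds(30)) in place of the plain Window.into(...) step. Other triggers may work equally well; this one was chosen only because it fires on processing time and so does not depend on how the runner advances the watermark.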
