Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.

The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this code would do.

The complete code is:

      val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
        .applyTransform(ParDo.of(new Translator()))
        .map(row => row.mkString("|"))
        // produce one global window with one pane per ~500 records
        .withGlobalWindow(WindowOptions(
          trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
          accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
        ))

      val out = TextIO
        .write()
        .to("gs://test-bucket/staging")
        .withSuffix(".txt")
        .withNumShards(1)
        .withShardNameTemplate("-P-S")
        .withWindowedWrites() // gets us one file per window & pane
      pipe.saveAsCustomOutput("writer",out)

I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin, PCollection>, and a single call to processElement outputs the entire SQL query result:

    public void processElement(ProcessContext context) throws Exception {
      try (PreparedStatement statement =
          connection.prepareStatement(
              query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        statement.setFetchSize(fetchSize);
        parameterSetter.setParameters(context.element(), statement);
        try (ResultSet resultSet = statement.executeQuery()) {
          while (resultSet.next()) {
            context.output(rowMapper.mapRow(resultSet));
          }
        }
      }
    }

In the end, I had two problems to resolve: 1. The process would run out of memory, and 2. the data was written to a single file.

There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver loads the entire result set within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
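For reference, the key difference in the custom source is how the statement is configured. With MySQL Connector/J, streaming is requested by creating a forward-only, read-only statement and passing the sentinel value Integer.MIN_VALUE to setFetchSize. A minimal sketch (the helper name is my own; this is plain JDBC, not Beam code):

```scala
import java.sql.{Connection, ResultSet, Statement}

// Connector/J's sentinel fetch size that requests row-by-row streaming
// instead of buffering the whole result set in memory.
val StreamingFetchSize: Int = Integer.MIN_VALUE

// Hypothetical helper: build a statement configured for streaming.
def streamingStatement(conn: Connection): Statement = {
  val stmt = conn.createStatement(
    ResultSet.TYPE_FORWARD_ONLY, // required for streaming
    ResultSet.CONCUR_READ_ONLY)  // required for streaming
  stmt.setFetchSize(StreamingFetchSize)
  stmt
}
```

Rows can then be consumed one at a time from the ResultSet returned by executeQuery without the driver holding the full result in memory.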

For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
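The mechanics of that trick can be sketched without Beam: assign row n a synthetic timestamp of n milliseconds, and fixed windows of 500 ms will then contain exactly 500 consecutive rows each. The names and chunk size below are my own, for illustration only:

```scala
// Sketch of the timestamp trick: row n gets synthetic timestamp n (ms).
// FixedWindows of Duration.millis(chunkSize) then places exactly
// chunkSize consecutive rows in each window. Computed here without Beam.
val chunkSize = 500L

def syntheticTimestamp(rowNumber: Long): Long = rowNumber

def windowIndex(timestampMillis: Long): Long = timestampMillis / chunkSize

// rows 0..499 land in window 0, rows 500..999 in window 1, and so on
```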

elementCountAtLeast is only a lower bound, so producing a single pane is a valid choice for a runner to make.

You have a couple of options when doing this for a batch pipeline:

  1. Allow the runner to decide how big the files are and how many shards are written:
      val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
        .applyTransform(ParDo.of(new Translator()))
        .map(row => row.mkString("|"))

      val out = TextIO
        .write()
        .to("gs://test-bucket/staging")
        .withSuffix(".txt")
      pipe.saveAsCustomOutput("writer",out)

This is typically the fastest option when TextIO is preceded by a GroupByKey or by a source that supports splitting. To my knowledge, JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which enables parallel processing after the data has been read from the database.

  2. Manually group into batches using the GroupIntoBatches transform.
      val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
        .applyTransform(ParDo.of(new Translator()))
        .map(row => row.mkString("|"))
        .apply(GroupIntoBatches.ofSize(500))

      val out = TextIO
        .write()
        .to("gs://test-bucket/staging")
        .withSuffix(".txt")
        .withNumShards(1)
      pipe.saveAsCustomOutput("writer",out)

In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
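The batching behaviour of GroupIntoBatches can be illustrated in plain Scala: within each key, values are collected into batches of at most the requested size. This sketch ignores windowing, and the helper name is my own:

```scala
// Pure-Scala illustration of GroupIntoBatches semantics: within each key,
// values are emitted in batches of at most `batchSize` elements; the last
// batch per key may be smaller.
val batchSize = 500

def intoBatches[K, V](pairs: Seq[(K, V)]): Map[K, Seq[Seq[V]]] =
  pairs
    .groupBy(_._1)
    .map { case (k, kvs) =>
      k -> kvs.map(_._2).grouped(batchSize).map(_.toSeq).toSeq
    }
```

Note that in Beam, GroupIntoBatches operates on keyed elements (KV pairs), so the strings produced by the map above would need a key attached before batching.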

There are a few other ways to do this, each with its own pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.
