How to use aggregateField() over multiple columns in Apache Beam Java SDK?

Question

In Apache Beam Python SDK, it is possible to perform the following:

input
| GroupBy(account=lambda s: s["account"])
.aggregate_field(lambda x: x["wordsAddup"] - x["wordsSubtract"], sum, 'wordsRead')

How do we perform a similar action in the Java SDK? Strangely, the programming guide has only examples in Python for this transform.

Here is my attempt at producing the equivalent in Java:

input.apply(
Group.byFieldNames("account")
.aggregateField(<INSERT EQUIVALENT HERE>, Sum.ofIntegers(), "wordsRead"));

Answer 1

There are some Java examples at https://beam.apache.org/documentation/programming-guide/#using-schemas . (Note you may have to select the java tab on a selector that has both Java and Python to see them.)

In Java I don't think the first argument of aggregateField can take an arbitrary expression; it must be a field name. You can proceed the grouping operation with a projection that adds a new field for the desired expression. For example

input
    .apply(SqlTransform.query(
        "SELECT *, wordsAddup - wordsSubtract AS wordsDiff from PCOLLECTION")
    .apply(Group.byFieldNames("account")
        .aggregateField("wordsDiff", Sum.ofIntegers(), "wordsRead"));

How to use aggregateField() over multiple columns in Apache Beam Java SDK?

Question

1 answers

solution1
1 ACCPTED 2020-12-11 23:07:12

How to use aggregateField() over multiple columns in Apache Beam Java SDK?

Question

1 answers

solution1 1 ACCPTED 2020-12-11 23:07:12

solution1
1 ACCPTED 2020-12-11 23:07:12