
How to use aggregateField() over multiple columns in Apache Beam Java SDK?

In Apache Beam Python SDK, it is possible to perform the following:

input
| GroupBy(account=lambda s: s["account"])
.aggregate_field(lambda x: x["wordsAddup"] - x["wordsSubtract"], sum, 'wordsRead')

How do we perform a similar action in the Java SDK? Strangely, the programming guide has only examples in Python for this transform.

Here is my attempt at producing the equivalent in Java:

input.apply(
    Group.byFieldNames("account")
        .aggregateField(<INSERT EQUIVALENT HERE>, Sum.ofIntegers(), "wordsRead"));

There are some Java examples at https://beam.apache.org/documentation/programming-guide/#using-schemas. (Note that you may have to select the Java tab on the Java/Python language selector to see them.)

In Java, I don't think the first argument of aggregateField can be an arbitrary expression; it must be a field name. You can precede the grouping operation with a projection that adds a new field for the desired expression. For example:

input
    .apply(SqlTransform.query(
        "SELECT *, wordsAddup - wordsSubtract AS wordsDiff FROM PCOLLECTION"))
    .apply(Group.byFieldNames("account")
        .aggregateField("wordsDiff", Sum.ofIntegers(), "wordsRead"));
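To make it concrete what the project-then-group pipeline above is meant to compute, here is a minimal plain-Java sketch with no Beam dependency. It is only an illustration of the semantics, not Beam code: the WordRow class and the wordsReadByAccount helper are hypothetical names invented for this example, and the sample data is made up.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordsReadSketch {

    // Hypothetical stand-in for a schema row with the three fields used above.
    public static final class WordRow {
        final String account;
        final int wordsAddup;
        final int wordsSubtract;

        public WordRow(String account, int wordsAddup, int wordsSubtract) {
            this.account = account;
            this.wordsAddup = wordsAddup;
            this.wordsSubtract = wordsSubtract;
        }
    }

    // Step 1 (projection): compute wordsAddup - wordsSubtract per row.
    // Step 2 (grouping): sum that difference per account.
    public static Map<String, Integer> wordsReadByAccount(List<WordRow> rows) {
        return rows.stream()
            .collect(Collectors.groupingBy(
                r -> r.account,
                TreeMap::new, // sorted keys, so the printed order is stable
                Collectors.summingInt(r -> r.wordsAddup - r.wordsSubtract)));
    }

    public static void main(String[] args) {
        List<WordRow> rows = List.of(
            new WordRow("alice", 10, 3),
            new WordRow("bob", 5, 1),
            new WordRow("alice", 4, 2));

        // prints {alice=9, bob=4}
        System.out.println(wordsReadByAccount(rows));
    }
}
```

In the Beam version, the SQL projection plays the role of the per-row subtraction and Group.byFieldNames(...).aggregateField(...) plays the role of the grouped sum.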
