How to use aggregateField() on multiple columns in the Apache Beam Java SDK?

Question

In the Apache Beam Python SDK, it is possible to do the following:

(input
 | GroupBy(account=lambda s: s["account"])
     .aggregate_field(lambda x: x["wordsAddup"] - x["wordsSubtract"], sum, 'wordsRead'))

How can we do something similar in the Java SDK? Strangely, the programming guide only has Python examples for this transform.

This is my attempt at producing the equivalent in Java:

input.apply(
    Group.byFieldNames("account")
        .aggregateField(<INSERT EQUIVALENT HERE>, Sum.ofIntegers(), "wordsRead"));

Answer 1

There are some Java examples at https://beam.apache.org/documentation/programming-guide/#using-schemas. (Note that you may have to select the Java tab on a selector that offers both Java and Python to see them.)

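In Java, I don't think the first argument of aggregateField can take an arbitrary expression; it must be a field name. You can precede the grouping operation with a projection that adds a new field for the desired expression. For example: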

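input
    // Projection: add a wordsDiff field holding the desired expression.
    .apply(SqlTransform.query(
        "SELECT *, wordsAddup - wordsSubtract AS wordsDiff FROM PCOLLECTION"))
    // Grouping: sum the new field per account.
    .apply(Group.byFieldNames("account")
        .aggregateField("wordsDiff", Sum.ofIntegers(), "wordsRead"));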
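
To make clear what the projection followed by the grouping is meant to compute, here is a minimal plain-Java sketch that does not use Beam at all. The class name WordsReadSketch, the Reading type, and the sample values are hypothetical and only mirror the field names from the question; it illustrates the intended result, not how Beam runs it.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordsReadSketch {
    // Hypothetical row type with the three fields used in the question.
    static final class Reading {
        final String account;
        final int wordsAddup;
        final int wordsSubtract;

        Reading(String account, int wordsAddup, int wordsSubtract) {
            this.account = account;
            this.wordsAddup = wordsAddup;
            this.wordsSubtract = wordsSubtract;
        }
    }

    // Groups rows by account and sums (wordsAddup - wordsSubtract) per group.
    // The lambda passed to summingInt plays the role of the wordsDiff projection;
    // the groupingBy collector plays the role of Group.byFieldNames("account").
    static Map<String, Integer> wordsReadByAccount(List<Reading> rows) {
        return rows.stream()
            .collect(Collectors.groupingBy(
                r -> r.account,
                TreeMap::new, // sorted keys, so the printed order is stable
                Collectors.summingInt(r -> r.wordsAddup - r.wordsSubtract)));
    }

    public static void main(String[] args) {
        List<Reading> rows = List.of(
            new Reading("a", 10, 3),
            new Reading("a", 5, 1),
            new Reading("b", 7, 7));
        System.out.println(wordsReadByAccount(rows)); // prints {a=11, b=0}
    }
}
```

In the Beam version, each account's output row should carry the same sum in its wordsRead field.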

Answer 1 was accepted on 2020-12-11 23:07:12.
