How to extract information from PCollection&lt;Row&gt; after a join in Apache Beam?
I have two sample data streams on which I perform an innerJoin. I would like to extend this sample join code and add some logic after the join happens:
public class JoinExample {
  public static void main(String[] args) {
    final Pipeline pipeline = Pipeline.create(pipelineOpts);

    PCollection<Row> adStream =
        pipeline
            .apply(From.source("kafka.adStream"))
            .apply(Select.fieldNames("ad.id", "ad.name"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    PCollection<Row> clickStream =
        pipeline
            .apply(From.source("kafka.clickStream"))
            .apply(Select.fieldNames("ad.id", "numClicks"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    adStream
        .apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
        // Instead of this output, I would like to just print
        // the ad name and numClicks after the join.
        .apply(ConsoleOutput.of(Row::toString));

    pipeline.run();
  }
}
I would like to print the ad name and number of clicks after the join using a DoFn like this:
adStream
    .apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
    .apply(ParDo.of(new DoFn<Row, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Since there are two rows after the join, how can I get info from each row?
        // Example in:
        //   ad.id = 1, ad.name = test
        //   ad.id = 1, numClicks = 1000
        // After join:
        //   Row: [Row:[1, test], Row:[1, 1000]]
        // I tried this statement but it is incorrect:
        Row one = c.element().getRow(0); // This API is not available
      }
    }));
Any ideas on how to extract this information from the joined data?
As you know, the Schema Join transform mimics a SQL join, where the result of the join is the concatenation of the rows from the joined PCollections. To see which rows went into the inner join, you have to use the CoGroup utility to join the PCollections instead. That returns a Row object containing a separate iterable of matching Rows for each of the joined PCollections. Example:
import org.apache.beam.sdk.schemas.transforms.CoGroup;
import org.apache.beam.sdk.schemas.transforms.CoGroup.By;
import org.apache.beam.sdk.values.PCollectionTuple;

public class JoinExample {
  public static void main(String[] args) {
    final Pipeline pipeline = Pipeline.create(pipelineOpts);

    PCollection<Row> adStream =
        pipeline
            .apply(From.source("kafka.adStream"))
            .apply(Select.fieldNames("ad.id", "ad.name"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    PCollection<Row> clickStream =
        pipeline
            .apply(From.source("kafka.clickStream"))
            .apply(Select.fieldNames("ad.id", "numClicks"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    // The names given here for the PCollections can be used to retrieve
    // the rows in the consuming PTransform. See below:
    PCollectionTuple.of("adStream", adStream).and("clickStream", clickStream)
        // This selects the common field name in both adStream and clickStream
        // to join on. See the documentation for ways of joining on
        // different keys.
        .apply(CoGroup.join(By.fieldNames("id")))
        .apply(ParDo.of(new DoFn<Row, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Get the join key.
            Row key = c.element().getRow("key");
            // Get the rows from the adStream and clickStream PCollections
            // that share the same id.
            Iterable<Row> ads = c.element().getIterable("adStream");
            Iterable<Row> clicks = c.element().getIterable("clickStream");
          }
        }));

    pipeline.run();
  }
}
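To actually print the ad name and click count, the DoFn above would loop over the two iterables for each key and pair them up. Since the full Beam pipeline cannot run standalone here, the pairing logic can be sketched in plain Java, with rows modeled as maps and all field names (`name`, `numClicks`) matching the schemas in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CoGroupShapeSketch {
  // Formats "adName: numClicks" lines from the per-input row groups that a
  // CoGroup join emits for one key. Rows are modeled here as plain maps;
  // in the real DoFn these would be the two Iterable<Row> values.
  static List<String> formatJoined(List<Map<String, Object>> ads,
                                   List<Map<String, Object>> clicks) {
    List<String> out = new ArrayList<>();
    for (Map<String, Object> ad : ads) {
      for (Map<String, Object> click : clicks) {
        out.add(ad.get("name") + ": " + click.get("numClicks"));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hypothetical element shape after CoGroup.join(By.fieldNames("id")):
    // one group of matching rows per joined PCollection.
    List<Map<String, Object>> ads = List.of(Map.of("id", 1, "name", "test"));
    List<Map<String, Object>> clicks = List.of(Map.of("id", 1, "numClicks", 1000));
    formatJoined(ads, clicks).forEach(System.out::println); // prints "test: 1000"
  }
}
```

In the real DoFn you would emit these strings with `c.output(...)` (and give the DoFn a `String` output type) rather than printing, so downstream transforms can consume them.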