How to extract information from PCollection<Row> after a join in apache beam?

I have two example streams of data on which I perform an innerJoin. I would like to extend this example join code and add some logic after the join occurs:

public class JoinExample {

  public static void main(String[] args) {
    final Pipeline pipeline = Pipeline.create(pipelineOpts);

    PCollection<Row> adStream =
        pipeline
            .apply(From.source("kafka.adStream"))
            .apply(Select.fieldNames("ad.id", "ad.name"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    PCollection<Row> clickStream =
        pipeline
            .apply(From.source("kafka.clickStream"))
            .apply(Select.fieldNames("ad.id", "numClicks"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    adStream
        .apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
        .apply(ConsoleOutput.of(Row::toString)); // Instead of this output, I would like to just print the ad name and num clicks after the join

    pipeline.run();
  }
}

I would like to just print the ad name and num clicks after the join, using a DoFn like this:

 adStream
    .apply(Join.<Row, Row>innerJoin(clickStream).using("id"))
    .apply(ParDo.of(new DoFn<Row, Integer>() {

      @ProcessElement
      public void processElement(ProcessContext c) {
        // Since there are two rows after the join, how can I get info from each row?
        // Example in:
        //    ad.id = 1, ad.name = test
        //    ad.id = 1, numClicks = 1000

        // After join
        // Row: [Row:[1, test], Row:[1, 1000]]

        // I tried this statement but it is incorrect
        Row one = c.element().getRow(0);  // This API is not available
      }
    }));

Any ideas on how to extract this info from the joined data?

As you learned, the Schema Join method emulates the SQL join, in which the result of the join is the concatenation of the rows from the joined PCollections. In order to see which rows went into the inner join, you have to use the CoGroup utility to join the PCollections. This returns a Row object with an individual iterable for each of the joined PCollections, containing the Rows that match the key. Example:


import org.apache.beam.sdk.schemas.transforms.CoGroup;
import org.apache.beam.sdk.schemas.transforms.CoGroup.By;
import org.apache.beam.sdk.values.PCollectionTuple;

public class JoinExample {

  public static void main(String[] args) {
    final Pipeline pipeline = Pipeline.create(pipelineOpts);

    PCollection<Row> adStream =
        pipeline
            .apply(From.source("kafka.adStream"))
            .apply(Select.fieldNames("ad.id", "ad.name"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    PCollection<Row> clickStream =
        pipeline
            .apply(From.source("kafka.clickStream"))
            .apply(Select.fieldNames("ad.id", "numClicks"))
            .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

    // The names given here for the PCollections can be used to retrieve
    // the rows in the consuming PTransform. See below:
    PCollectionTuple.of("adStream", adStream, "clickStream", clickStream)
      // This selects the common field name in both adStream and clickStream
      // to join on. See the documentation for ways of joining on
      // different keys.
      .apply(CoGroup.join(By.fieldNames("id")))
      .apply(ParDo.of(new DoFn<Row, Void>() {

        @ProcessElement
        public void processElement(ProcessContext c) {

          // Get the join key (a Row holding the fields joined on).
          Row key = c.element().getRow("key");

          // Get the rows from the adStream and clickStream PCollections
          // that share the same id.
          Iterable<Row> adRows = c.element().getIterable("adStream");
          Iterable<Row> clickRows = c.element().getIterable("clickStream");
        }
      }));

    pipeline.run();
  }
}
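
To answer the original question of printing the ad name and the number of clicks, you can simply walk the two iterables inside that DoFn. Below is a minimal sketch of the ParDo step, assuming the selected fields come through on the rows as "name" (a string) and "numClicks" (an int64); adjust the getters to the actual field names and types in your schema.

      // Sketch only: a drop-in replacement for the ParDo in the example above.
      // The field names "name" and "numClicks" are assumptions about the schema
      // produced by the Select transforms; adjust them to your actual schema.
      .apply(ParDo.of(new DoFn<Row, Void>() {

        @ProcessElement
        public void processElement(ProcessContext c) {
          Iterable<Row> adRows = c.element().getIterable("adStream");
          Iterable<Row> clickRows = c.element().getIterable("clickStream");

          // Every ad row is paired with every click row that shares the same id.
          for (Row ad : adRows) {
            for (Row click : clickRows) {
              System.out.println(
                  ad.getString("name") + " -> " + click.getInt64("numClicks"));
            }
          }
        }
      }));

For anything beyond console debugging you would emit a new Row (or a POJO) from the DoFn instead of printing, but the field access pattern is the same.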

