
How to create PCollection&lt;Row&gt; from PCollection&lt;String&gt; for performing Beam SQL Transforms

I am trying to implement a data pipeline that joins multiple unbounded sources from Kafka topics. I am able to connect to a topic and get the data as a PCollection&lt;String&gt;, and I need to convert it into a PCollection&lt;Row&gt;. I split the comma-delimited string into an array and use a schema to convert it into a Row. But how do I implement/build the schema and bind values to it dynamically?

Even if I create a separate class for schema building, is there a way to bind the string array directly to the schema?

Below is my current working code. It is static, has to be rewritten every time I build a pipeline, and grows with the number of fields as well.

final Schema sch1 =
    Schema.builder().addStringField("name").addInt32Field("age").build();

PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
  .apply(
    KafkaIO.<Long, String>read()
      .withBootstrapServers("localhost:9092")
      .withTopic("testin1")
      .withKeyDeserializer(LongDeserializer.class)
      .withValueDeserializer(StringDeserializer.class)
      .updateConsumerProperties(
         ImmutableMap.of("group.id", (Object)"test1")));

PCollection<Row> Input1 = kafkaDataIn1.apply(
  ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
    @ProcessElement
    public void processElement(
        ProcessContext processContext,
        final OutputReceiver<Row> emitter) {

          KafkaRecord<Long, String> record = processContext.element();
          final String input = record.getKV().getValue();

          final String[] parts = input.split(",");

          emitter.output(
            Row.withSchema(sch1)
               .addValues(
                   parts[0],
                   Integer.parseInt(parts[1])).build());
        }}))
  .apply("window",
     Window.<Row>into(FixedWindows.of(Duration.standardSeconds(50)))
       .triggering(AfterWatermark.pastEndOfWindow())
       .withAllowedLateness(Duration.ZERO)
       .accumulatingFiredPanes());

Input1.setRowSchema(sch1);

My expectation is to achieve the same thing as the code above in a dynamic/reusable way.
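One way to make the parsing reusable (a sketch I am adding, not part of the original post) is to drive the CSV-to-value coercion from a list of type names that mirrors the schema's field order, instead of hard-coding `parts[0]`, `parts[1]`, ... per pipeline. The helper below is plain Java with a hypothetical name; the type-name strings match Beam's `Schema.TypeName` enum constants (`STRING`, `INT32`, `INT64`, `DOUBLE`), so its output can feed `Row.withSchema(schema).addValues(values).build()` inside a single generic DoFn.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: coerce one comma-delimited line into typed values,
// driven by a list of type names in the same order as the schema's fields.
// The returned list can be passed to Row.withSchema(sch).addValues(values).build().
public class CsvRowParser {
    public static List<Object> parse(String line, List<String> typeNames) {
        String[] parts = line.split(",", -1); // -1 keeps trailing empty fields
        if (parts.length != typeNames.size()) {
            throw new IllegalArgumentException(
                "Expected " + typeNames.size() + " fields, got " + parts.length);
        }
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            String raw = parts[i].trim();
            switch (typeNames.get(i)) {
                case "INT32":  values.add(Integer.parseInt(raw));    break;
                case "INT64":  values.add(Long.parseLong(raw));      break;
                case "DOUBLE": values.add(Double.parseDouble(raw));  break;
                case "STRING": values.add(raw);                      break;
                default:
                    throw new IllegalArgumentException(
                        "Unsupported type: " + typeNames.get(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<Object> row = CsvRowParser.parse("alice,30",
            Arrays.asList("STRING", "INT32"));
        System.out.println(row); // [alice, 30]
    }
}
```

In the pipeline, the type-name list can be derived once from the schema itself (for each `field` in `schema.getFields()`, take `field.getType().getTypeName().toString()`), so adding a field to the schema is the only change needed.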

The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports that; Java serialization or JSON are examples.

That said, to benefit from the SQL feature you can also use a static schema with the fields you query plus a field for everything else. This way the static part lets you run your SQL, and you don't lose the additional data.
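A minimal illustration of that idea (my sketch, with hypothetical field names): keep only the queried columns as typed fields, and collect the remainder of the line in one trailing string field, so extra columns survive without any schema change.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: parse only the columns the SQL query needs (name, age) and keep
// the rest of the line joined in a single catch-all "rest" field, so a
// static schema like name:STRING, age:INT32, rest:STRING never has to grow.
public class PartialParse {
    public static List<Object> parse(String line) {
        // Limit 3: split out two queried columns; remainder stays joined.
        String[] parts = line.split(",", 3);
        String name = parts[0];
        int age = Integer.parseInt(parts[1]);
        String rest = parts.length > 2 ? parts[2] : "";
        return Arrays.asList(name, age, rest);
    }

    public static void main(String[] args) {
        System.out.println(PartialParse.parse("bob,42,extra1,extra2"));
        // [bob, 42, extra1,extra2]
    }
}
```

The "rest" field rides along through the SQL transform untouched and can be re-split downstream if a later stage needs those columns.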

Romain

