How to create a PCollection<Row> from a PCollection<String> for running Beam SQL transforms

Question

I am trying to implement a data pipeline that joins multiple unbounded sources from Kafka topics. I am able to connect to a topic and read the data as a PCollection<String>, and I need to convert it into a PCollection<Row>. I split the comma-delimited string into an array and use a schema to convert it into a Row. But how can I build the schema and bind values to it dynamically?

Even if I create a separate class for building the schema, is there a way to bind the string array directly to the schema?

Below is my current working code. It is static, has to be rewritten every time I build a pipeline, and grows with the number of fields.

final Schema sch1 =
  Schema.builder().addStringField("name").addInt32Field("age").build();

PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
  .apply(
    KafkaIO.<Long, String>read()
      .withBootstrapServers("localhost:9092")
      .withTopic("testin1")
      .withKeyDeserializer(LongDeserializer.class)
      .withValueDeserializer(StringDeserializer.class)
      .updateConsumerProperties(
         ImmutableMap.of("group.id", (Object)"test1")));

PCollection<Row> Input1 = kafkaDataIn1.apply(
  ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
    @ProcessElement
    public void processElement(
        ProcessContext processContext,
        final OutputReceiver<Row> emitter) {

          KafkaRecord<Long, String> record = processContext.element();
          final String input = record.getKV().getValue();

          final String[] parts = input.split(",");

          emitter.output(
            Row.withSchema(sch1)
               .addValues(
                   parts[0],
                   Integer.parseInt(parts[1])).build());
        }}))
  .apply("window",
     Window.<Row>into(FixedWindows.of(Duration.standardSeconds(50)))
       .triggering(AfterWatermark.pastEndOfWindow())
       .withAllowedLateness(Duration.ZERO)
       .accumulatingFiredPanes());

Input1.setRowSchema(sch1);

My expectation is to achieve the same thing as the code above in a dynamic, reusable way.

Answer 1

The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports that; Java serialization and JSON are examples.
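Even with a schema that is fixed per PCollection, the conversion code itself does not have to be rewritten for every pipeline. The following is a minimal sketch of schema-driven parsing in plain Java, not Beam code: the class `CsvRowValues`, its `FieldType` enum, and the `parse` method are hypothetical names used for illustration, with the enum standing in for Beam's schema field types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns one delimited line into typed values, driven by a field list. */
final class CsvRowValues {

  /** Stand-in for the schema field types used in the question (assumption, not Beam's own enum). */
  enum FieldType { STRING, INT32, INT64, DOUBLE, BOOLEAN }

  /** Parses a single raw token according to the declared type of its field. */
  static Object parseValue(FieldType type, String raw) {
    String s = raw.trim();
    switch (type) {
      case STRING:  return s;
      case INT32:   return Integer.parseInt(s);
      case INT64:   return Long.parseLong(s);
      case DOUBLE:  return Double.parseDouble(s);
      case BOOLEAN: return Boolean.parseBoolean(s);
      default: throw new IllegalArgumentException("Unsupported type: " + type);
    }
  }

  /**
   * Splits a comma-delimited line and converts each token using the field types,
   * in the iteration order of the map (use a LinkedHashMap to keep declaration order).
   */
  static List<Object> parse(Map<String, FieldType> fields, String line) {
    String[] parts = line.split(",", -1);
    if (parts.length != fields.size()) {
      throw new IllegalArgumentException(
          "Expected " + fields.size() + " fields but got " + parts.length);
    }
    List<Object> values = new ArrayList<>(parts.length);
    int i = 0;
    for (FieldType type : fields.values()) {
      values.add(parseValue(type, parts[i++]));
    }
    return values;
  }
}
```

In a Beam pipeline the same loop could presumably iterate over `schema.getFields()`, switch on each field's type name, and pass the resulting list to `Row.withSchema(schema).addValues(values).build()` inside one reusable DoFn that takes the schema as a constructor argument.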

That said, to benefit from the SQL feature you can also use a static schema that holds the fields you query plus the other fields. This way the static part lets you run your SQL and you don't lose the additional data.
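One possible reading of this suggestion, sketched in plain Java under stated assumptions: keep the first few columns as individually typed query fields and carry everything else in a single trailing string field. The class `QueryFieldsPlusRest` and its `split` method are hypothetical names, and the catch-all column is an illustration rather than something the answer spells out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: keeps N leading query fields and packs the remainder into one field. */
final class QueryFieldsPlusRest {

  /**
   * Returns queryFieldCount trimmed tokens followed by one extra element holding
   * the untouched rest of the line (empty string when nothing is left over).
   */
  static List<String> split(String line, int queryFieldCount) {
    // A limit of N + 1 leaves everything after the Nth comma in the last token.
    String[] parts = line.split(",", queryFieldCount + 1);
    if (parts.length < queryFieldCount) {
      throw new IllegalArgumentException(
          "Expected at least " + queryFieldCount + " fields but got " + parts.length);
    }
    List<String> out = new ArrayList<>(queryFieldCount + 1);
    for (int i = 0; i < queryFieldCount; i++) {
      out.add(parts[i].trim());
    }
    out.add(parts.length > queryFieldCount ? parts[queryFieldCount] : "");
    return out;
  }
}
```

With the question's schema this would correspond to `name`, `age`, and one additional string field (for example `extra`): SQL queries touch only the first two, while the third carries the remaining data through the pipeline unchanged.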

Romain

Answer 1 (accepted, 2019-07-07 11:52:57)