How to create PCollection<Row> from PCollection<String> for performing Beam SQL transforms
I am trying to implement a data pipeline that joins multiple unbounded sources from Kafka topics. I am able to connect to a topic and get the data as PCollection<String>, and I need to convert it into PCollection<Row>. I am splitting the comma-delimited string into an array and using a schema to convert it to a Row. But how can I build the schema and bind values to it dynamically?
Even if I create a separate class for schema building, is there a way to bind the string array directly to the schema?
Below is my current working code. It is static, has to be rewritten every time I build a pipeline, and grows with the number of fields.
final Schema sch1 =
    Schema.builder().addStringField("name").addInt32Field("age").build();

PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
    .apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("testin1")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .updateConsumerProperties(
                ImmutableMap.of("group.id", (Object)"test1")));

PCollection<Row> Input1 = kafkaDataIn1.apply(
    ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
        @ProcessElement
        public void processElement(
                ProcessContext processContext,
                final OutputReceiver<Row> emitter) {
            KafkaRecord<Long, String> record = processContext.element();
            final String input = record.getKV().getValue();
            final String[] parts = input.split(",");
            emitter.output(
                Row.withSchema(sch1)
                    .addValues(
                        parts[0],
                        Integer.parseInt(parts[1])).build());
        }}))
    .apply("window",
        Window.<Row>into(FixedWindows.of(Duration.standardSeconds(50)))
            .triggering(AfterWatermark.pastEndOfWindow())
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());

Input1.setRowSchema(sch1);
My expectation is to achieve the same thing as the code above, but in a dynamic, reusable way.
The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports it; Java serialization or JSON are examples.
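Even with the schema fixed per PCollection, the binding of the string array to that schema does not have to be written by hand for every pipeline. A minimal sketch, assuming one comma-separated value per schema field, in field order: the helper walks the schema and converts each part according to the field's type. The names CsvToRow, convert, toRow and CsvToRowFn are hypothetical, and only a handful of primitive types are handled here.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

// Hypothetical helper, not part of Beam: binds a String[] to any flat Schema.
final class CsvToRow {

  /** Converts one raw value to the Java type the schema field expects. */
  public static Object convert(Schema.FieldType type, String raw) {
    switch (type.getTypeName()) {
      case STRING:
        return raw;
      case INT32:
        return Integer.parseInt(raw.trim());
      case INT64:
        return Long.parseLong(raw.trim());
      case DOUBLE:
        return Double.parseDouble(raw.trim());
      case BOOLEAN:
        return Boolean.parseBoolean(raw.trim());
      default:
        // Assumption: nested, array and date types are out of scope for this sketch.
        throw new IllegalArgumentException(
            "Unsupported field type: " + type.getTypeName());
    }
  }

  /** Builds a Row by pairing parts[i] with the i-th field of the schema. */
  public static Row toRow(Schema schema, String[] parts) {
    if (parts.length != schema.getFieldCount()) {
      throw new IllegalArgumentException(
          "Expected " + schema.getFieldCount() + " values but got " + parts.length);
    }
    List<Object> values = new ArrayList<>();
    for (int i = 0; i < parts.length; i++) {
      values.add(convert(schema.getField(i).getType(), parts[i]));
    }
    return Row.withSchema(schema).addValues(values).build();
  }

  /** Reusable DoFn: the schema is passed in once, when the pipeline is built. */
  public static class CsvToRowFn extends DoFn<String, Row> {
    private final Schema schema;

    public CsvToRowFn(Schema schema) {
      this.schema = schema;
    }

    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<Row> out) {
      // Limit -1 keeps trailing empty fields so the count check stays meaningful.
      out.output(toRow(schema, line.split(",", -1)));
    }
  }
}
```

With something like this, each pipeline would only need to build its Schema and apply ParDo.of(new CsvToRow.CsvToRowFn(schema)) to a PCollection<String>, followed by setRowSchema(schema) on the result. The schema is still static for that PCollection; only the conversion code is shared.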
That said, to benefit from the SQL feature you can also use a static schema made of the fields you query plus other fields. This way the static part lets you run your SQL and you don't lose the additional data.
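One possible reading of this suggestion, as a rough sketch: keep the fields you query as typed columns and carry everything else in a single catch-all string column. The class name StaticSchemaWithExtras and the column name "extras" are hypothetical, and the input is assumed to start with name and age followed by any number of further comma-separated values.

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

// Hypothetical example: fixed queryable fields plus one catch-all column.
final class StaticSchemaWithExtras {

  public static final Schema SCHEMA =
      Schema.builder()
          .addStringField("name")
          .addInt32Field("age")
          .addStringField("extras") // assumption: holds the unparsed remainder
          .build();

  public static Row toRow(String line) {
    // Limit 3: the third element keeps everything after the second comma as-is.
    String[] parts = line.split(",", 3);
    if (parts.length < 2) {
      throw new IllegalArgumentException("Expected at least name and age: " + line);
    }
    String extras = parts.length == 3 ? parts[2] : "";
    return Row.withSchema(SCHEMA)
        .addValues(parts[0], Integer.parseInt(parts[1].trim()), extras)
        .build();
  }
}
```

SQL can then filter or join on name and age, while the remaining data travels along in extras and can be parsed later by whichever step needs it.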
Romain