
Flink - Convert Avro datastream to table

I have messages in Avro format in Kafka. These have to be converted to a table, selected with SQL, converted back to a stream, and finally written to a sink. There are multiple Kafka topics with different Avro schemas, so dynamic tables are required.

Here is the code I am using:

StreamExecutionEnvironment env = ...;
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

FlinkKafkaConsumer<MyAvroClass> kafkaConsumer = ...;
var kafkaInputStream = env.addSource(kafkaConsumer, "kafkaInput");

Table table = tableEnv.fromDataStream(kafkaInputStream);
tableEnv.executeSql("DESCRIBE " + table).print();
...

MyAvroClass is an Avro class that extends SpecificRecordBase and contains an array. Here is the code for this class:

public class MyAvroClass extends SpecificRecordBase implements SpecificRecord {
  // avro fields
  private String event_id;
  private User user;
  private List<Item> items; 
  
  // getter, setters, constructors, builders, ...
}

I am unable to access the elements of the items field. When I print the table description, I see that items has type ANY:

+------------+-------------------------------------------------------------+------+-----+--------+-----------+
|       name |                                                        type | null | key | extras | watermark |
+------------+-------------------------------------------------------------+------+-----+--------+-----------+
|   event_id |                                                      STRING | true |     |        |           |
|      items |                        LEGACY('RAW', 'ANY<java.util.List>') | true |     |        |           |
|       user |  LEGACY('STRUCTURED_TYPE', 'POJO<com.company.events.User>') | true |     |        |           |
+------------+-------------------------------------------------------------+------+-----+--------+-----------+  

How can I convert it to a type that lets me query the elements of items? Thanks in advance.

I'm currently using this method for that purpose:

public static <T extends SpecificRecord> Table toTable(StreamTableEnvironment tEnv,
                                                       DataStream<T> dataStream,
                                                       Class<T> cls) {
  RichMapFunction<T, Row> avroSpecific2RowConverter = new RichMapFunction<>() {
    private transient AvroSerializationSchema<T> avro2bin = null;
    private transient AvroRowDeserializationSchema bin2row = null;

    @Override
    public void open(Configuration parameters) throws Exception {
      avro2bin = AvroSerializationSchema.forSpecific(cls);
      bin2row = new AvroRowDeserializationSchema(cls);
    }

    @Override
    public Row map(T value) throws Exception {
      byte[] bytes = avro2bin.serialize(value);
      Row row = bin2row.deserialize(bytes);
      return row;
    }
  };

  SingleOutputStreamOperator<Row> rows = dataStream.map(avroSpecific2RowConverter)
    // https://issues.apache.org/jira/browse/FLINK-23885
    .returns(AvroSchemaConverter.convertToTypeInfo(cls));

  return tEnv.fromDataStream(rows);
}
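For context, this is roughly how the helper above can be wired in; MyAvroClass and the "events" view name are just the placeholders from the question, used for illustration:

DataStream<MyAvroClass> kafkaInputStream = env.addSource(kafkaConsumer, "kafkaInput");

// The Row stream now carries Avro-derived type information, so items
// resolves to a typed ARRAY instead of LEGACY('RAW', 'ANY<java.util.List>').
Table table = toTable(tableEnv, kafkaInputStream, MyAvroClass.class);
tableEnv.createTemporaryView("events", table);
tableEnv.executeSql("DESCRIBE events").print();

// Example query against the array column, e.g. counting its elements.
Table counts = tableEnv.sqlQuery(
    "SELECT event_id, CARDINALITY(items) AS item_count FROM events");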

I am experiencing a similar problem, where Flink Table's type inference failed to ingest java.util.List or java.util.Map, even though they are officially supported. I found a workaround (read: HACK) I'd like to share.

Step 1: When mapping your data to a POJO, stick with fields you KNOW will map correctly. In my case I had a Map<String, String> that was failing and resolving to LEGACY('RAW', ANY<java.util.Map>). I joined it into a single String (e.g., comma-separated entries, where each entry is 'key:value'), as sketched below.
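For reference, this is roughly how the map can be flattened on the POJO side before handing the stream to the Table API; the class and method names are purely illustrative:

import java.util.Map;
import java.util.stream.Collectors;

public class TagsFlattener {
  // Illustrative helper: joins a Map<String, String> into the single
  // "key1:value1,key2:value2" string that Step 1 describes, so the POJO
  // only exposes a plain String field to Flink's type extraction.
  public static String flatten(Map<String, String> tags) {
    return tags.entrySet().stream()
        .map(e -> e.getKey().trim() + ":" + e.getValue().trim())
        .collect(Collectors.joining(","));
  }
}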

Step 2: For your input data stream, make sure to transform it into DataStream[MY_POJO_TYPE].

Step 3: Go ahead and do Table table = tableEnv.fromDataStream(kafkaInputStream); as usual.

Step 4: Perform another transform on the table with a ScalarFunction. In my case, I wrote a user-defined scalar function that takes the String and outputs a Map<String, String>. Strangely enough, when the Map is produced AFTER the data is already in the Table abstraction, Flink is able to resolve it properly into the Flink MAP type.

Here's a rough example of what the user-defined scalar function looks like (in Java):

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.flink.table.functions.ScalarFunction;

public class TagsMapTypeScalarFunction extends ScalarFunction {

  // See
  // https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#scalar-functions
  // for a reference implementation and how interfacing with ScalarFunction works.
  public Map<String, String> eval(String s) {
    // Input is comma-delimited key:value pairs.
    return Arrays.stream(s.split(","))
        .filter(kv -> !kv.isEmpty())
        .map(kv -> kv.split(":"))
        .filter(pair -> pair.length == 2)
        .filter(pair -> Arrays.stream(pair).allMatch(token -> !token.isEmpty()))
        .collect(Collectors.toMap(pair -> pair[0].trim(), pair -> pair[1].trim()));
  }
}

Here's roughly what the invocation looks like (in Scala):

// This table has a field "tags", which is the comma-delimited, key:value squished string.
val transformedTable = tableEnv.fromDataStream(kafkaInputStream: DataStream[POJO])

tableEnv.createTemporaryFunction(
  "TagsMapTypeScalarFunction",
  classOf[TagsMapTypeScalarFunction]
)

val anotherTransform =
  transformedTable
    .select($"*", call("TagsMapTypeScalarFunction", $"tags").as("replace_tags"))
    .dropColumns($"tags")
    .renameColumns($"replace_tags".as("tags"))

anotherTransform

It certainly is a bit of "busy" work converting from a map to a string and back out to a map, but it beats being stuck.
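Once the tags column resolves to a proper MAP<STRING, STRING>, individual entries can be read with Flink SQL's map-access syntax. A minimal sketch in Java, assuming the transformed table from the Scala snippet above is available as anotherTransform and a hypothetical 'env' key exists in the map:

tableEnv.createTemporaryView("events", anotherTransform);

// Map entries are accessed with bracket syntax in Flink SQL.
Table prodEvents = tableEnv.sqlQuery(
    "SELECT * FROM events WHERE tags['env'] = 'prod'");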
