
Flink: how to combine a custom POJO into another DataStream

I want to convert a DataStream<Row> into a DataStream<String> that includes the schema info.

Input:

args[0]: DataStream

{"fields":["China","Beijing"]}

args[1]: schema

message spark_schema {
  optional binary country (UTF8);
  optional binary city (UTF8);
}

Expected output:

{"country":"china", "city":"beijing"}

My code looks like this:

public DataStream<String> convert(DataStream<Row> source, MessageType messageType) {
    SingleOutputStreamOperator<String> dataWithSchema = source.map((MapFunction<Row, String>) row -> {
        JSONObject data = new JSONObject();
        // the field names are extracted inside the lambda, so the lambda
        // captures both `this` and `messageType`
        this.fields = messageType.getFields().stream().map(Type::getName).collect(Collectors.toList());
        for (int i = 0; i < fields.size(); i++) {
            data.put(fields.get(i), row.getField(i));
        }
        return data.toJSONString();
    });
    return dataWithSchema;
}

Exception:

Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: Object com.xxxx.ParquetDataSourceReader$$Lambda$64/1174881426@d78795 is not serializable
    at org.apache.flink.api.java.ClosureCleaner.ensureSerializable(ClosureCleaner.java:180)
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1823)
    at org.apache.flink.streaming.api.datastream.DataStream.clean(DataStream.java:188)
    at org.apache.flink.streaming.api.datastream.DataStream.map(DataStream.java:590)

But the code below works fine:

public DataStream<String> convert(DataStream<Row> source, MessageType messageType) {
    if (this.fields == null) {
        throw new RuntimeException("The schema of AbstractRowStreamReader is null");
    }

    // the field names are computed once, outside the lambda, into a local list
    List<String> field = messageType.getFields().stream().map(Type::getName).collect(Collectors.toList());
    SingleOutputStreamOperator<String> dataWithSchema = source.map((MapFunction<Row, String>) row -> {
        JSONObject data = new JSONObject();
        for (int i = 0; i < field.size(); i++) {
            data.put(field.get(i), row.getField(i));
        }
        return data.toJSONString();
    });
    return dataWithSchema;
}

How can a Flink map operator use an external complex POJO?

For Flink to distribute the code across tasks, the code needs to be completely Serializable. In your first example, it isn't; in the second, it is. In particular, Type::getName will generate a lambda that is not Serializable.

To get a lambda that is Serializable, you need to explicitly cast it to a serializable interface (e.g., Flink's MapFunction), or cast it with (Serializable & Function).
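A minimal, self-contained sketch of the intersection cast, using a plain java.util.function.Function rather than the original code:

import java.io.Serializable;
import java.util.function.Function;

public class SerializableLambda {
    public static void main(String[] args) {
        // A plain lambda's synthetic class does not implement Serializable.
        Function<String, String> plain = s -> s.trim();

        // The intersection cast forces the compiled lambda to also implement
        // Serializable, which closure cleaners such as Flink's will accept.
        Function<String, String> serializable =
                (Function<String, String> & Serializable) s -> s.trim();

        System.out.println(plain instanceof Serializable);        // false
        System.out.println(serializable instanceof Serializable); // true
    }
}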

Since the second version also avoids redundant computation, it is better in any case: convert will be executed only once, during job compilation, while DataStream#map is called for each record. If that is not clear, I recommend executing it in an IDE and playing around with breakpoints.
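A hedged sketch of where each piece runs; the println markers are illustrative additions, not part of the original code:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.types.Row;

public class ExecutionScopes {
    public DataStream<String> convert(DataStream<Row> source) {
        // Runs exactly once, on the client, while the job graph is assembled.
        System.out.println("convert(): graph-building time");
        return source.map((MapFunction<Row, String>) row -> {
            // Runs on a TaskManager, once for every record that flows through.
            return String.valueOf(row.getField(0));
        });
    }
}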
