Using grok in Flink streaming

The Flink pipeline is as follows:

  1. Read messages (strings) from a Kafka topic.
  2. Match each message against a grok pattern and convert the result to JSON.
  3. Aggregate over a time window on a field extracted from the JSON.

Below is the code for pattern matching using grok.

    SingleOutputStreamOperator<JSONObject> mainStream = messageStream.rebalance()
            .map(new MapFunction<String, JSONObject>() {
                private static final long serialVersionUID = 6;

                @Override
                public JSONObject map(String value) throws Exception {
                    JSONObject logJson = new JSONObject();
                    grok.compile(pattern); // pattern is some pattern defined in the class
                    Match gm = grok.match(value);
                    gm.captures();
                    logJson.putAll(gm.toMap());
                    return logJson;
                }
            });

In the above code, calling grok.compile(pattern) inside the map function works fine. Compiling the pattern outside the map function instead gives the following error:

The implementation of the MapFunction is not serializable

Caused by: java.io.NotSerializableException: com.google.code.regexp.Pattern

Is there any way I could move grok.compile outside the map function? As I understand it, compiling the pattern for every message is unnecessary and might become a bottleneck if the number of messages grows large.

PS: I have imported the package io.thekraken.grok.api.Grok

EDIT:

I looked through the grok implementation, and the Grok class implements Serializable: https://github.com/thekrakken/java-grok/blob/master/src/main/java/io/thekraken/grok/api/Grok.java

Your code does not show where the local variable grok comes from, but:

Flink requires all operators to be Serializable because they might be moved around in a cluster. This also holds true for all members of operators. Can you post a complete non-working example? This might make it easier to see where serialization fails.
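The requirement above can be reproduced with plain Java serialization, which is what Flink uses to ship function objects. The following is only a minimal sketch: `CompiledPattern` is a hypothetical stand-in for a non-serializable class such as com.google.code.regexp.Pattern, and `BrokenMapper` and `findSerializationProblem` are names made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {
    // Hypothetical stand-in for a non-serializable class (assumption, not the real grok type).
    static class CompiledPattern {
        final String source;
        CompiledPattern(String source) { this.source = source; }
    }

    // A serializable function object that holds a non-serializable member.
    static class BrokenMapper implements Serializable {
        private static final long serialVersionUID = 1L;
        final CompiledPattern compiled = new CompiledPattern("%{WORD:word}");
    }

    // Returns the name of the class that blocks serialization, or null if obj serializes cleanly.
    static String findSerializationProblem(Object obj) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return null;
        } catch (NotSerializableException e) {
            // The exception message is the name of the offending class.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(findSerializationProblem(new BrokenMapper()));
        System.out.println(findSerializationProblem("plain string"));
    }
}
```

Running a check like this on the function object before submitting the job can point at the member that blocks serialization, in the same way the `Caused by` line in the question names com.google.code.regexp.Pattern.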

More information about Flink serialization can be found in the Flink documentation at https://flink.apache.org/faq.html#why-am-i-getting-a-nonserializableexception- and https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html

Basically, you can register a Kryo serializer for custom types, or implement (de-)serialization yourself if you need operator members that are not directly serializable.
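As a rough illustration of implementing (de-)serialization yourself, the sketch below keeps only the pattern string in the serialized form, marks the compiled object `transient`, and rebuilds it once in `readObject`. It uses plain Java only; `CompiledPattern`, `RecompilingMapper`, and `roundTrip` are hypothetical names, and the stand-in class is an assumption rather than the real grok type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class RecompilingMapper implements Serializable {
    private static final long serialVersionUID = 1L;

    // Hypothetical stand-in for a compiled, non-serializable matcher (assumption).
    static class CompiledPattern {
        final String source;
        CompiledPattern(String source) { this.source = source; }
    }

    private final String pattern;               // serializable description of the pattern
    private transient CompiledPattern compiled; // skipped by Java serialization

    public RecompilingMapper(String pattern) {
        this.pattern = pattern;
        this.compiled = new CompiledPattern(pattern);
    }

    // Invoked by Java serialization on the receiving side:
    // restore the regular fields, then recompile exactly once.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.compiled = new CompiledPattern(pattern);
    }

    public String compiledSource() {
        return compiled.source;
    }

    // Serializes and deserializes the mapper, roughly what happens when a function is shipped to a worker.
    static RecompilingMapper roundTrip(RecompilingMapper mapper) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(mapper);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (RecompilingMapper) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        RecompilingMapper copy = roundTrip(new RecompilingMapper("%{WORD:word}"));
        System.out.println(copy.compiledSource()); // prints %{WORD:word}
    }
}
```

In a Flink job, a similar effect is commonly achieved by extending RichMapFunction, keeping the non-serializable member in a `transient` field, and building it in `open()`, which runs once per parallel instance on the worker rather than once per message.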

By the way, I think you are right to try to reduce the number of times the pattern is compiled.
