Flink: is SourceFunction<> being replaced in StreamExecutionEnvironment.addSource()?

I ran into this problem while trying to create a custom event source. It contains a queue that lets my other processes add items to it, and I expect my CEP pattern to print some debug messages when there is a match.

But there is never a match, no matter what I add to the queue. I then noticed that the queue inside mySource.run() is always empty, which means the queue I used to create the mySource instance is not the same one that StreamExecutionEnvironment uses. If I make the queue static, forcing all instances to share the same queue, everything works as expected.

DummySource.java

    import java.util.Queue;

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class DummySource implements SourceFunction<String> {

        private static final long serialVersionUID = 3978123556403297086L;
        // Workaround mentioned above: a static queue shared by all instances.
        // private static Queue<String> queue = new LinkedBlockingQueue<String>();
        private Queue<String> queue;
        // volatile because cancel() is called from a different thread than run()
        private volatile boolean cancel = false;

        public void setQueue(Queue<String> q) {
            queue = q;
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            System.out.println("run");
            synchronized (queue) {
                // Busy-polls the queue until cancel() is called or "exit" arrives
                while (!cancel) {
                    String e = queue.poll();
                    if (e != null) {
                        if (e.equals("exit")) {
                            cancel();
                        }
                        System.out.println("collect " + e);
                        ctx.collectWithTimestamp(e, System.currentTimeMillis());
                    }
                }
            }
        }

        @Override
        public void cancel() {
            System.out.println("canceled");
            cancel = true;
        }
    }

So I dug into the source code of StreamExecutionEnvironment. Inside the addSource() method there is a clean() method, which looks like it replaces the instance with a new one.

Returns a "closure-cleaned" version of the given function.

Why is that, and why does it need to be serialized? I have also tried turning off the closure cleaner using getConfig(). The result is still the same: my queue instance is not the one that env is using.

How do I solve this problem?

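The clean() method that Flink applies to functions mainly ensures that the function (such as a SourceFunction or MapFunction) is serializable. Flink serializes these functions and distributes them to the task nodes, where they are executed. The instance that actually runs is therefore a deserialized copy, not the object you created in your main code, and that copy carries its own copy of the queue.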
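The effect can be reproduced without Flink at all. The following is a minimal sketch using only plain Java serialization; QueueHolder and roundTrip are hypothetical names introduced for illustration, and the round trip only approximates what happens when a function is shipped to a task, it is not Flink's actual code path.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.Queue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SerializedCopyDemo {

        // Hypothetical stand-in for a source function that holds a queue field.
        static class QueueHolder implements Serializable {
            private static final long serialVersionUID = 1L;
            final Queue<String> queue;

            QueueHolder(Queue<String> queue) {
                this.queue = queue;
            }
        }

        // Serializes an object to bytes and reads it back, yielding a deep copy.
        @SuppressWarnings("unchecked")
        static <T extends Serializable> T roundTrip(T original)
                throws IOException, ClassNotFoundException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            Queue<String> clientQueue = new LinkedBlockingQueue<>();
            QueueHolder submitted = new QueueHolder(clientQueue);

            // Assumption: this mimics the function being shipped to a task.
            QueueHolder executed = roundTrip(submitted);

            // Items added after the copy was made never reach the copy.
            clientQueue.add("event");

            System.out.println(submitted.queue == clientQueue); // true
            System.out.println(executed.queue == clientQueue);  // false
            System.out.println(executed.queue.size());          // 0
        }
    }
    ```

This would also explain why the static queue appears to work: static fields are not part of the serialized object, so every instance inside the same JVM sees the same queue. That holds only while the job runs in the JVM that fills the queue, as in a local environment; on a real cluster the task runs in another process and the static queue there would stay empty as well.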

Simple variables in your Flink main code, such as an int, can be referenced directly in your function. For large or non-serializable objects, it is better to use broadcast variables and a rich source function. Please refer to https://cwiki.apache.org/confluence/display/FLINK/Variables+Closures+vs.+Broadcast+Variables
