
Apache Flink: stepwise execution

For a performance measurement, I want to execute my Flink program (written in Scala) stepwise, i.e.

execute first operator; materialize result;
execute second operator; materialize result;
...

and so on. The original code:

var filename = new String("<filename>")
var text = env.readTextFile(filename)
var counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts.writeAsText("file://result.txt", WriteMode.OVERWRITE)
env.execute()

So I want the execution of var counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1) to be stepwise.

Is calling env.execute() after every operator the right way to do it?

Or is writing to /dev/null after every operation, i.e. calling counts.writeAsText("file:///home/username/dev/null", WriteMode.OVERWRITE) and then calling env.execute(), a better alternative? And does Flink actually have something like a NullSink for that purpose?

Edit: I'm using the Flink Scala Shell on a cluster and running the above code with parallelism=1.

Flink uses pipelined data transfers by default to improve the performance of job execution. However, you can also force batch data transfer by calling

ExecutionEnvironment env = ...
env.getConfig().setExecutionMode(ExecutionMode.BATCH_FORCED);

This will separate the execution of the two operators, unless they are chained. Chained operators are operators that have the same parallelism and do not require a network shuffle between them; they run in the same task, so this setting cannot separate them. You can get the execution time of each task from the log files or from the web dashboard. Also, you should be aware that batched transfers increase the overall execution time of a program. I don't think it is possible to really separate the execution time of operators in a pipelined data processor.
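For the Scala program from the question, the same setting might look roughly like the sketch below. It assumes the DataSet API; the input path is a placeholder and the job name is made up for illustration. The DiscardingOutputFormat sink is an addition that is not in the original code: it drops every record, which is about as close to a "NullSink" as the DataSet API gets, so no file output is measured at the sink. In the Scala Shell the environment is already provided, so only the setExecutionMode line and the pipeline itself would be needed there.

```scala
import org.apache.flink.api.common.ExecutionMode
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._

object BatchForcedWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Force blocking (batch) data exchanges instead of pipelined ones.
    env.getConfig.setExecutionMode(ExecutionMode.BATCH_FORCED)

    // Placeholder path (assumption): replace with the real input file.
    val text = env.readTextFile("file:///path/to/input.txt")

    val counts = text
      .flatMap { _.toLowerCase.split("\\W+") }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    // Discard the result instead of writing it to a file.
    counts.output(new DiscardingOutputFormat[(String, Int)])

    // The net runtime covers the whole job; per-task times still have to be
    // read from the logs or the web dashboard.
    val result = env.execute("batch-forced word count")
    println(s"Net runtime: ${result.getNetRuntime} ms")
  }
}
```

Note that flatMap and map will most likely still be chained into one task here, so only the boundary at groupBy (which needs a shuffle) is separated.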

Calling execute() after each operator will not work, because Flink does not yet support caching results in memory. So to execute operator 2, you either need to write the result of operator 1 to some persistent storage and read it back, or execute operator 1 again.
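If you want to go the persistent-storage route anyway, a minimal sketch could look like the following. It assumes the DataSet API; all paths and job names are placeholders invented for illustration. The pipeline is split into two jobs, the first one materializes the flatMap output as a text file, and the second one reads that file back and does the rest.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object StepwiseWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Placeholder paths (assumptions): adjust to your file system.
    val input    = "file:///path/to/input.txt"
    val step1Out = "file:///path/to/step1-words"
    val finalOut = "file:///path/to/result"

    // Job 1: run only the flatMap and materialize one word per line.
    env.readTextFile(input)
      .flatMap { _.toLowerCase.split("\\W+") }
      .writeAsText(step1Out, WriteMode.OVERWRITE)
    val r1 = env.execute("step 1: flatMap")
    println(s"Step 1 net runtime: ${r1.getNetRuntime} ms")

    // Job 2: read the materialized words back and run the remaining operators.
    env.readTextFile(step1Out)
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
      .writeAsText(finalOut, WriteMode.OVERWRITE)
    val r2 = env.execute("step 2: map, groupBy, sum")
    println(s"Step 2 net runtime: ${r2.getNetRuntime} ms")
  }
}
```

Keep in mind that each measured runtime then includes the cost of writing and re-reading the intermediate file, so the numbers are not the pure operator times.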


 