
Apache Flink: stepwise execution

For a performance measurement I want to execute my Scala program written for Flink stepwise, i.e.

execute first operator; materialize result;
execute second operator; materialize result;
...

and so on. The original code:

import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

// env is the batch ExecutionEnvironment
val filename = "<filename>"
val text = env.readTextFile(filename)
val counts = text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1)
counts.writeAsText("file://result.txt", WriteMode.OVERWRITE)
env.execute()

So I want the execution of text.flatMap { _.toLowerCase.split("\\W+") }.map { (_, 1) }.groupBy(0).sum(1) to be stepwise.

Is calling env.execute() after every operator the right way to do it?

Or is writing to /dev/null after every operation, i.e. calling counts.writeAsText("file:///home/username/dev/null", WriteMode.OVERWRITE) and then calling env.execute(), a better alternative? And does Flink actually have something like a NullSink for that purpose?

edit: I'm using the Flink Scala Shell on a cluster and running the above code with parallelism=1.

Flink uses pipelined data transfers by default to improve the performance of job execution. However, you can also force batch data transfer by calling:

ExecutionEnvironment env = ...
env.getConfig().setExecutionMode(ExecutionMode.BATCH_FORCED);

This separates the execution of the operators, unless they are chained. You can get the execution time of each task from the log files or from the web dashboard. Note that this does not work for chained operators, i.e., operators that have the same parallelism and do not require a network shuffle. Also be aware that batched transfers increase the overall execution time of a program. I don't think it is possible to truly separate the execution times of operators in a pipelined data processor.
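For the Scala program from the question, the same setting could look roughly like the sketch below. It assumes the DataSet (batch) Scala API and a standalone program rather than the Scala Shell; the object name, the buildEnv helper, and the use of args(0)/args(1) as input and output paths are illustrative choices, not part of the original code.

```scala
import org.apache.flink.api.common.ExecutionMode
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

// Hypothetical wrapper object; only the setExecutionMode call comes from the answer.
object BatchForcedWordCount {

  // Builds an environment that forces batch (blocking) data exchanges.
  def buildEnv(): ExecutionEnvironment = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.getConfig.setExecutionMode(ExecutionMode.BATCH_FORCED)
    env.setParallelism(1) // same parallelism as in the question
    env
  }

  def main(args: Array[String]): Unit = {
    val env = buildEnv()
    val text = env.readTextFile(args(0)) // assumed: input path
    val counts = text
      .flatMap { _.toLowerCase.split("\\W+") }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
    counts.writeAsText(args(1), WriteMode.OVERWRITE) // assumed: output path

    // execute() returns a JobExecutionResult with the runtime of the whole job;
    // per-task times still have to be read from the logs or the web dashboard.
    val result = env.execute("batch-forced word count")
    println(s"net runtime: ${result.getNetRuntime} ms")
  }
}
```

The runtime printed here covers the whole job, so it only complements the per-task numbers from the logs or the dashboard.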

Calling execute() after each operator will not work, because Flink does not yet support caching results in memory. So if you execute operator 2, you either need to write the result of operator 1 to some persistent storage and read it again, or execute operator 1 again.
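If you accept that limitation, the "write to persistent storage and read it again" variant could be sketched as follows. This is only an illustration under several assumptions: the DataSet Scala API, a standalone program, and hypothetical names (StepwiseWordCount, timed, the words and pairs paths under a scratch directory passed as args(1)). The filter on empty tokens is added here so the intermediate files round-trip cleanly and is not in the original pipeline. The last step uses DiscardingOutputFormat, which drops every record and is the closest thing to the NullSink asked about. Each measured time includes reading and writing the intermediate files.

```scala
import org.apache.flink.api.java.io.DiscardingOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

// Hypothetical example object: one Flink job per operator.
object StepwiseWordCount {

  // Runs the sinks defined so far as one job and returns its net runtime in ms.
  def timed(env: ExecutionEnvironment, name: String): Long =
    env.execute(name).getNetRuntime

  def main(args: Array[String]): Unit = {
    val input = args(0) // assumed: input text file
    val tmp = args(1)   // assumed: scratch directory for intermediate results
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Step 1: flatMap only, materialized as one word per line.
    env.readTextFile(input)
      .flatMap { _.toLowerCase.split("\\W+") }
      .filter { _.nonEmpty } // added for the sketch, see lead-in
      .writeAsText(s"$tmp/words", WriteMode.OVERWRITE)
    val t1 = timed(env, "step 1: flatMap")

    // Step 2: re-read the words, map to (word, 1), materialize as CSV.
    env.readTextFile(s"$tmp/words")
      .map { (_, 1) }
      .writeAsCsv(s"$tmp/pairs", writeMode = WriteMode.OVERWRITE)
    val t2 = timed(env, "step 2: map")

    // Step 3: re-read the pairs, aggregate, and discard the result.
    env.readCsvFile[(String, Int)](s"$tmp/pairs")
      .groupBy(0)
      .sum(1)
      .output(new DiscardingOutputFormat[(String, Int)])
    val t3 = timed(env, "step 3: groupBy + sum")

    println(s"flatMap: $t1 ms, map: $t2 ms, groupBy/sum: $t3 ms")
  }
}
```

Because every step pays for file I/O and job start-up, these numbers are an upper bound per operator rather than the time the operator takes inside the pipelined job.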
