flink: job won't run with higher taskmanager.heap.mb

Question

Simple job: kafka->flatmap->reduce->map .

Job runs ok with default value for taskmanager.heap.mb (512Mb). According to the docs : this value should be as large as possible . Since the machine in question has 96Gb of RAM I set this to 75000 (arbitrary value).

Starting job gives this error:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.   
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply$mcV$sp(JobManager.scala:563)   
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (Source: Custom Source (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 95b239d1777b2baf728645df9a1c4232 > in sharing group < SlotSharingGroup [772c9ff1cf0b6cb3a361e3352f75fcee, d4f856f13654f424d7c49d0f00f6ecca, 81bb8c4310faefe32f97ebd6baa4c04f, 95b239d1777b2baf728645df9a1c4232] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:255)
at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:686)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
... 8 more

Restore the default value (512) to this parameter and the job runs ok. At 5000 it works -> at 10000 it doesn't.

What did I miss?

Edit: This is more hit-n-miss than I thought. Setting the value to 50000 and resubmitting gives success. In every test, the cluster is stopped and restarted.

Answer 1

What you are probably experiencing is submitting a job before the workers have registered at the master.

A 5GB JVM heap is initialized fast, and the TaskManager can register almost immediately. For a 70GB heap, the JVM takes a while to initialize and boot. Consequently, the worker registers later, and the job cannot be executed when you submit it, due to a lack of workers.

That is also the reason why it works once you re-submit the job.

JVMs are initialized faster, if you start the cluster in "streaming" mode (standalone via start-cluster-streaming.sh), because then at least Flink's internal memory is initialized lazily.

flink: job won't run with higher taskmanager.heap.mb

Question

1 answers

solution1
3 ACCPTED 2015-11-09 09:27:50

flink: job won't run with higher taskmanager.heap.mb

Question

1 answers

solution1 3 ACCPTED 2015-11-09 09:27:50

solution1
3 ACCPTED 2015-11-09 09:27:50