
How to scale a Flink Job that consumes a huge topic

The setup:

  • Flink version 1.12
  • Deployment on Yarn
  • Programming language: Scala

Flink job:

  • Two input Kafka topics and one output Kafka topic

  • Input1: a large topic with 300K to 500K messages per second. Each message has 600 fields.

  • Input2: a small topic of about 20K messages per second, arriving once per day. Each message has 22 fields.

  • The goal is to enrich Input1 with Input2; the output is a Kafka topic where every message has 100 fields from Input1 and 13 fields from Input2.

  • I keep state from Input2 as a MapState

  • I use RichCoMapFunction to do the mapping

  • This is a snippet from the code where I connect both streams:

     stream1.connect(stream2)
       .keyBy(_.getKey1, _.getKey2)
       .map(new RichCoMapFunction)
  • I use setAutoWatermarkInterval = 300000

  • No checkpoints or savepoints are currently used

Flink Configurations:

  • Number of partitions for Input1 = 120

  • Number of Partitions for Input2 = 30

  • Number of partitions for the output topic = 120

  • Total parallelism = 700

  • Parallelism for Input1 = 120

  • Parallelism for Input2 = 30

  • Join parallelism = 700 (the parallelism used to connect both streams), set as follows:

     stream1.connect(stream2)
       .keyBy(_.getKey1, _.getKey2)
       .map(new RichCoMapFunction)
       .setParallelism(700)
  • jobManagerMemoryFlinkSize:4096m

  • taskManagerMemoryFlinkSize:3072m

  • taskManagerMemoryManagedSize:1b

  • clusterEvenlySpreadOutSlots:true

  • akkaThroughput:1500

Yarn Configurations:

  • yarnSlots = 4

  • yarnjobManagerMemory = 5120m

  • yarntaskManagerMemory = 4096m

  • Total Number of Task Slots = 700

  • Number of Task Managers = 175

Problem: The latency on the output topic is around 30 minutes, which is unacceptable for our use case. I tried many other Flink configurations related to memory allocation and vCores, but it didn't help. It would be great if you have any suggestions on how we can scale to reach higher throughput and lower latency.

EDIT1: The RichCoMapFunction code:

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, StateTtlConfig}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoMapFunction

class Stream1WithStream2CoMapFunction extends RichCoMapFunction[Input1, Input2, Option[Output]] {

  private var input2State: MapState[Long, Input2] = _

  override def open(parameters: Configuration): Unit = {
    // Entries from Input2 expire 3 days after they are created or last updated
    val ttlConfig = StateTtlConfig
      .newBuilder(org.apache.flink.api.common.time.Time.days(3))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .build()

    val mapStateDescriptor = new MapStateDescriptor[Long, Input2]("input2State", classOf[Long], classOf[Input2])
    mapStateDescriptor.enableTimeToLive(ttlConfig)
    input2State = getRuntimeContext.getMapState(mapStateDescriptor)
  }

  override def map1(value: Input1): Option[Output] = {
    // Create a new object of type Output (enrich Input1 with fields from Input2 read from input2State)
  }

  override def map2(value: Input2): Option[Output] = {
    // Put the value into input2State
  }
}

You could use a profiler (or the flame graphs added to Flink 1.13) to try to diagnose why this is running slowly. The backpressure/busy monitoring added in Flink 1.13 would also be helpful.

But my guess is that tremendous effort is going into serde. If you aren't already doing so, you should eliminate all unnecessary fields from stream1 as early as possible in the pipeline, so that the data that won't be used never has to be serialized. For a first pass, you could do this in a map operator chained to the source (at the same parallelism as the source), but a custom serializer will ultimately yield better performance.
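A minimal sketch of that first-pass projection, assuming a hypothetical SlimInput1 case class and hypothetical getter names (getFieldA, getFieldB); the point is that the map runs at the same parallelism as the source so it chains to it, and the dropped fields are never serialized for the network shuffle into the join:

import org.apache.flink.streaming.api.scala._

// Hypothetical projection of Input1 down to only the fields the join and the output actually use
case class SlimInput1(key1: Long, fieldA: String, fieldB: Double /* ... ~100 fields, not 600 */)

val slimStream1: DataStream[SlimInput1] = stream1
  .map(in => SlimInput1(in.getKey1, in.getFieldA, in.getFieldB)) // getFieldA/getFieldB are placeholders
  .setParallelism(120) // same parallelism as the Kafka source, so this map chains to it and avoids a shuffle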

You haven't mentioned the sink, but sinks are often a culprit in these situations. I assume it's Kafka (since you mentioned the output topic), and I assume you're not using Kafka transactions (since checkpointing is disabled). But how is the sink configured?
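For illustration, a rough sketch of a non-transactional Kafka sink with producer-side batching, assuming the Flink Kafka connector; the broker address, topic name, and OutputSerializationSchema are hypothetical placeholders, and the producer properties are standard Kafka producer settings:

import java.util.Properties
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer

val producerProps = new Properties()
producerProps.setProperty("bootstrap.servers", "broker:9092") // placeholder
producerProps.setProperty("batch.size", "262144")             // larger batches amortize per-request overhead
producerProps.setProperty("linger.ms", "50")                  // allow the producer to batch records
producerProps.setProperty("compression.type", "lz4")          // less network/IO at the cost of some CPU

val sink = new FlinkKafkaProducer[Output](
  "output-topic",                       // placeholder topic name
  new OutputSerializationSchema(),      // hypothetical KafkaSerializationSchema[Output]
  producerProps,
  FlinkKafkaProducer.Semantic.NONE)     // no transactions, consistent with checkpointing being disabled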

Why have you set the AutoWatermarkInterval to 300000 if your job isn't using watermarks? If you are using watermarks somewhere, this will add up to 5 minutes of latency. If you're not using watermarks, this setting is meaningless.
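If watermarks are used somewhere, a much smaller interval is the first thing I'd try; a minimal sketch on the StreamExecutionEnvironment (here called env), using 200 ms, which is a commonly used value:

// Emit watermarks every 200 ms instead of every 5 minutes
env.getConfig.setAutoWatermarkInterval(200)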

And why have you set akkaThroughput: 1500 ? This looks suspicious. I would experiment with resetting this to the default value (15).

Is there any other custom tuning, such as network buffering? I would call into question all non-default configuration settings (though I'm sure some are justified, like memory).

I would also set the parallelism for the whole job to a uniform value, e.g., 700. Fine-tuning individual stages of the pipeline is rarely helpful and can be harmful.

How have you set maxParallelism? I would set it to something like 2800 or 3500 so that you have at least 4 or 5 key groups per slot.
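As a sketch of those last two suggestions, set on the StreamExecutionEnvironment (here called env), using the values mentioned above:

// Uniform parallelism for the whole job, instead of per-operator setParallelism calls
env.setParallelism(700)

// maxParallelism bounds the number of key groups; 4-5 key groups per slot leaves room to rebalance keys
env.setMaxParallelism(2800)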

Could it be that a few instances are doing most of the work? You can examine the metrics on the various sub-tasks of the RichCoMapFunction and look for skew, e.g., in numRecordsInPerSecond.
