
How to enrich an event stream with a big file in Apache Flink?

I have a Flink application for click-stream collection and processing. The application consists of a Kafka event source, a map function, and a sink, as shown in the image below:

[Figure: Flink pipeline with Kafka source, map function, and sink]

I want to enrich the incoming click-stream data with the user's IP location, based on the userIp field in the raw event ingested from Kafka.

A simplified slice of the CSV file is shown below:

   start_ip,end_ip,country
   "1.1.1.1","100.100.100.100","United States of America"
   "100.100.100.101","200.200.200.200","China"

I have done some research and found several potential solutions:

1. Solution: Broadcast the enrichment data and connect it with the event stream, applying some IP-matching logic.

1. Result: It worked well for a few sample IP location rows but not with the whole CSV data: the JVM heap reached 3.5 GB, and because broadcast state is always kept on-heap, there is no way to spill it to disk (RocksDB does not back broadcast state). A sketch of this wiring follows.
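
For reference, a minimal sketch of the broadcast connect, assuming a csvRows stream of (start_ip, country) pairs parsed from the file and the Event type from the question; all other names are illustrative:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// One broadcast-state entry per CSV row: start_ip -> country.
val ipRangesDescriptor = new MapStateDescriptor[String, String](
  "ipRanges", classOf[String], classOf[String])

// csvRows: DataStream[(String, String)] of (start_ip, country) pairs (hypothetical)
val enriched = events
  .connect(csvRows.broadcast(ipRangesDescriptor))
  .process(new BroadcastProcessFunction[Event, (String, String), Event] {

    override def processElement(
        event: Event,
        ctx: BroadcastProcessFunction[Event, (String, String), Event]#ReadOnlyContext,
        out: Collector[Event]): Unit = {
      // Exact start_ip match shown for brevity; real code needs range matching.
      val country = Option(ctx.getBroadcastState(ipRangesDescriptor).get(event.userIp))
      out.collect(country.fold(event)(c =>
        event.copy(data = event.data.copy(ipLocation = Some(c)))))
    }

    override def processBroadcastElement(
        row: (String, String),
        ctx: BroadcastProcessFunction[Event, (String, String), Event]#Context,
        out: Collector[Event]): Unit =
      // Broadcast state is replicated on-heap in every parallel instance,
      // which is what exhausted the heap with the full file.
      ctx.getBroadcastState(ipRangesDescriptor).put(row._1, row._2)
  })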

2. Solution: Load the CSV data into state (ValueState) in the open() method of a RichFlatMapFunction, before event processing starts, and enrich the events in the flatMap method.

2. Result: Because the enrichment data is too big to keep in the JVM heap, it is impossible to load it into ValueState. Also, de/serializing a whole map through ValueState is bad practice for data that is key-value in nature.

3. Solution: To avoid the JVM heap constraint, I tried to put the enrichment data into RocksDB (which uses disk) as state, via MapState.

3. Result: Trying to load the CSV file into MapState in the open() method gave me an error saying that you cannot put into MapState there, because open() does not run in a keyed context, as in this question: Flink keyed stream key is null

4. Solution: Because MapState (to use RocksDB) needs a keyed context, I tried to load the whole CSV file into the local RocksDB instance (disk) in a process function, after turning the DataStream into a KeyedStream:

import com.github.tototoshi.csv._ // assumed: scala-csv, which provides CSVParser and defaultCSVFormat
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

import scala.io.Source

class KeyedIpProcess extends KeyedProcessFunction[Long, Event, Event] {

  var ipMapState: MapState[String, String] = _
  var csvFinishedFlag: ValueState[Boolean] = _

  override def processElement(event: Event,
                              ctx: KeyedProcessFunction[Long, Event, Event]#Context,
                              out: Collector[Event]): Unit = {

    val ipDescriptor = new MapStateDescriptor[String, String]("ipMapState", classOf[String], classOf[String])
    val csvFinishedDescriptor = new ValueStateDescriptor[Boolean]("csvFinished", classOf[Boolean])

    ipMapState = getRuntimeContext.getMapState(ipDescriptor)
    csvFinishedFlag = getRuntimeContext.getState(csvFinishedDescriptor)

    // csvFinishedFlag is keyed state, so this load runs once per distinct key,
    // blocking that key's first event on the full file read
    if (!csvFinishedFlag.value()) {
      val csv = new CSVParser(defaultCSVFormat)

      val fileSource = Source.fromFile("/tmp/ip.csv", "UTF-8")
      for (row <- fileSource.getLines()) {
        val Some(List(start, end, country)) = csv.parseLine(row)
        ipMapState.put(start, country) // end_ip is discarded: lookups are exact-match on start_ip
      }
      fileSource.close()
      csvFinishedFlag.update(true)
    }

    out.collect {
      if (ipMapState.contains(event.userIp)) {
        val details = ipMapState.get(event.userIp)
        event.copy(data =
          event.data.copy(
            ipLocation = Some(details)
          ))
      } else {
        event
      }
    }
  }
}

4. Result: It's too hacky, and the blocking file read stalls event processing.

Could you tell me what I can do in this situation?

Thanks

What you can do is implement a custom partitioner and load a slice of the enrichment data into each partition. There's an example of this approach here; I'll excerpt some key portions:

The job is organized like this:

DataStream<SensorMeasurement> measurements = env.addSource(new SensorMeasurementSource(100_000));

DataStream<EnrichedMeasurements> enrichedMeasurements = measurements
  .partitionCustom(new SensorIdPartitioner(), measurement -> measurement.getSensorId())
  .flatMap(new EnrichmentFunctionWithPartitionedPreloading());

The custom partitioner needs to know how many partitions there are, and deterministically assigns each event to a specific partition:

private static class SensorIdPartitioner implements Partitioner<Long> {
    @Override
    public int partition(final Long sensorId, final int numPartitions) {
        return Math.toIntExact(sensorId % numPartitions);
    }
}

And then the enrichment function takes advantage of knowing how the partitioning was done to load only the relevant slice into each instance:

public class EnrichmentFunctionWithPartitionedPreloading extends RichFlatMapFunction<SensorMeasurement, EnrichedMeasurements> {

    private Map<Long, SensorReferenceData> referenceData;

    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        referenceData = loadReferenceData(getRuntimeContext().getIndexOfThisSubtask(), getRuntimeContext().getNumberOfParallelSubtasks());
    }


    @Override
    public void flatMap(
            final SensorMeasurement sensorMeasurement,
            final Collector<EnrichedMeasurements> collector) throws Exception {
        SensorReferenceData sensorReferenceData = referenceData.get(sensorMeasurement.getSensorId());
        collector.collect(new EnrichedMeasurements(sensorMeasurement, sensorReferenceData));
    }

    private Map<Long, SensorReferenceData> loadReferenceData(
            final int partition,
            final int numPartitions) {
        SensorReferenceDataClient client = new SensorReferenceDataClient();
        return client.getSensorReferenceDataForPartition(partition, numPartitions);
    }

}

Note that the enrichment is not being done on a keyed stream, so you cannot use keyed state or timers in the enrichment function.
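
Applied to the question's IP use case, the same idea could partition events by the first octet of userIp, so that each parallel instance owns a contiguous slice of the IPv4 space and preloads only the CSV rows overlapping that slice. A hedged Scala sketch (all names here are mine, not from the example):

import org.apache.flink.api.common.functions.Partitioner

// Route events by the first octet of userIp, so each subtask owns a
// contiguous slice of the IPv4 address space (assumes dotted-quad strings).
class FirstOctetPartitioner extends Partitioner[String] {
  override def partition(userIp: String, numPartitions: Int): Int =
    userIp.takeWhile(_ != '.').toInt % numPartitions
}

// Used while reading the CSV in open(): keep a row only if its
// [start_ip, end_ip] range touches an octet this subtask owns.
// A range spanning several slices is loaded by every subtask it overlaps.
def keepRow(startIp: String, endIp: String,
            subtask: Int, numPartitions: Int): Boolean = {
  val lo = startIp.takeWhile(_ != '.').toInt
  val hi = endIp.takeWhile(_ != '.').toInt
  (lo to hi).exists(_ % numPartitions == subtask)
}

// Wiring, mirroring the example above (enrichment function hypothetical):
// events
//   .partitionCustom(new FirstOctetPartitioner, _.userIp)
//   .flatMap(new IpEnrichmentFunctionWithPartitionedPreloading)

Each subtask then holds roughly 1/parallelism of the file and can run the numeric range lookup sketched in the question over its own slice, without keyed state.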
