
Messages with expired event times causing a Java heap space OutOfMemoryError

I am running a job with an ordinary event-time tumbling window (window size of 1 hour). After it has run long enough, it throws an OutOfMemoryError because the Java heap runs out of space. The unusual thing about the data being processed is that one message has an event time of today at noon, and the next 15k or so are from a week earlier. (The data won't always look like this, but the job should handle it either way.) So the watermark is already well past the event times of those next 15k messages, even after accounting for allowed lateness, and I expected those late messages to be discarded, since the windows they belong to have already closed.
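To make the expectation above concrete, here is a small plain-Java model (not Flink code, no Flink dependency) of the lateness rule an event-time tumbling window applies, as I understand Flink's window operator: a record is dropped only if the watermark has already reached the last timestamp of its window plus the allowed lateness. The class and method names are hypothetical, window offsets are ignored, and all values are assumed to be epoch milliseconds.

```java
/**
 * Hypothetical, simplified model (plain Java, not the Flink API) of the rule an
 * event-time tumbling window uses to decide whether a record is too late.
 * Window offsets are ignored; all values are epoch milliseconds.
 */
final class TumblingLatenessModel {

    private TumblingLatenessModel() {}

    /** Start of the tumbling window that contains the timestamp. */
    static long windowStart(long timestamp, long windowSize) {
        // floorMod keeps the result aligned for negative timestamps as well
        return timestamp - Math.floorMod(timestamp, windowSize);
    }

    /** Largest timestamp that still belongs to the window (the end is exclusive). */
    static long windowMaxTimestamp(long timestamp, long windowSize) {
        return windowStart(timestamp, windowSize) + windowSize - 1;
    }

    /**
     * A record is dropped once the watermark has reached the window's last
     * timestamp plus the allowed lateness; otherwise it is added to window state.
     */
    static boolean isLate(long timestamp, long windowSize,
                          long allowedLateness, long currentWatermark) {
        long maxTs = windowMaxTimestamp(timestamp, windowSize);
        long cleanupTime = maxTs + allowedLateness;
        // Guard against overflow when the lateness is huge
        if (cleanupTime < maxTs) {
            cleanupTime = Long.MAX_VALUE;
        }
        return cleanupTime <= currentWatermark;
    }
}
```

With a one-hour window, one minute of lateness and a watermark already a week ahead, a week-old record counts as late; with the watermark still at Long.MIN_VALUE, the very same record is kept in window state.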

So my question is this: does Flink keep messages that have expired even though no window will use them? Or is this specific to tumbling windows, and is there something else, or some property I should set, to make sure expired data doesn't eat up memory?

Thanks for the help!

EDIT

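import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<OutputObject> outputStream = sourceData
    // Watermark = highest event time seen so far minus 1 minute of out-of-orderness
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Record>(Time.minutes(1)) {
        @Override
        public long extractTimestamp(Record record) {
            // Use the record's own event time (epoch milliseconds) as its timestamp
            return record.eventTimestamp;
        }
    })
    .keyBy("fieldToKeyBy")
    // One-hour event-time tumbling windows; no allowedLateness(...) is set in this snippet
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .apply(new ApplyFunction());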
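The watermark produced by the extractor above can be modeled for a single parallel instance as follows. This is a plain-Java sketch based on my reading of how BoundedOutOfOrdernessTimestampExtractor behaves: the watermark trails the highest timestamp seen so far by the configured bound, starts at Long.MIN_VALUE, and never moves backwards. The class name is hypothetical and the code does not use the Flink API.

```java
/**
 * Hypothetical plain-Java model (no Flink dependency) of a bounded
 * out-of-orderness watermark for ONE parallel instance: the watermark trails
 * the largest timestamp seen so far by a fixed bound and never decreases.
 */
final class BoundedOutOfOrdernessModel {
    private final long maxOutOfOrderness;
    private long currentMaxTimestamp;
    private long lastEmittedWatermark = Long.MIN_VALUE;

    BoundedOutOfOrdernessModel(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
        // Chosen so that the very first watermark is exactly Long.MIN_VALUE
        this.currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMillis;
    }

    /** Called for every record; returns the record's event timestamp. */
    long onRecord(long eventTimestamp) {
        if (eventTimestamp > currentMaxTimestamp) {
            currentMaxTimestamp = eventTimestamp;
        }
        return eventTimestamp;
    }

    /** Called periodically; older records can never pull the watermark back. */
    long currentWatermark() {
        long candidate = currentMaxTimestamp - maxOutOfOrderness;
        if (candidate > lastEmittedWatermark) {
            lastEmittedWatermark = candidate;
        }
        return lastEmittedWatermark;
    }
}
```

In this model, the single noon record moves this instance's watermark to one minute before noon, and the week-old records that follow leave it there; an instance that has not seen the noon record stays at Long.MIN_VALUE.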

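When a source has a parallelism of n, there are n watermarks, one for each of the parallel subtasks. In your case, where the job receives one message timestamped today at noon followed by many events from a week ago, that one message only advances the watermark of the one parallel subtask that reads it; the other n-1 subtasks still have Long.MIN_VALUE as their watermark. An operator that receives input from several parallel upstream subtasks (as each window operator does after keyBy) uses the minimum of its input watermarks as its own, so the window operators' watermark does not advance at all. As a result, those "late" events are not recognized as late: they are assigned to their week-old windows, which cannot fire until the watermark catches up, and they are held in state in the meantime. That buffered window state is a likely cause of the heap exhaustion.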
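How a window operator combines those n watermarks can be sketched in plain Java. This is a simplified, hypothetical model (no Flink dependency, no handling of idle sources): each input channel keeps the latest watermark received on it, and the operator's event-time clock is the minimum across all channels.

```java
import java.util.Arrays;

/**
 * Hypothetical plain-Java model (no Flink dependency) of how a downstream
 * operator, such as a keyed window, combines watermarks from its upstream
 * channels: its event-time clock is the minimum over all channels.
 * Assumes upstreamParallelism > 0; idle-source handling is not modeled.
 */
final class MultiChannelWatermarkModel {
    private final long[] channelWatermarks;

    MultiChannelWatermarkModel(int upstreamParallelism) {
        channelWatermarks = new long[upstreamParallelism];
        // Before any watermark arrives, every channel is at Long.MIN_VALUE
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark arriving on one channel; returns the combined watermark. */
    long onWatermark(int channel, long watermark) {
        // Watermarks only move forward on each channel
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], watermark);
        return combined();
    }

    /** The operator's current watermark: the minimum over all channels. */
    long combined() {
        long min = Long.MAX_VALUE;
        for (long w : channelWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }
}
```

With four upstream subtasks, a noon watermark on one channel leaves the combined watermark at Long.MIN_VALUE; it only moves once every channel has reported, and then only to the smallest of the reported values.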

Note that this could also happen if you had just restored from a checkpoint or savepoint, because watermarks are not stored in checkpoints or savepoints. This means you can't count on message traffic from before the restore to have brought the watermarks up to date.
