Find active users events every 60 seconds in Apache Beam

I am trying to find active users for my game applications. In my use case, the input data source is a Kafka topic with messages like:

{"user_id": 123, "event_id": 1, "utc_ts": 1234345323}
{"user_id": 123, "event_id": 0, "utc_ts": 1234345323}

An event_id of 1 means login and 0 means logout. My task is to run a 60-second window operation and find all the active user records. For example:

| Input activity              | Output                    | Window   |
| user1 -> login -> 12:31:40  | user1 event               | 12:31:59 |
| user2 -> login -> 12:31:42  | user2 event               | 12:31:59 |
| user3 -> login -> 12:32:13  | user1, user2, user3 event | 12:32:59 |
| user2 -> logout -> 12:33:23 | user1, user3 event        | 12:33:59 |

Basically, my output should be all the active user events from the last day. Previously, we achieved this with Spark's updateStateByKey function, which looks at the timestamp: if the event is new, it outputs the old and new events; if it is an older event, it outputs only the old events.

I tried to implement this in Beam, but Beam seems to produce data only for the current batch. The previously active records are not sent to the output.

PCollection<KV<Long, String>> userIdAndEvent = readPCollectionOfUserIdAndEvent();
PCollection<KV<Long, String>> output = userIdAndEvent
    .apply(ParDo.of(new BucketByUserId()))
    .apply(Window.into(new GlobalWindows()));

output.setCoder(KvCoder.of(VarLongCoder.of(), NullableCoder.of(StringUtf8Coder.of())));

output.apply(ParDo.of(new UpdateStateByUserId()))
    .apply(ParDo.of(new WriteToKafka()));

private static class UpdateStateByUserId extends DoFn<KV<Long, String>, String> {
        private static final String EVENT_STATE = "event_state";
        private static final Long ONE_DAY = 86400L;

        @StateId(EVENT_STATE)
        private final StateSpec<ValueState<String>> eventState = StateSpecs.value(NullableCoder.of(StringUtf8Coder.of()));

        @ProcessElement
        public void process(@Element KV<Long, String> in, OutputReceiver<String> out, @StateId(EVENT_STATE) ValueState<String> eventState) {
            String currentEvent = Optional.ofNullable(eventState.read()).orElse("");
            Long userId = in.getKey();
            JSONObject newEvent = (JSONObject) JSONValue.parse(in.getValue());
            JSONObject stateEvent = (JSONObject) JSONValue.parse(currentEvent);
            Long stateEventTs = extractLong(stateEvent, "utc_ts").orElse(0L);
            Optional<Long> eventTs = extractLong(newEvent, "utc_ts");
            if (eventTs.isPresent()) {
                if (eventTs.get() < stateEventTs) {
                    eventState.write(eventState.read());
                } else if (System.currentTimeMillis() / 1000L - eventTs.get() < ONE_DAY) {
                    eventState.write(newEvent.toJSONString());
                    out.output(newEvent.toJSONString());
                } else {
                    Optional<Long> eventId = extractLong(newEvent, "event_id");
                    if (eventId.isPresent() && eventId.get() == getLoginId()) {
                        eventState.write(newEvent.toJSONString());
                        out.output(newEvent.toJSONString());
                    }
                }
            }
        }
}

I'm not sure where to find an answer to this. It is essentially similar to this - finding running total, but I couldn't understand the responses in that post. The difference is that at the end of each window I want to output the running total for all the words, not just the words that arrived in the last window.

The likely reason events are being lost is that there are code paths in your @ProcessElement function that do not call out.output(). If that method is not called, the element will not be output. This code path looks like it would discard late-arriving data, no?

if (eventTs.get() < stateEventTs) {
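
To make that concrete, here is a minimal plain-Java model of the branch structure, with no Beam types involved. The class LateEventSketch and its emitStoredOnLate flag are hypothetical names for illustration, two fields stand in for the ValueState, and the ONE_DAY and login-only branches are left out. With the flag off it mirrors the quoted branch (nothing is emitted for an older event); with the flag on it emits the stored event instead, which is roughly the "output only old events" behaviour you describe from Spark.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the late-event branch in UpdateStateByUserId (no Beam types). */
final class LateEventSketch {
    private final boolean emitStoredOnLate; // hypothetical switch, not part of the original code
    private String storedEvent;             // stands in for the JSON string held in ValueState
    private long storedTs;                  // stands in for the utc_ts of the stored event

    LateEventSketch(boolean emitStoredOnLate) {
        this.emitStoredOnLate = emitStoredOnLate;
    }

    /** Returns whatever would have been passed to out.output() for this element. */
    List<String> process(String event, long eventTs) {
        List<String> out = new ArrayList<>();
        if (storedEvent != null && eventTs < storedTs) {
            // Older than the stored event: the state is left unchanged.
            if (emitStoredOnLate) {
                out.add(storedEvent); // re-emit the old event instead of dropping everything
            }
            return out;
        }
        // Newer (or first) event: replace the state and emit the new event.
        storedEvent = event;
        storedTs = eventTs;
        out.add(event);
        return out;
    }

    public static void main(String[] args) {
        LateEventSketch original = new LateEventSketch(false);
        original.process("login@200", 200L);
        System.out.println(original.process("logout@100", 100L));

        LateEventSketch patched = new LateEventSketch(true);
        patched.process("login@200", 200L);
        System.out.println(patched.process("logout@100", 100L));
    }
}
```

In the real DoFn, the equivalent change would be a call to out.output() inside the first branch; whether that is the behaviour you want for late data is your call.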

More generally, it seems like you are reimplementing Apache Beam's windowing from scratch. Did you try using the built-in windowing functions such as FixedWindows, SlidingWindows, etc.? FixedWindows seems like it would fit your use case quite well. https://beam.apache.org/documentation/programming-guide/#windowing
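
In Beam itself, a 60-second fixed window is applied with Window.into(FixedWindows.of(Duration.standardSeconds(60))). To show what that bucketing does to the data in your table, here is a plain-Java sketch, not Beam code. It assumes windows aligned to multiples of 60 seconds, and it carries the set of active users from one window to the next, since the output you describe needs that carry-over. The names ActiveUsersPerWindow, Event, windowStart and activeUsersByWindow are made up for this example, and timestamps are given as seconds since midnight to match the table.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java simulation of 60-second fixed windows with carried-over active users. */
final class ActiveUsersPerWindow {
    static final long WINDOW_SECONDS = 60L;

    /** One login (eventId 1) or logout (eventId 0) record, as in the question. */
    static final class Event {
        final String userId;
        final int eventId;
        final long utcTs;

        Event(String userId, int eventId, long utcTs) {
            this.userId = userId;
            this.eventId = eventId;
            this.utcTs = utcTs;
        }
    }

    /** Start of the 60-second window that contains utcTs. */
    static long windowStart(long utcTs) {
        return utcTs - Math.floorMod(utcTs, WINDOW_SECONDS);
    }

    /** Maps each non-empty window start to the users still logged in when it closes. */
    static Map<Long, List<String>> activeUsersByWindow(List<Event> events) {
        // Bucket events by window; TreeMap keeps the windows in time order.
        TreeMap<Long, List<Event>> byWindow = new TreeMap<>();
        for (Event e : events) {
            byWindow.computeIfAbsent(windowStart(e.utcTs), k -> new ArrayList<>()).add(e);
        }
        TreeSet<String> active = new TreeSet<>(); // carried across windows
        Map<Long, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<Long, List<Event>> window : byWindow.entrySet()) {
            List<Event> inWindow = window.getValue();
            inWindow.sort(Comparator.comparingLong((Event e) -> e.utcTs));
            for (Event e : inWindow) {
                if (e.eventId == 1) {
                    active.add(e.userId);    // login
                } else {
                    active.remove(e.userId); // logout
                }
            }
            result.put(window.getKey(), new ArrayList<>(active));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("user1", 1, 45100L),  // 12:31:40 login
            new Event("user2", 1, 45102L),  // 12:31:42 login
            new Event("user3", 1, 45133L),  // 12:32:13 login
            new Event("user2", 0, 45203L)); // 12:33:23 logout
        System.out.println(activeUsersByWindow(events));
    }
}
```

For the four rows in your table this yields user1 and user2 for the 12:31 window, all three users for 12:32, and user1 and user3 for 12:33. Note that the sketch produces nothing for a window that contains no events, so emitting active users during quiet minutes would need something extra on top of this.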

Note that you want to window on event time, not processing time. You will need to extract utc_ts from the Kafka message as it is received. Looking at the KafkaIO documentation, this appears to be configurable: https://beam.apache.org/releases/javadoc/2.4.0/org/apache/beam/sdk/io/kafka/KafkaIO.Read.html#withTimestampPolicyFactory-org.apache.beam.sdk.io.kafka.TimestampPolicyFactory-
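
One detail to watch when you do that: utc_ts looks like epoch seconds (your code compares it against System.currentTimeMillis() / 1000L), while Beam element timestamps are org.joda.time.Instant values with millisecond precision. Below is a rough plain-Java sketch of the extraction and unit conversion only. EventTimestampSketch and eventTimeMillis are hypothetical names, the regex is a stand-in for a real JSON parser, and how the result is wired into a KafkaIO timestamp policy is not shown.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts utc_ts (assumed to be epoch seconds) and converts it to epoch milliseconds. */
final class EventTimestampSketch {
    // Matches "utc_ts": <digits>; a stand-in for proper JSON parsing.
    private static final Pattern UTC_TS = Pattern.compile("\"utc_ts\"\\s*:\\s*(\\d+)");

    /** Returns the event time in milliseconds, or empty if the message has no utc_ts. */
    static OptionalLong eventTimeMillis(String json) {
        Matcher m = UTC_TS.matcher(json);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        long seconds = Long.parseLong(m.group(1));
        // multiplyExact throws instead of silently overflowing on absurd inputs.
        return OptionalLong.of(Math.multiplyExact(seconds, 1000L));
    }

    public static void main(String[] args) {
        String message = "{\"user_id\": 123, \"event_id\": 1, \"utc_ts\": 1234345323}";
        System.out.println(eventTimeMillis(message));
    }
}
```

You would also need to decide what to do with messages that have no usable utc_ts, which this sketch simply reports as empty.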
