Analyzing messages from a Kafka consumer

Question

I have built a Kafka consumer-producer system and I need to process the messages it transmits. They are lines from a JSON file, for example:

ConsumerRecord(topic=u'json_data103052', partition=0, offset=676, timestamp=1542710197257, timestamp_type=0, key=None, value='{"Name": "Simone", "Surname": "Zimbolli", "gender": "Other", "email": "zzz@uiuc.edu", "country": "Nigeria", "date": "11/07/2018"}', checksum=354265828, serialized_key_size=-1, serialized_value_size=189)

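I am looking for an easy-to-implement solution to:

Define a streaming window
Analyze the messages in the window (number of distinct users and similar metrics)

Does anyone have a suggestion on how to proceed? Thanks.

I ran into problems using Spark, so I would rather avoid it. I am writing the scripts in Python using Jupyter.

Here is my code: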

from kafka import KafkaConsumer

bootstrap_servers = ['localhost:9092']

# IPython magic: restore the topic name stored by the producer notebook
%store -r topicName
print(topicName)

# Start from the earliest available offset when no committed offset exists
consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers,
                         auto_offset_reset='earliest')
consumer.subscribe([topicName])

# Blocks and prints each ConsumerRecord as it arrives
for message in consumer:
    print(message)

Answer 1

I think you need to use the Kafka Streams API. It provides all the functionality you need for windowing. You can find more information about Kafka Streams here:

https://kafka.apache.org/documentation/streams/

Answer 2

Kafka Streams seems suitable for your case. It supports the following four window types:

Tumbling time window - Time-based: fixed-size, non-overlapping, gap-less windows
Hopping time window - Time-based: fixed-size, overlapping windows
Sliding time window - Time-based: fixed-size, overlapping windows that work on differences between record timestamps
Session window - Session-based: dynamically-sized, non-overlapping, data-driven windows
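
To make the difference between the first two window types concrete, here is a minimal plain-Python sketch of how a record timestamp could be assigned to tumbling and hopping windows. It does not use the Kafka Streams API; the function names `tumbling_window` and `hopping_windows` are hypothetical, and windows are assumed to be aligned to time zero and to cover the half-open range [start, end).

```python
def tumbling_window(ts_ms, size_ms):
    """Return the (start, end) of the single tumbling window containing ts_ms."""
    start = ts_ms - ts_ms % size_ms
    return (start, start + size_ms)


def hopping_windows(ts_ms, size_ms, advance_ms):
    """Return every (start, end) hopping window containing ts_ms, oldest first."""
    windows = []
    # The newest window containing ts_ms starts at the last multiple of
    # advance_ms that is <= ts_ms; older ones start advance_ms earlier each.
    start = ts_ms - ts_ms % advance_ms
    while start >= 0 and start > ts_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= advance_ms
    return sorted(windows)


# A record at t=125 with 60-unit windows:
print(tumbling_window(125, 60))      # (120, 180)
print(hopping_windows(125, 60, 30))  # [(90, 150), (120, 180)]
```

A tumbling window is the special case of a hopping window whose advance equals its size, which is why each record then falls into exactly one window.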

For Python, there is a library: https://github.com/wintoncode/winton-kafka-streams

It may be useful to you.
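
If a full streams library is more than you need, a tumbling-window count of distinct users can also be sketched in plain Python on top of the records your consumer already yields. This is only an illustration under stated assumptions, not the Kafka Streams approach: the function name `distinct_users_per_window`, the one-minute window size, and the choice of the `email` field as the user identity are all assumptions, and `FakeRecord` is a hypothetical stand-in exposing just the `timestamp` and `value` attributes of a `ConsumerRecord` so the example runs without a broker.

```python
import json
from collections import defaultdict, namedtuple

WINDOW_MS = 60 * 1000  # assumed window size: one minute


def distinct_users_per_window(records, window_ms=WINDOW_MS):
    """Map each window start (ms) to the number of distinct emails seen in it."""
    users = defaultdict(set)
    for record in records:
        payload = json.loads(record.value)  # value holds one JSON line
        # Tumbling window: round the record timestamp down to a window boundary
        window_start = record.timestamp - record.timestamp % window_ms
        users[window_start].add(payload["email"])  # assumption: email identifies a user
    return {start: len(emails) for start, emails in sorted(users.items())}


# Hypothetical stand-in for ConsumerRecord, used only for this demonstration
FakeRecord = namedtuple("FakeRecord", ["timestamp", "value"])

sample = [
    FakeRecord(0, '{"email": "a@example.com"}'),
    FakeRecord(30000, '{"email": "a@example.com"}'),
    FakeRecord(59999, '{"email": "b@example.com"}'),
    FakeRecord(60000, '{"email": "c@example.com"}'),
]
print(distinct_users_per_window(sample))  # {0: 2, 60000: 1}
```

Because iterating over a live `KafkaConsumer` blocks indefinitely by default, a real version would need either a `consumer_timeout_ms` setting so the loop ends, or logic that emits a window's count once a record from a later window arrives.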

Answer 1
1 | Accepted | 2018-11-20 12:32:32

Answer 2
1 | 2018-11-20 13:08:18
