Kafka producer and consumer not working properly on python with docker
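I am working on a project that uses a Kafka producer and consumer to fetch articles (on specific topics) from news_api every two hours and then save them to MongoDB with the consumer.
So I made three classes: one for the KafkaAdminClient, one for the KafkaProducer, and one for the KafkaConsumer.
My Kafka server runs in a Docker container. The main application is a Flask app, and that is where I start all the threads, including the Kafka ones.
I have been trying to change a lot of small things, but it seems very unstable and I don't know why. First, the data arrives at the consumer, and eventually at MongoDB, at random times. Also, the old topics in the consumer don't get deleted, and the database keeps filling up with both new and old values.
Now that I have put a group in the consumer and added the KafkaAdminClient class, I don't receive any messages in the consumer at all. All I get is: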
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic health is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic business is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic war is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic motorsport is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic sources is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic science is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic technology is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic education is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic space is not available during auto-create initialization
articleretrieval-flask_api-1 | INFO:kafka.consumer.subscription_state:Updated partition assignment: []
articleretrieval-flask_api-1 | INFO:kafka.conn:<BrokerConnection node_id=bootstrap-0 host=kafka:29092 <connected> [IPv4 ('172.19.0.4', 29092)]>: Closing connection.
kafkaConsumerThread.py:
import json
import time

from kafka import KafkaConsumer
from kafka.errors import NoBrokersAvailable


class KafkaConsumerThread:
    def __init__(self, topics, db, logger):
        self.topics = topics
        self.db = db
        self.logger = logger

    def start(self):
        self.logger.debug("Getting the kafka consumer")
        consumer = None
        # Retry until a broker is reachable; otherwise `consumer` would be undefined below.
        while consumer is None:
            try:
                consumer = KafkaConsumer(bootstrap_servers=['kafka:29092'],
                                         auto_offset_reset='earliest',
                                         # group_id='my_group',
                                         enable_auto_commit=False,
                                         value_deserializer=lambda x: json.loads(x.decode('utf-8')))
            except NoBrokersAvailable as err:
                self.logger.error("Unable to find a broker: {0}".format(err))
                time.sleep(1)
        consumer.subscribe(self.topics + ["sources"])
        for message in consumer:
            self.logger.debug(message)  # a logger object is not callable, so use a log method
            if message.topic == "sources":
                self.db.insert_source_info(message.value["source_name"], message.value["source_info"])
            else:
                self.db.insert_article(message.topic, [message.value])
import json
import time

from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable


def on_send_success(record_metadata):
    return
    # print(record_metadata.topic)
    # print(record_metadata.partition)


def on_send_error(excp):
    print(excp)


def call_apis(self, topics, news_api, media_api):
    producer = None  # stays None if no broker is reachable
    try:
        producer = KafkaProducer(bootstrap_servers=['kafka:29092'],
                                 max_block_ms=100000,
                                 value_serializer=lambda x: json.dumps(x).encode('utf-8'))
    except NoBrokersAvailable as err:
        # self.logger.error("Unable to find a broker: {0}".format(err))
        time.sleep(1)
    if producer is None:
        self.logger.error("Unable to send message. The producer does not exist.")
        return
    domains = []
    for topic in topics:
        articles = news_api.get_articles(topic)
        for article in articles:
            # Collect each distinct, non-empty source so its info can be sent afterwards
            if article['source'] != '':
                if article['source'] not in domains:
                    domains.append(article['source'])
            producer.send(topic, value=article).add_callback(on_send_success).add_errback(on_send_error)
    producer.flush()
    for domain in domains:
        source_info = media_api.get_source_domain_info(domain)
        if source_info:
            producer.send("sources", value={"source_name": domain, "source_info": source_info}).add_callback(on_send_success).add_errback(on_send_error)
    # Flush the producer to ensure all messages are sent
    producer.flush()
from threading import Timer


class KafkaProducerThread:
    def __init__(self, topics, logger):
        self.topics = topics
        self.news_api = NewsApi()
        self.media_api = MediaWikiApi()
        self.logger = logger

    def start(self):
        # Call the APIs immediately when the thread starts
        call_apis(self, self.topics, self.news_api, self.media_api)
        # Use a timer to schedule the next API call (7200 s = 2 hours)
        timer = Timer(7200, self.start)
        timer.start()
kafkaAdminClient.py:
from kafka.admin import KafkaAdminClient, NewTopic


class KafkaAdminThread:
    def __init__(self, topics):
        self.topics = topics

    def start(self):
        admin_client = KafkaAdminClient(
            bootstrap_servers=['kafka:29092'],
            client_id='my_client'
        )
        topic_list = []
        for topic in self.topics:
            topic_list.append(NewTopic(name=topic, num_partitions=1, replication_factor=1))
        admin_client.create_topics(new_topics=topic_list, validate_only=False)
app.py:
import threading
import time
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    # Creating a new connection with mongo
    # threading.Thread(target=lambda: app.run(port=8080, host="0.0.0.0",debug=True,use_reloader=False)).start()
    executor = ThreadPoolExecutor(max_workers=4)
    producerThread = KafkaProducerThread(TOPICS, logging)
    adminThread = KafkaAdminThread(TOPICS)
    executor.submit(adminThread.start)
    flaskThread = threading.Thread(target=lambda: app.run(port=8080, host="0.0.0.0", debug=True, use_reloader=False))
    # Start the thread directly: submit() expects a callable, not the result of start()
    flaskThread.start()
    time.sleep(15)
    executor.submit(producerThread.start)
    consumerThread = KafkaConsumerThread(TOPICS, db, logging)
    executor.submit(consumerThread.start)
docker-compose.yml:
zookeeper:
  image: wurstmeister/zookeeper
  ports:
    - "2181:2181"
kafka:
  container_name: kafka_broker_1
  image: wurstmeister/kafka
  links:
    - zookeeper
  ports:
    - "9092:9092"
    - "29092:29092"
  depends_on:
    - zookeeper
  environment:
    KAFKA_ADVERTISED_HOSTNAME: kafka
    KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:29092,OUTSIDE://localhost:9092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
    KAFKA_LISTENERS: INSIDE://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
    KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
flask_api:
  build:
    context: .  # Very important: it sets where the root will be for the build.
    dockerfile: Dockerfile
  links:
    - kafka
  environment:
    - FLASK-KAFKA_BOOTSTRAP-SERVERS=kafka:29092
    - SERVER_PORT=8080
  ports:
    - "8080:8080"
  depends_on:
    - kafka
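"the old topics in the consumer don't get deleted"
Nothing in the code you've shown deletes topics. The only way they would be deleted is if the Kafka container restarts, since you haven't mounted volumes for Kafka or Zookeeper to persist them.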
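"and the database keeps filling up with both new and old values."
I assume your producer isn't keeping track of which sources it has read so far? If so, you will end up with duplicates in the topic. I suggest using kafka-console-consumer to debug whether the producer is really working the way you want.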
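One way to stop the same articles being re-sent on every two-hour run is to remember a key for each article that has already been produced. The sketch below is only an illustration: `filter_new_articles` is a hypothetical helper, and the `url` field is an assumption about what the news API returns (the question only shows a `source` field). The set lives in memory, so it is lost when the process restarts.

```python
def filter_new_articles(articles, seen_keys, key="url"):
    """Return only the articles whose key has not been seen before.

    `seen_keys` is a set that is updated in place. The "url" default is an
    assumption; use whichever field uniquely identifies an article.
    Articles without the key are passed through, since they cannot be compared.
    """
    fresh = []
    for article in articles:
        identifier = article.get(key)
        if identifier is None:
            fresh.append(article)
        elif identifier not in seen_keys:
            seen_keys.add(identifier)
            fresh.append(article)
    return fresh
```

In `call_apis`, the inner loop could then iterate over `filter_new_articles(news_api.get_articles(topic), seen_keys)`, with `seen_keys` kept on the producer object between runs.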
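Likewise, you have disabled consumer auto-commit, and I don't see any code committing manually, so when the consumer restarts it will reprocess any existing data in the topic. The group/AdminClient settings shouldn't affect this, but setting a group will allow you to actually maintain offset tracking.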
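A minimal sketch of what manual committing could look like with kafka-python, assuming a group id is set. The group name `article-writer` and the function names are made up for illustration; `handle` stands for whatever writes the record to MongoDB.

```python
import json


def build_consumer(topics, servers=("kafka:29092",), group_id="article-writer"):
    """Create a consumer that belongs to a group and commits manually."""
    from kafka import KafkaConsumer  # imported lazily; requires kafka-python

    return KafkaConsumer(
        *topics,
        bootstrap_servers=list(servers),
        group_id=group_id,             # offsets are stored per group (name is an assumption)
        auto_offset_reset="earliest",  # only applies when the group has no committed offset
        enable_auto_commit=False,
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )


def consume(consumer, handle):
    """Process each record, then commit so a restart does not replay it."""
    for message in consumer:
        handle(message)
        consumer.commit()  # synchronous commit of the offsets consumed so far
```

Committing after every record is the simplest form; committing once per batch would reduce round trips to the broker, at the cost of replaying a few records after a crash.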
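As for stability, I have used Flask with Kafka before without threads, and it worked fine. At least, for a producer... My suggestion is to make a completely separate container for the consumer that is responsible for writing to the database. You don't need the overhead of the Flask framework for that. Alternatively, I recommend using the Kafka Connect Mongo sink.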
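A separate consumer container could run a plain script along the lines of the following sketch, with no Flask involved. The routing mirrors the consumer loop in the question; the `KAFKA_BOOTSTRAP_SERVERS` variable, the `article-writer` group name, and the `route`/`run` function names are assumptions made for illustration, and `db` is whatever object exposes the question's `insert_article` and `insert_source_info` methods.

```python
import json
import os


def route(message, db):
    """Write one Kafka record to the database, as in the question's consumer."""
    if message.topic == "sources":
        db.insert_source_info(message.value["source_name"],
                              message.value["source_info"])
    else:
        db.insert_article(message.topic, [message.value])


def run(db, topics):
    """Blocking consumer loop, meant to be the container's only process."""
    from kafka import KafkaConsumer  # imported lazily; requires kafka-python

    # Hypothetical environment variable; falls back to the address used in the question.
    servers = os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "kafka:29092").split(",")
    consumer = KafkaConsumer(
        *topics, "sources",
        bootstrap_servers=servers,
        group_id="article-writer",  # assumed group name
        auto_offset_reset="earliest",
        # enable_auto_commit is left at its default (True) in this sketch
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        route(message, db)
```

The Flask container would then keep only the producer and the HTTP endpoints, and the two services could be restarted independently.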
By the way, the wurstmeister container supports creating topics on its own via an environment variable (KAFKA_CREATE_TOPICS).