
Kafka producer and consumer not working properly in Python with Docker

I am working on a project that uses a Kafka producer and consumer to fetch articles (on specific topics) from news_api every two hours and then save them to MongoDB with the consumer.

So I made three classes: one for KafkaAdminClient, one for KafkaProducer and one for KafkaConsumer.

My Kafka server runs in a Docker container. The main application is a Flask app, and that is where I start all the threads, including the Kafka ones.

I have been trying to change lots of small things, but it all seems very unstable and I don't know why. First, the data reaches the consumer (and eventually MongoDB) at random times. Then, old topics are not removed from the consumer, and the database keeps getting filled with both old and new values.

Now that I have put a group in the consumer and added the KafkaAdminClient class, I don't receive messages in the consumer at all. What I get is:

articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic health is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic business is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic war is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic motorsport is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic sources is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic science is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic technology is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic education is not available during auto-create initialization
articleretrieval-flask_api-1 | WARNING:kafka.cluster:Topic space is not available during auto-create initialization
articleretrieval-flask_api-1 | INFO:kafka.consumer.subscription_state:Updated partition assignment: []
articleretrieval-flask_api-1 | INFO:kafka.conn:<BrokerConnection node_id=bootstrap-0 host=kafka:29092 <connected> [IPv4 ('172.19.0.4', 29092)]>: Closing connection.

kafkaConsumerThread.py:


import json
import time

from kafka import KafkaConsumer
from kafka.errors import NoBrokersAvailable


class KafkaConsumerThread:
    def __init__(self, topics, db, logger):
        self.topics = topics
        self.db = db
        self.logger = logger

    def start(self):
        self.logger.debug("Getting the kafka consumer")
        try:
            consumer = KafkaConsumer(bootstrap_servers=['kafka:29092'],
                                     auto_offset_reset='earliest',
                                     #  group_id='my_group',
                                     enable_auto_commit=False,
                                     value_deserializer=lambda x: json.loads(x.decode('utf-8')))
        except NoBrokersAvailable as err:
            self.logger.error("Unable to find a broker: {0}".format(err))
            time.sleep(1)
            return  # bail out instead of using an undefined consumer below

        consumer.subscribe(self.topics + ["sources"])

        # Route each message either to the sources collection or to its topic's article collection
        for message in consumer:
            self.logger.debug(message)
            if message.topic == "sources":
                self.db.insert_source_info(message.value["source_name"], message.value["source_info"])
            else:
                self.db.insert_article(message.topic, [message.value])
        



import json
import time
from threading import Timer

from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable

# NewsApi and MediaWikiApi are the project's own API wrappers, imported elsewhere.


def on_send_success(record_metadata):
    return
    # print(record_metadata.topic)
    # print(record_metadata.partition)

def on_send_error(excp):
    print(excp)

def call_apis(self, topics, news_api, media_api):
    producer = None  # so the check below works even if the broker is unavailable
    try:
        producer = KafkaProducer(bootstrap_servers=['kafka:29092'],
                                 max_block_ms=100000,
                                 value_serializer=lambda x: json.dumps(x).encode('utf-8'))
    except NoBrokersAvailable as err:
        # self.logger.error("Unable to find a broker: {0}".format(err))
        time.sleep(1)

    domains = []
    try:
        if producer:
            for topic in topics:
                articles = news_api.get_articles(topic)
                for article in articles:
                    if article['source'] != '':
                        if article['source'] not in domains:
                            domains.append(article['source'])

                        producer.send(topic, value=article).add_callback(on_send_success).add_errback(on_send_error)
                        producer.flush()
            for domain in domains:
                source_info = media_api.get_source_domain_info(domain)
                if source_info:
                    producer.send("sources", value={"source_name": domain, "source_info": source_info}).add_callback(on_send_success).add_errback(on_send_error)

                    # Flush the producer to ensure all messages are sent
                    producer.flush()
    except AttributeError:
        self.logger.error("Unable to send message. The producer does not exist.")



class KafkaProducerThread:
    def __init__(self, topics,logger):
        self.topics = topics
        self.news_api = NewsApi()
        self.media_api = MediaWikiApi()
        self.logger = logger

    def start(self):
        # Call the APIs immediately when the thread starts
        call_apis(self, self.topics, self.news_api, self.media_api)

        # Use a timer to schedule the next API call
        timer = Timer(7200, self.start)
        timer.start()

kafkaAdminClient.py:


from kafka.admin import KafkaAdminClient, NewTopic


class KafkaAdminThread:
    def __init__(self, topics):
        self.topics = topics

    def start(self):
        admin_client = KafkaAdminClient(
            bootstrap_servers=['kafka:29092'],
            client_id='my_client'
        )
        # One single-partition, unreplicated topic per configured subject
        topic_list = []
        for topic in self.topics:
            topic_list.append(NewTopic(name=topic, num_partitions=1, replication_factor=1))
        admin_client.create_topics(new_topics=topic_list, validate_only=False)

app.py:

if __name__ == "__main__":
    # Creating a new connection with mongo
    # threading.Thread(target=lambda: app.run(port=8080, host="0.0.0.0",debug=True,use_reloader=False)).start()
    executor = ThreadPoolExecutor(max_workers=4)
    producerThread = KafkaProducerThread(TOPICS, logging)
    adminThread = KafkaAdminThread(TOPICS)
    executor.submit(adminThread.start)
    flaskThread = threading.Thread(target=lambda: app.run(port=8080, host="0.0.0.0", debug=True, use_reloader=False))
    executor.submit(flaskThread.start)  # submit the callable itself, not the result of calling it
    time.sleep(15)  # give the admin client time to create the topics before producing
    executor.submit(producerThread.start)
    consumerThread = KafkaConsumerThread(TOPICS, db, logging)
    executor.submit(consumerThread.start)

docker-compose.yml:

  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"

  kafka:
    container_name: kafka_broker_1
    image: wurstmeister/kafka
    links:
      - zookeeper
    ports:
      - "9092:9092"
      - "29092:29092"
    depends_on:
      - zookeeper
    environment:
      KAFKA_ADVERTISED_HOSTNAME: kafka
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:29092,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  flask_api:
    build:
      context: . # Very important: this sets where the root of the build context is.
      dockerfile: Dockerfile
    links:
        - kafka
    environment:
      - FLASK-KAFKA_BOOTSTRAP-SERVERS=kafka:29092
      - SERVER_PORT=8080
    ports:
      - "8080:8080"
    depends_on:
      - kafka

Old topics in the consumer do not get deleted

Nothing in the code you have shown is deleting topics. The only way they would be removed is if the Kafka container restarted, since you are not mounting volumes for Kafka or Zookeeper to persist them.
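
A minimal sketch of persisting the broker's data (assuming you pin the data path with KAFKA_LOG_DIRS; the volume name kafka_data is just illustrative) would be something like:

  kafka:
    # ...existing settings...
    environment:
      KAFKA_LOG_DIRS: /kafka/data          # pin the broker log directory
    volumes:
      - kafka_data:/kafka/data             # named volume so topic data survives container restarts

volumes:
  kafka_data: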

and the database keeps getting populated with new and old values.

I am assuming your producer isn't tracking which sources have been read so far? If so, you will end up with duplicates in the topics. I would suggest using kafka-console-consumer to debug whether the producer is actually working the way you want.
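
For instance, something along these lines should print everything currently sitting in one topic (assuming the container_name kafka_broker_1 from your compose file; if the script is not on the container's PATH, it lives in the Kafka installation's bin/ directory):

docker exec -it kafka_broker_1 kafka-console-consumer.sh --bootstrap-server kafka:29092 --topic war --from-beginning
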
Similarly, you have disabled auto commits on the consumer, and I don't see any code that commits manually, so when a consumer restarts it will reprocess any existing data in the topics. The group / AdminClient setup shouldn't affect that, but setting a group will allow you to actually maintain offset tracking.
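
For example, a minimal sketch of that loop with a group and an explicit commit after each batch has been written to the database (using kafka-python, your existing db object, and a made-up group name) could look like this:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers=['kafka:29092'],
    group_id='article_consumers',      # hypothetical group name; any stable string works
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))
consumer.subscribe(["health", "business", "war", "sources"])  # or self.topics + ["sources"]

while True:
    # poll() returns {TopicPartition: [messages]} for everything fetched in this round
    batch = consumer.poll(timeout_ms=1000)
    for tp, messages in batch.items():
        for message in messages:
            if message.topic == "sources":
                db.insert_source_info(message.value["source_name"], message.value["source_info"])
            else:
                db.insert_article(message.topic, [message.value])
    if batch:
        # synchronous commit: the group's offsets now point past this batch,
        # so a restarted consumer will not reprocess it
        consumer.commit()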

As for stability, I have used Flask and Kafka without threads before, and they worked fine. At least, a single producer... My suggestion would be to make a completely separate container for the consumer that is responsible for writing to the database; you don't need the overhead of the Flask framework for that. Alternatively, a Kafka Connect Mongo sink is the recommended approach.
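
As a rough sketch (the service name, Dockerfile, and entrypoint below are only placeholders), the consumer could run as its own compose service next to flask_api:

  news_consumer:
    build:
      context: .
      dockerfile: Dockerfile.consumer    # hypothetical image that only contains the consumer code
    command: python consumer_main.py     # hypothetical entrypoint: KafkaConsumerThread + MongoDB, no Flask
    environment:
      - KAFKA_BOOTSTRAP_SERVERS=kafka:29092
    depends_on:
      - kafka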

As an aside, the wurstmeister containers support creating topics on their own via environment variables.
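
If I recall the image correctly, KAFKA_CREATE_TOPICS takes a comma-separated list of name:partitions:replicas entries, so something like this on the kafka service would replace the KafkaAdminClient thread entirely:

  kafka:
    environment:
      KAFKA_CREATE_TOPICS: "health:1:1,business:1:1,war:1:1,motorsport:1:1,sources:1:1,science:1:1,technology:1:1,education:1:1,space:1:1"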
