
Dump the kafka (kafka-python) output to a txt file

I need to periodically dump the output of a Kafka consumer into an Excel file. I use the following code:

from kafka import KafkaConsumer
from xlutils.copy import copy
from xlrd import open_workbook

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
consumer.subscribe(["test"])  # subscribe expects a list of topics

rowx = 0
colx = 0

for msg in consumer:
        book_ro = open_workbook("twitter.xls")
        book = copy(book_ro)  # creates a writeable copy
        sheet1 = book.get_sheet(0)  # get the first sheet
        sheet1.write(rowx, colx, msg.value)  # msg.value is the record value (same as msg[6])
        rowx += 1  # advance the row so messages are not overwritten
        book.save("twitter.xls")

Now, my issue is that this code is inefficient: for each message I have to open, write to, and then save the Excel file. Is there a way to open the Excel file once, write a batch of messages, and then close it (rather than doing all of this inside the for loop)? Thanks.

Yes, opening, writing, saving, and closing the file for every message is inefficient; you can do it in batches. But you still need to do it inside the consuming loop.

msg_buffer = []
buffer_size = 100
rowx = 0
colx = 0

for msg in consumer:
        msg_buffer.append(msg.value)
        if len(msg_buffer) >= buffer_size:
            book_ro = open_workbook("twitter.xls")
            book = copy(book_ro)  # creates a writeable copy
            sheet1 = book.get_sheet(0)  # get the first sheet once per batch
            for _msg in msg_buffer:
                sheet1.write(rowx, colx, _msg)
                rowx += 1  # advance the row so each message gets its own cell
            book.save("twitter.xls")
            msg_buffer = []

You can expect this to be roughly buffer_size (here 100) times faster than the unbatched version, since the file is opened and saved once per batch instead of once per message.
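As an aside (not part of the original answer): since the title mentions a txt file, if plain text/CSV output is acceptable you can avoid reopening the workbook entirely by appending each batch to a CSV file, which Excel can open directly. A minimal sketch — `flush_batch` and the file name `twitter.csv` are illustrative names, and it assumes message values arrive as bytes, as kafka-python delivers them by default:

```python
import csv


def flush_batch(path, msg_buffer):
    """Append a batch of Kafka message values to a CSV file, one row per message."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for value in msg_buffer:
            # kafka-python delivers values as bytes unless a value_deserializer is set
            text = value.decode("utf-8") if isinstance(value, bytes) else value
            writer.writerow([text])


# Simulated batch; in the real loop this would be msg_buffer from the consumer
flush_batch("twitter.csv", [b"hello", b"world"])
```

Appending to a text file is cheap, so this scales better than rewriting an .xls workbook for every batch.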

UPDATE (in response to a comment):

Yes, usually we stay in this loop forever. Internally it uses poll to fetch new messages, send heartbeats, and commit offsets. If your aim is to consume messages from this topic and save them, it should be a long-running loop.

This is by kafka-python's design: you should consume messages with a loop like this, or call consumer.poll() explicitly.

As for why you can write for msg in consumer: — the consumer is an iterator object; its class implements __iter__ and __next__, and underneath it uses a fetcher to fetch records. For more implementation details see https://github.com/dpkp/kafka-python/blob/master/kafka/consumer/group.py
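To illustrate the iterator protocol the answer refers to, here is a simplified stand-in (not kafka-python's actual implementation): any object whose class implements __iter__ and __next__ can be consumed with a plain for loop, which is exactly what makes for msg in consumer: work.

```python
class ToyConsumer:
    """A toy 'consumer' that drains a buffered list via the iterator protocol."""

    def __init__(self, records):
        self._records = list(records)

    def __iter__(self):
        # Returning self makes the object its own iterator
        return self

    def __next__(self):
        # A real consumer would poll its fetcher here; we just drain a list
        if not self._records:
            raise StopIteration
        return self._records.pop(0)


consumed = [msg for msg in ToyConsumer(["a", "b", "c"])]
print(consumed)  # ['a', 'b', 'c']
```

When the buffer is exhausted, __next__ raises StopIteration and the for loop ends; kafka-python instead blocks and keeps fetching, which is why its loop runs forever.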
