Messages are not sent to Kafka from PySpark

I want to send data to Kafka using the Python Kafka connector. Everything works fine when I run the code from the pyspark shell. However, when I run it with spark-submit, the messages are not sent. There are no errors in the logs and the program execution appears to succeed, but the messages never reach Kafka.

import json
import datetime
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='XXX.XX.XXX.XXX:9092')
end = datetime.datetime.now().isoformat()
country = "es"
message = {'country': country, 'end': end, 'status': '1'}
msg = json.dumps(message)
print(msg)
producer.send('testtopic', msg)

I do not understand why this happens. Below are the spark-submit parameters:

spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 11g \
--driver-cores 3 \
--num-executors 6 \
--executor-memory 6g \
--executor-cores 2 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.broadcastTimeout=1500 \
--queue t1 \
s3://my-test-bucket/test1/test.py

I had to call producer.flush() after producer.send('testtopic', msg). Only then are the messages delivered to the Kafka topic when I run the code with spark-submit; otherwise, the messages are not sent.

Curiously, producer.flush() is not needed when the code is executed from the pyspark shell.
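For reference, a minimal sketch of the fix, showing only the tail of the script above with the explicit flush added:

producer.send('testtopic', msg)
# Block until all buffered records have actually been delivered to the broker.
# A short-lived spark-submit job can otherwise exit before the producer's
# background sender thread has transmitted anything.
producer.flush()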

The producer polls for a batch of messages from the batch queue, one batch per partition. A batch is ready when one of the following is true:

batch.size is reached. Note: larger batches typically have better compression ratios and higher throughput, but they have higher latency.

linger.ms (time-based batching threshold) is reached. Note: there is no simple guideline for setting linger.ms values; you should test settings on specific use cases. For small events (100 bytes or less), this setting does not appear to have much impact.

Another batch for the same broker is ready.

The producer calls flush() or close().
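These thresholds can be tuned on the producer itself. A minimal sketch, assuming the kafka-python client used above, whose constructor exposes them as the batch_size and linger_ms keyword arguments (the values and broker address below are purely illustrative):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='XXX.XX.XXX.XXX:9092',
    batch_size=32768,   # maximum bytes per partition batch (batch.size)
    linger_ms=50,       # wait up to 50 ms for a batch to fill (linger.ms)
)

producer.send('testtopic', b'{"status": "1"}')
# Even with batching thresholds configured, a short-lived script should
# flush (or close) so any partially filled batch is sent before exit.
producer.flush()
producer.close()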
