How to read JSON strings from a Kafka topic into a PySpark dataframe?

Question

I am trying to read JSON messages from a Kafka topic into a PySpark dataframe. My first attempt was this:

import json
from uuid import uuid4

from kafka import KafkaConsumer

# TOPIC_NAME, BOOTSTRAP_SERVER, sc and sqlContext are defined elsewhere
consumer = KafkaConsumer(TOPIC_NAME,
                         consumer_timeout_ms=9000,
                         bootstrap_servers=BOOTSTRAP_SERVER,
                         auto_offset_reset='earliest',
                         enable_auto_commit=True,
                         group_id=str(uuid4()),
                         value_deserializer=lambda x: x.decode("utf-8"))
message_lst = []
for message in consumer:
    message_str = message.value.replace('\\"', "'").replace("\n", "").replace("\r", "")
    message_dict = json.loads(message_str)
    message_lst.append(message_dict)

messages_json = sc.parallelize(message_lst)
messages_df = sqlContext.read.json(messages_json)

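I would like to know whether there is a way to get the same dataframe with Spark Structured Streaming or something similar. Can anyone help?

UPD: My attempt with Structured Streaming was this: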

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", f"{BOOTSTRAP_SERVER}") \
    .option("subscribe", TOPIC_NAME) \
    .load()

It exits with the following error:

pyspark.sql.utils.AnalysisException: Failed to find data source: Kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".

UPD: I read the guide mentioned in the exception text. It suggests installing the library "spark-sql-kafka-0-10_2.12", but I could not find it. Does anyone know where to get it?

UPD 2: I managed to add the required package and tried to read the messages from Kafka:

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", f"{BOOTSTRAP_SERVER}") \
    .option("subscribe", TOPIC_NAME) \
    .load()
df.writeStream.outputMode("append").format("console").start().awaitTermination()

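I am using the same consumer as before. The problem here is that it only reads messages written after the start() call. How can I read all the messages that have been written up to a given moment and get a dataframe as the result? Also, could anyone give an example of a schema for load_json()? Sorry if my question is a silly one, but I could not find any examples in Python.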

Answer 1

As stated in the main documentation, you are missing the Kafka package:

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 ...
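If you start the job with plain `python` rather than `spark-submit`, the same dependency can usually be requested through the `spark.jars.packages` setting instead. A minimal sketch, assuming a Scala 2.12 build of Spark; the helper names `kafka_package` and `build_session` are made up for this example, and the setting only takes effect if no Spark session is already running:

```python
def kafka_package(spark_version, scala_version="2.12"):
    """Build the Maven coordinate of the Kafka source for a given Spark version."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


def build_session(app_name="kafka-json-reader"):
    """Create a SparkSession that pulls in the matching Kafka package."""
    # Imported lazily so kafka_package() can be used without Spark installed.
    import pyspark
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Assumption: the installed PySpark version equals the Spark version in use.
        .config("spark.jars.packages", kafka_package(pyspark.__version__))
        .getOrCreate()
    )


if __name__ == "__main__":
    # Prints org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
    print(kafka_package("3.1.2"))
```

Deriving the coordinate from `pyspark.__version__` is one way to keep the package version and the Spark version from drifting apart.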

Make sure the 3.1.2 in the package coordinate matches your own Spark version.
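On the other two points in the question: a streaming query starts from the latest offsets by default, which is why only messages written after start() show up. A batch read (`spark.read` instead of `spark.readStream`) goes from the earliest to the latest offset and gives back an ordinary dataframe. Here is a rough sketch, assuming the messages are flat JSON objects; the fields `id` and `name` are hypothetical placeholders, and the helper names are made up for this example:

```python
def kafka_batch_options(bootstrap_servers, topic):
    """Options for a one-off (batch) read of everything currently in the topic."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        "endingOffsets": "latest",
    }


# Hypothetical schema as a DDL string; replace the fields with those of your messages.
MESSAGE_SCHEMA = "id INT, name STRING"


def read_topic_as_dataframe(spark, bootstrap_servers, topic, schema=MESSAGE_SCHEMA):
    """Read the topic once and expand each JSON value into columns."""
    # Imported lazily so the option helper can be used without Spark installed.
    from pyspark.sql.functions import col, from_json

    raw = (
        spark.read
        .format("kafka")
        .options(**kafka_batch_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers `value` as bytes: cast to string, then parse with the schema.
    parsed = raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    return parsed.select("data.*")
```

With the names from the question this would be called as `messages_df = read_topic_as_dataframe(spark, BOOTSTRAP_SERVER, TOPIC_NAME)`. If you want to stay with streaming, adding `.option("startingOffsets", "earliest")` to the `readStream` chain should make a newly started query replay the messages already in the topic.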

Solution 1
1 2021-06-07 03:49:58

如何从kafka主题中读取json字符串到pyspark dataframe？

问题描述

1 个解决方案

解决方案1 1 2021-06-07 03:49:58

解决方案1
1 2021-06-07 03:49:58