
Spark Structured Streaming application reading from Kafka returns only null values

I plan to extract the data from Kafka using Spark Structured Streaming, but I got only empty data.

# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, from_json
from pyspark.sql.types import StringType, StructType

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("pyspark_structured_streaming_kafka") \
        .getOrCreate()

    df_raw = spark.read \
        .format("kafka") \
        .option("kafka.bootstrap.servers","52.81.249.81:9092") \
        .option("subscribe","product") \
        .option("kafka.ssl.endpoint.identification.algorithm","") \
        .option("kafka.isolation.level","read_committed") \
        .load()

    df_raw.printSchema()

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", StringType()) \
        .add("yield_time", StringType()) 

    df_1=df_raw.selectExpr("CAST(value AS STRING)") \
               .select(from_json("value",product_schema).alias("data")) \
               .select("data.*") \
               .write \
               .format("console") \
               .save()

My test data is the following:

{
  "product_name": "X Laptop",
  "product_factory": "B-3231",
  "yield_num": 899,
  "yield_time": "20210201 22:00:01"
}

But the result is not what I expected:

./spark-submit ~/Documents/3-Playground/kbatch.py
+------------+---------------+---------+----------+
|product_name|product_factory|yield_num|yield_time|
+------------+---------------+---------+----------+
|        null|           null|     null|      null|
|        null|           null|     null|      null|

The test data was published with the command:

./kafka-producer-perf-test.sh --topic product --num-records 90000000 --throughput 5 --producer.config ../config/producer.properties --payload-file ~/Downloads/product.json

If I cut away some of the code, like this:

df_1=df_raw.selectExpr("CAST(value AS STRING)") \
               .writeStream \
               .format("console") \
               .outputMode("append") \
               .option("checkpointLocation","file:///Users/picomy/Kafka-Output/checkpoint") \
               .start() \
               .awaitTermination() 

The result is the following:

Batch: 3130
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|    "yield_time":...|
|    "product_name...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_num": ...|
|    "yield_num": ...|
|    "product_fact...|
|    "product_fact...|
|    "product_name...|
|    "product_fact...|
|    "product_name...|
|                   }|
|    "yield_time":...|
|    "product_name...|
|                   }|
|    "product_fact...|
|    "yield_num": ...|
|    "product_fact...|
|    "yield_time":...|
|    "product_name...|
+--------------------+

I don't know the root cause of the problem.

There are a few things causing your code not to work correctly:

  • Wrong schema (the field yield_num is an integer/long)
  • Using writeStream instead of just write (if you want streaming)
  • Starting the streaming query and calling awaitTermination on it
  • The data in your JSON file should be stored on one line only (see the sketch after this list)
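
As an illustration of the last point: kafka-producer-perf-test.sh sends the payload file line by line, so a pretty-printed JSON object is split across several Kafka records and from_json then returns null for every fragment. A minimal sketch for rewriting the payload so that each record sits on a single line (the file names here are only examples) could look like this:

import json

# Hypothetical file names: read the pretty-printed test object and rewrite it
# so the whole record ends up on a single line, which is what the producer
# payload file needs.
with open("product_pretty.json") as src:
    record = json.load(src)          # the multi-line test object shown above

with open("product.json", "w") as dst:
    dst.write(json.dumps(record))    # one record per line, no embedded newlines
    dst.write("\n")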

You can replace parts of your code with the following snippet:

from pyspark.sql.types import StringType, StructType, LongType

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType())

    df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
               .select(from_json("value", product_schema).alias("data")) \
               .select("data.*") \
               .writeStream \
               .format("console") \
               .start() \
               .awaitTermination()
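
For completeness, a minimal end-to-end sketch of the streaming version is shown below, reusing the broker address, topic and checkpoint path from the question. Note that writeStream only works on a streaming DataFrame, so df_raw has to be created with spark.readStream rather than spark.read:

# -*- coding: utf-8 -*-
# Minimal streaming sketch: read from Kafka with readStream, parse the JSON
# value, and print the parsed columns to the console sink.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StringType, StructType, LongType

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName("pyspark_structured_streaming_kafka") \
        .getOrCreate()

    # Streaming source: spark.readStream instead of spark.read,
    # otherwise writeStream below would fail.
    df_raw = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "52.81.249.81:9092") \
        .option("subscribe", "product") \
        .load()

    # yield_num is numeric in the payload, so use LongType here.
    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", LongType()) \
        .add("yield_time", StringType())

    # Kafka delivers the payload in the binary `value` column: cast it to a
    # string, parse it with the schema, and flatten the struct into columns.
    query = df_raw.selectExpr("CAST(value AS STRING)") \
        .select(from_json("value", product_schema).alias("data")) \
        .select("data.*") \
        .writeStream \
        .format("console") \
        .outputMode("append") \
        .option("checkpointLocation", "file:///Users/picomy/Kafka-Output/checkpoint") \
        .start()

    query.awaitTermination()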
